May 20 2007

Statistics 101

Published by Wendi at 8:34 pm under statistics

Hello & Welcome!

There are a lot of web analytics packages on the market, and some even come with *advanced* metrics built right into the interface. You can think of these metrics as statistical measurements that explain user behavior and can even predict what users will do next. But what do they all mean, and how can I get the best insights from what I have access to?

In this blog I want to take the statistician’s approach to web analytics and discuss how anyone can leverage data currently available to dig deeper and pull out amazing things. I want to look beyond the basics of web analytics and develop a way of thinking that is beyond the canned reports included in the packages.

Stat 101

In this first post I’d like to cover a few very basic metrics that every analyst should embrace and that future posts will build on. In most introductory statistics courses, one of the first things you learn is the types of methodologies, one being Descriptive Statistics. In descriptive statistics you look at the data set to describe what it looks like and to identify its basic trends. This methodology has several steps: collecting the data, summarizing the data, understanding the underlying distribution, and visually displaying the data in a graph. I won’t go into detail on all these steps in this post; rather, I’ll focus on the summary statistics used in this process. To summarize the data you look to understand the location, dispersion and shape of the distribution.

Six basic statistical measures used to summarize the location of the data set, including its central tendency, are:

· Minimum = Smallest value in the data set; X(1) where {X(1), X(2), …, X(n)} is the ordered data set

· 1st Quartile (25th Percentile) = X.25; data value at the boundary of 25% of the data

· Mean (average) = (X1 + X2 + … + Xn) / n; the sum of all values divided by the number of values

· Median (50th Percentile) = if n is odd then X((n+1)/2); if n is even then (X(n/2) + X(n/2+1))/2; data value halfway through the ordered data

· 3rd Quartile (75th Percentile) = X.75; data value at the boundary of 75% of the data

· Maximum = Largest value in the data set; X(n) where {X(1), X(2), …, X(n)} is the ordered data set
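For readers who like to see definitions as code, here is a minimal Python sketch of the six measures. The function names `percentile` and `summarize` are my own illustrative choices, and the quartiles use linear interpolation between neighboring ordered values, which is only one of several common conventions, so a given analytics package may report slightly different quartiles.

```python
def percentile(ordered, p):
    """Value at fraction p (0..1) of an ascending list, by linear interpolation."""
    if not ordered:
        raise ValueError("need at least one value")
    pos = p * (len(ordered) - 1)          # fractional index into the ordered data
    lower = int(pos)                      # index of the value just below pos
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower                    # how far pos sits between the two values
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def summarize(values):
    """Return the six location measures for a non-empty list of numbers."""
    ordered = sorted(values)              # X(1) <= X(2) <= ... <= X(n)
    return {
        "minimum": ordered[0],
        "q1": percentile(ordered, 0.25),
        "mean": sum(ordered) / len(ordered),
        "median": percentile(ordered, 0.50),
        "q3": percentile(ordered, 0.75),
        "maximum": ordered[-1],
    }


summary = summarize([5, 1, 4, 2, 3])
```

With an even number of values, the interpolation at p = 0.50 lands halfway between the two middle values, which matches the median definition in the list.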

Application

Now that you know a few new metrics to look at, like the 1st and 3rd quartiles, why would you look at them and what are they telling you? For example, let’s take “Time on Page”. Say you have a sample data set that looks like the following:

Day Time on Page
1 2.45
2 1.75
3 2.66
4 4.98
5 1.69
6 1.89
7 2.33
8 2.48
9 2.15
10 2.01

Summary Statistics (quartiles by linear interpolation between the ordered values; other conventions differ slightly):

Minimum 1.69
Q1 1.92
Mean 2.44
Median 2.24
Q3 2.47
Maximum 4.98
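If you want to compute these yourself, one possible way is Python's standard `statistics` module (the `quantiles` function needs Python 3.8 or later). This is a sketch, not the method any particular analytics package uses: the variable names are mine, and `method="inclusive"` is an assumption that selects linear interpolation across the observed values. Other quartile conventions will give somewhat different Q1 and Q3 values for the same data.

```python
import statistics

# Sample "Time on Page" values for Days 1-10, copied from the table above.
time_on_page = [2.45, 1.75, 2.66, 4.98, 1.69, 1.89, 2.33, 2.48, 2.15, 2.01]

# quantiles() sorts the data itself; n=4 returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(time_on_page, n=4, method="inclusive")

summary = {
    "Minimum": min(time_on_page),
    "Q1": q1,
    "Mean": statistics.mean(time_on_page),
    "Median": median,
    "Q3": q3,
    "Maximum": max(time_on_page),
}

for name, value in summary.items():
    print(f"{name:<8}{value:.2f}")
```

The same few lines work unchanged on a data set 100 times larger, which is where a quick glance at the raw numbers stops being practical.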

As you inspect the data it may be easy enough to see the large spike in time on page on Day 4; but imagine this data set was real and had 100 times the volume. Then a quick glance at the data isn’t so easy any more. This is where summary statistics come in handy.

Looking at the summary statistics, the first thing that should come to mind is ‘why is the maximum so high compared to the other values?’ This should lead to a deeper inspection of the site metric and a review of any major changes to explain the large spike on Day 4. Maybe there was a special campaign, press release, or release of new features that caused this large value. In any case, knowing why there was a large jump in a particular metric can build a path to better insights. This can even bring up the notion of ‘outliers’ but I’ll leave that for another discussion.

Till next time… I wish you safe analyzing.

7 Responses to “Statistics 101”

  1. judah on 20 May 2007 at 8:56 pm

    Welcome to the blogosphere! As a fellow stats geek, I am very much looking forward to reading your future posts. In fact, I’m excited to see what you’ll choose to discuss!

    Best,
    Judah

  2. Jacques Warren on 23 May 2007 at 5:17 am

    Hi Wendi,

    welcome aboard. There is a need for deep statistical discussions in the web analytics blogosphere. I look forward to your posts.

    Please, contact me, I would like to discuss a little technical matter.

  3. Sébastien Brodeur on 23 May 2007 at 8:45 am

    Why the median is needed? What does it tell me?

  4. Wendi on 23 May 2007 at 9:57 am

    Hi Sebastien, The median is a good measure of the center of your data. Many people believe that the mean (average) is the center but if your data is left or right skewed this will give a false sense of centrality. It is a good practice to compare median vs. mean to see where the mean is in relation.
    ex. If you know that half of your visitors (median point) are staying on your site 1.5 mins or less but your average is 5 mins this may cause additional inspection into your data to understand why there is such a huge shift between these two metrics. It could be that there were one or two visitors that stayed on your site for a large amount of time for some reason…. a reason you might want to know.
    Thanks for the question and let me know if I didn’t answer it to what you expected.
    Regards, Wendi

  5. Wendi on 23 May 2007 at 2:03 pm

    Hi Jacques, We obviously both have great taste in picking templates. :-)

    Thanks for contacting me.

    Wendi

  6. Jacques Warren on 23 May 2007 at 2:34 pm

    Well, now I want your new one ! :-)

    I am going to LOVE reading your blog. As an avid reader, I would really appreciate if you could recommend 3 or 4 good books on stats. Maybe on your next post? I did some courses at university, but since I graduated in sociology back in 1985, I’m kinda rusty with all that stuff.

  7. Sébastien Brodeur on 29 May 2007 at 12:09 pm

    Thanks Wendi, this is helping me to demystify the two.

    I will second Jacques (hello Jacques by the way) and ask the same thing, any good book you can suggest?
