Monday, March 15, 2010

Measure of Center: Mean Vs. Median

So now we know that there are two common numerical measures of the center of a distribution: the mean and the median.
The mean describes the center as an average value, so the actual values of the data points play an important role: we sum the values and then divide by the count (the number of values). The median, on the other hand, locates the middle value as the center, and the order of the data is the key to finding it (here we only sort and count).
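
The two procedures just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the sample numbers are made up for this example, and averaging the two middle values for an even count is the usual convention rather than something stated above.

```python
def mean(values):
    """Sum the values, then divide by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Sort the values, then take the middle one.

    With an even count there is no single middle value, so this sketch
    averages the two middle values (the usual convention).
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


sample = [5, 1, 9, 3, 7]  # made-up numbers, used only for illustration
print(mean(sample))    # 5.0
print(median(sample))  # 5
```

Notice that mean() uses every value's magnitude, while median() only cares about where each value falls in the sorted order.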

Let's understand the difference between the two with an example.
Assume we have the following two data sets:
Data set A -> 64 65 66 68 70 71 73
Data set B -> 64 65 66 68 70 71 730
Please observe that only the last value differs between the two sets, i.e., 73 in data set A becomes 730 in data set B.

For data set A, the mean is 68.1 (477 / 7) and the median is 68. Now look at data set B: comparing the two data sets, we can see that the observation 730 is far larger than the rest and is certainly an outlier. For data set B the median is still 68, but the mean is influenced by the high outlier and shifts up to 162 (1134 / 7). The message that we should take from this example is: the mean is very sensitive to outliers (as it factors in their magnitude), while the median is resistant to outliers.
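
These figures are easy to check. The sketch below uses Python's standard statistics module; the variable names are chosen just for this example.

```python
from statistics import mean, median

data_a = [64, 65, 66, 68, 70, 71, 73]
data_b = [64, 65, 66, 68, 70, 71, 730]  # same as A, except the last value

mean_a, median_a = mean(data_a), median(data_a)
mean_b, median_b = mean(data_b), median(data_b)

print(f"A: mean = {mean_a:.1f}, median = {median_a}")  # A: mean = 68.1, median = 68
print(f"B: mean = {mean_b:.1f}, median = {median_b}")  # B: mean = 162.0, median = 68
```

Changing one value from 73 to 730 moves the mean by almost 94, while the median does not move at all.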

Types of distributions and the mean and median:
Let's see what happens to the mean and median for our 3 basic distribution shapes.
  • Symmetric distributions with no outliers: Mean (X̄) is approximately equal to median (M).
  • Skewed right distributions and/or datasets with high outliers: Mean (X̄) is typically greater than median (M). (X̄ > M)
  • Skewed left distributions and/or datasets with low outliers: Mean (X̄) is typically less than median (M). (X̄ < M)
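
To see the three cases side by side, here is a small sketch using Python's standard statistics module. The three data sets are made up purely to illustrate each shape, and the helper name relation is hypothetical.

```python
from statistics import mean, median

# Made-up data sets, chosen only to illustrate each shape.
shapes = {
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "skewed right": [1, 2, 2, 3, 3, 4, 20],       # one high value pulls the mean up
    "skewed left": [1, 17, 18, 18, 19, 19, 20],   # one low value pulls the mean down
}


def relation(values):
    """Describe how the mean compares with the median for one data set."""
    m, med = mean(values), median(values)
    if m > med:
        return "mean > median"
    if m < med:
        return "mean < median"
    return "mean = median"


for name, values in shapes.items():
    print(f"{name}: mean = {mean(values)}, median = {median(values)}, {relation(values)}")
```

For real data the symmetric case will rarely give exactly equal values, which is why the first bullet says "approximately equal".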


Let's Summarize
  • The two main numerical measures for the center of a distribution are the mean (X̄) and the median (M). The mean is the average of the values, while the median is the middle value.
  • The mean is very sensitive to outliers (as it factors in their magnitude), while the median is resistant to outliers.
  • The mean is an appropriate measure of center only for symmetric distributions with no outliers. In all other cases, the median should be used to describe the center of the distribution.

2 comments:

Unknown said...

v nice explanation of mean vs median.
esp the fact that you used a very large number and showed how mean can vary from median is commendable..

I look forward to more articles from you on stats

Cheero !
Ankit Arora
www.ankitarora.net

Nginx said...

Excellent article, very clearly explained with diagrams etc. Thank you.