NUMBAT OER - Open Educational Resources

2. Mean of a set of observations of a continuous variable

The simplest descriptive statistic is the arithmetic mean, or simply the 'mean'. It is the sum of a series of observations of a continuous variable (i.e. one where any value can occur) divided by the number of observations. If a sample consists of n observations x1 ... xn, then the mean is calculated as:

mean x = Σ( x1 ... xn)/n

where the expression Σ( x1 ... xn) is the sum of all of the n observations in the series x1 ... xn. (If this notation is difficult for you, refer to the helpsheets on variables and parameters and equations.)

Here is an example data set, with a sample comprising ten observations:

Observation     1     2     3     4     5     6     7     8     9    10
Value        1.69  1.55  2.36  1.73  0.89  1.39  1.79  2.58  1.21  2.10

Σ( x1 ... xn) = 17.29

n = 10

mean x = Σ( x1 ... xn)/n = 17.29/10 = 1.73 (to two decimal places)
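
For readers who want to check the arithmetic, the calculation above can be sketched in Python. This is only an illustration: the variable names (values, total, n, mean_x) are chosen here and are not part of the helpsheet.

```python
# Observations from the example data set above
values = [1.69, 1.55, 2.36, 1.73, 0.89, 1.39, 1.79, 2.58, 1.21, 2.10]

total = sum(values)    # Σ( x1 ... xn)
n = len(values)        # number of observations
mean_x = total / n     # mean x = Σ( x1 ... xn)/n

# Report the results to two decimal places
print(f"sum = {total:.2f}, n = {n}, mean = {mean_x:.2f}")
```

The printed values should agree with the sum, n and mean shown in the worked example.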

We introduced the mean as the simplest descriptive statistic, suggesting that there are other ways to represent the 'average' for a sample. So what determines when it is appropriate to calculate a mean value, and more importantly when not to? Consider a second set of observations:

Observation     1     2     3     4     5     6     7     8     9    10
Value        1.37  1.45  1.23  1.67  3.19  1.39  1.41  1.27  2.10  4.24

If we group observations into intervals of 0.5, six of the ten fall in the interval 1.0 to 1.49. Yet if we calculate a mean value using the formula given above, the value is 1.93, which falls outside the interval containing most of the observations. This is because three values are greater than 2, with the highest being 4.24. If we plot the distribution of these observations, it is clear that the high values have a disproportionate effect on the calculation of the mean:

[Figure: frequency bar chart of the second set of observations]
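
The grouping into intervals of 0.5 and the comparison with the mean can be reproduced with a short Python sketch. The names used (width, counts, modal_low, trimmed) are hypothetical, and leaving out the values above 2 is done purely to show their influence on the mean, not as a recommended way of handling data.

```python
from collections import Counter

# Second set of observations from the table above
values = [1.37, 1.45, 1.23, 1.67, 3.19, 1.39, 1.41, 1.27, 2.10, 4.24]
width = 0.5  # interval width used in the text

# Interval index for each value: 2 -> 1.0 to 1.49, 3 -> 1.5 to 1.99, ...
counts = Counter(int(v // width) for v in values)

# Text version of the frequency bar chart (one '#' per observation)
for k in range(min(counts), max(counts) + 1):
    low = k * width
    print(f"{low:.1f} to {low + width - 0.01:.2f}  {'#' * counts[k]}")

mean_x = sum(values) / len(values)

# Interval that contains the most observations
modal_index, modal_count = counts.most_common(1)[0]
modal_low = modal_index * width
inside = modal_low <= mean_x < modal_low + width
print(f"mean = {mean_x:.2f}, lies in the fullest interval: {inside}")

# Illustration only: the mean of the observations that are not greater than 2
trimmed = [v for v in values if v <= 2]
print(f"mean without the values above 2 = {sum(trimmed) / len(trimmed):.2f}")
```

Six observations fall in the interval 1.0 to 1.49, while the mean of all ten lies outside it. Without the three values above 2, the mean of the remaining seven observations falls back inside that interval.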

Such a distribution of observations is termed 'skewed': most of the observations are clustered in one part of the range, but a few points form a 'tail' to one side or the other of the main group. This contrasts with a symmetrical distribution, such as the classical normal distribution, where most of the points cluster close to the middle of the overall range, and outliers are distributed equally on either side.

This demonstrates that the mean is an ideal representation of the average value for a group of observations with a normal distribution, or a good approximation to it, but fails to represent the average value for the sample if the distribution of observations is skewed or otherwise markedly different from normality.

The mean of a set of observations in a spreadsheet can be calculated easily using the 'AVERAGE' function. The expression '=AVERAGE(C1:C9)' calculates the mean of the values in cells C1 to C9 of column C, whilst '=AVERAGE(B5:F5)' calculates the mean of the values in cells B5 to F5 of row 5.
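
Outside a spreadsheet, most programming languages offer an equivalent function. As a minimal sketch, assuming the observations are held in a Python list rather than in spreadsheet cells, the standard library function statistics.mean plays a similar role to 'AVERAGE'. The list name column_c is hypothetical and is filled here with the first example data set.

```python
import statistics

# Hypothetical stand-in for a range of spreadsheet cells,
# holding the first example data set from this helpsheet
column_c = [1.69, 1.55, 2.36, 1.73, 0.89, 1.39, 1.79, 2.58, 1.21, 2.10]

# statistics.mean returns the arithmetic mean of the values in the list
average = statistics.mean(column_c)
print(f"mean = {average:.2f}")
```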