Pooled variances
1. Pooled variances
Experimenters in general are well advised to subdivide lengthy measurements into a number of shorter ones, whenever feasible. This gives them the possibility, should anything have gone wrong, at least to be aware of the problem and to know roughly when it occurred. If the results do not indicate an anomaly, the partial results must be combined in such a way that the outcome characterizes the entire measuring period.
Problems with recent measurements have led us to reconsider the question of how experimental values for the variance, obtained in sequences of runs, should be used to arrive at a “best” overall estimate.
This problem occurs rather frequently and, no doubt, has been treated many times. A few years ago we looked at the special case of two samples [1]. In what follows these earlier findings are generalized to include an arbitrary number m of samples. It is shown that the contributions arising from the differences between the individual mean values can be incorporated in those terms that involve only the measured variances.
Suppose that we have to deal with a situation which is “under statistical control”. What we actually require is that the first two moments of the number of events, counted in a given time interval, show no significant trend with time.
Now consider measurements that have been performed on $m$ samples, of size $n_j$, taken randomly from some stable population. Let the results be available in the form of mean values $\bar{x}_j$ and variances $s_j^2$ (for a single measurement), which have been obtained by forming from the measurements $x_{jk}$ the quantities
$$ \bar{x}_j \;=\; \frac{1}{n_j} \sum_{k=1}^{n_j} x_{jk} \,, \qquad j = 1, 2, \ldots, m, \qquad\qquad (1) $$
and
$$ s_j^2 \;=\; \frac{1}{n_j - 1} \sum_{k=1}^{n_j} \left( x_{jk} - \bar{x}_j \right)^2 . \qquad\qquad (2) $$
This allows us to form the weighted mean value
$$ \bar{\bar{x}} \;=\; \frac{1}{N} \sum_{j=1}^{m} n_j\, \bar{x}_j \,, $$
where
$$ N \;=\; \sum_{j=1}^{m} n_j $$
is the total number of measurements performed. Note that all the variances $s_j^2$ of Equation (2) refer to a single measurement $x_{jk}$, not to a mean value $\bar{x}_j$.
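As a purely illustrative aside, the short Python sketch below sets up the quantities defined so far for simulated data; the population parameters, the sample sizes and all variable names (samples, xbar_j, s2_j, N, xbarbar) are assumptions made here for demonstration only and are not part of the original note.

```python
# Illustrative sketch of Equations (1)-(2) and of the weighted mean, for
# simulated data; the population and sample sizes are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(seed=1)

# m samples of (possibly different) sizes n_j, drawn from one stable population
samples = [rng.normal(loc=100.0, scale=3.0, size=n) for n in (8, 12, 20)]

n_j    = np.array([len(x) for x in samples])        # sample sizes n_j
xbar_j = np.array([x.mean() for x in samples])      # Equation (1): sample means
s2_j   = np.array([x.var(ddof=1) for x in samples]) # Equation (2): variance of a single measurement

N       = n_j.sum()                                 # total number of measurements
xbarbar = (n_j * xbar_j).sum() / N                  # weighted mean value
```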
Our problem is to evaluate, on the basis of the data given in Equation (1) and Equation (2), a value for the variance of a single measurement, denoted by $s^2$. According to the definition of a variance, we have, considering all the measurements performed, the estimation
$$ s^2 \;=\; \frac{1}{N-1} \sum_{j=1}^{m} \sum_{k=1}^{n_j} \left( x_{jk} - \bar{\bar{x}} \right)^2 . \qquad\qquad (3) $$
This expression will now be rearranged. In a first step we easily find
$$ s^2 \;=\; \frac{1}{N-1} \sum_{j=1}^{m} \left[ \sum_{k=1}^{n_j} \left( x_{jk} - \bar{x}_j \right)^2 \;+\; 2 \left( \bar{x}_j - \bar{\bar{x}} \right) \sum_{k=1}^{n_j} \left( x_{jk} - \bar{x}_j \right) \;+\; n_j \left( \bar{x}_j - \bar{\bar{x}} \right)^2 \right] . \qquad\qquad (4) $$
Since according to Equation (1)
$$ \sum_{k=1}^{n_j} \left( x_{jk} - \bar{x}_j \right) \;=\; 0 \,, $$
we arrive, by using Equation (2), at the form
$$ s^2 \;=\; \frac{1}{N-1} \left[ \sum_{j=1}^{m} (n_j - 1)\, s_j^2 \;+\; \sum_{j=1}^{m} n_j \left( \bar{x}_j - \bar{\bar{x}} \right)^2 \right] . \qquad\qquad (5) $$
This is a useful identity. For the special case of $m = 2$ samples, it follows from Equation (1) that
$$ \bar{x}_1 - \bar{\bar{x}} \;=\; \frac{n_2}{N} \left( \bar{x}_1 - \bar{x}_2 \right) , $$
and similarly
$$ \bar{x}_2 - \bar{\bar{x}} \;=\; \frac{n_1}{N} \left( \bar{x}_2 - \bar{x}_1 \right) . $$
Hence, the relation (5) takes the form
$$ s^2 \;=\; \frac{1}{N-1} \left[ (n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2 + \frac{n_1 n_2}{N} \left( \bar{x}_1 - \bar{x}_2 \right)^2 \right] , \qquad\qquad (6) $$
in agreement with a result given in [1].
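The identity (5) and its two-sample form (6) can be checked numerically. The following sketch, which reuses the illustrative simulated set-up introduced above (names and data are again assumptions for demonstration), simply confirms that both expressions reproduce the direct estimate (3) to rounding accuracy.

```python
# Numerical check that the identities (5) and (6) agree exactly with the
# direct definition (3); data and names are illustrative, as before.
import numpy as np

rng = np.random.default_rng(seed=2)
samples = [rng.normal(50.0, 2.0, size=n) for n in (7, 11)]   # m = 2 samples
n_j     = np.array([len(x) for x in samples])
xbar_j  = np.array([x.mean() for x in samples])
s2_j    = np.array([x.var(ddof=1) for x in samples])
N       = n_j.sum()
xbarbar = (n_j * xbar_j).sum() / N

# Equation (3): variance of a single measurement from all N values together
all_x     = np.concatenate(samples)
s2_direct = ((all_x - xbarbar) ** 2).sum() / (N - 1)

# Equation (5): the same quantity from the sample variances and sample means
s2_id = (((n_j - 1) * s2_j).sum() + (n_j * (xbar_j - xbarbar) ** 2).sum()) / (N - 1)

# Equation (6): special form for two samples
s2_two = ((n_j[0] - 1) * s2_j[0] + (n_j[1] - 1) * s2_j[1]
          + n_j[0] * n_j[1] / N * (xbar_j[0] - xbar_j[1]) ** 2) / (N - 1)

print(s2_direct, s2_id, s2_two)   # the three values coincide to rounding
```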
This procedure can be taken a step further. To see this, let us recall the general formula for the variance of a weighted mean value $\bar{\bar{x}}$, based on $m$ measured results $\bar{x}_j$ with statistical weights $g_j$. This expression is known to be given by
$$ s^2(\bar{\bar{x}}) \;=\; \frac{\sum_{j=1}^{m} g_j \left( \bar{x}_j - \bar{\bar{x}} \right)^2}{(m-1) \sum_{j=1}^{m} g_j} \,. \qquad\qquad (7) $$
Since in our case, i.e. for the measured means $\bar{x}_j$, the sample sizes $n_j$ correspond to the weights $g_j$, we have the relation
$$ \sum_{j=1}^{m} n_j \left( \bar{x}_j - \bar{\bar{x}} \right)^2 \;\cong\; (m-1)\, s^2 , \qquad\qquad (8) $$
as $\sum_{j=1}^{m} g_j = N$ and $s^2(\bar{\bar{x}}) \cong s^2 / N$.
Note, however, that the last equation is not an identity: it is a statistical relation. Thus, it does not always hold numerically. It is true on the average and requires that experimental conditions do not change.
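This statistical character of Equation (8) can be illustrated by simulation: for any single data set the scatter term deviates from $(m-1)\,s^2$, but its average over many repetitions comes out right. The sketch below uses the true population variance in place of $s^2$ for simplicity; the population, the sample sizes and the number of repetitions are arbitrary assumptions made for this illustration.

```python
# Monte Carlo illustration that relation (8) holds only on the average:
# the ratio  sum_j n_j (xbar_j - xbarbar)^2 / ((m - 1) * sigma^2)
# fluctuates strongly for individual data sets but averages close to 1.
import numpy as np

rng = np.random.default_rng(seed=3)
sigma, sizes = 3.0, (8, 12, 20)
m = len(sizes)

ratios = []
for _ in range(20_000):
    samples = [rng.normal(100.0, sigma, size=n) for n in sizes]
    n_j     = np.array(sizes)
    xbar_j  = np.array([x.mean() for x in samples])
    xbarbar = (n_j * xbar_j).sum() / n_j.sum()
    scatter = (n_j * (xbar_j - xbarbar) ** 2).sum()
    ratios.append(scatter / ((m - 1) * sigma ** 2))

print(np.mean(ratios), np.std(ratios))   # mean close to 1, large spread
```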
If we take advantage of Equation (8), Equation (5) can be brought into the form
$$ (N-1)\, s^2 \;\cong\; \sum_{j=1}^{m} (n_j - 1)\, s_j^2 \;+\; (m-1)\, s^2 , $$
or
$$ (N-m)\, s^2 \;\cong\; \sum_{j=1}^{m} (n_j - 1)\, s_j^2 . $$
Hence, we arrive at the general formula
$$ s^2 \;\cong\; \frac{1}{N-m} \sum_{j=1}^{m} (n_j - 1)\, s_j^2 \,, \qquad\qquad (9) $$
which allows us to obtain the required variance $s^2$ from those measured in the samples. This form is quite remarkable in that the experimental mean values $\bar{x}_j$, present in Equation (5), have disappeared.
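Since Equation (8) is only a statistical relation, the pooled estimate (9) does not coincide exactly with the full expression (5) for a given data set, although the two remain close under stable conditions. A brief comparison, under the same illustrative assumptions as in the sketches above:

```python
# Comparison of the pooled estimate (9), which needs only the sample
# variances, with the full estimate (3)/(5); the small difference reflects
# the statistical nature of relation (8). Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(seed=4)
samples = [rng.normal(10.0, 1.5, size=n) for n in (6, 9, 15, 10)]
n_j  = np.array([len(x) for x in samples])
s2_j = np.array([x.var(ddof=1) for x in samples])
N, m = n_j.sum(), len(samples)

s2_pooled = ((n_j - 1) * s2_j).sum() / (N - m)   # Equation (9)
s2_full   = np.concatenate(samples).var(ddof=1)  # Equations (3)/(5)

print(s2_pooled, s2_full)   # close, but in general not identical
```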
For samples of equal size ($n_j = n$, hence $N = m\,n$) we are readily led to the expression
$$ s^2 \;\cong\; \frac{1}{m} \sum_{j=1}^{m} s_j^2 \,, \qquad\qquad (10) $$
for which even a knowledge of the sample size $n$ is no longer required.
A comparison of Equation (10) with Equation (3) shows that, at least for equal groups of measurements obtained in stable conditions, pooled estimates of both the expectation value and the variance are obtained by simply forming arithmetical means. The reader, in hindsight, should feel free to find this result either trivial or quite remarkable. It is likely, although not proven, that analogous relations hold for the central moments of higher order.
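A toy check of this remark, again under the illustrative assumptions used above: for equal sample sizes, both the pooled mean and the pooled variance of Equation (10) reduce to plain arithmetic means of the per-sample results.

```python
# For equal sample sizes n_j = n: the pooled mean equals the mean of the
# sample means, and the pooled variance (10) equals the mean of the sample
# variances (and coincides with Equation (9)). Purely illustrative data.
import numpy as np

rng = np.random.default_rng(seed=5)
n, m = 10, 5
samples = [rng.normal(0.0, 1.0, size=n) for _ in range(m)]

xbar_j = np.array([x.mean() for x in samples])
s2_j   = np.array([x.var(ddof=1) for x in samples])

print(xbar_j.mean(), np.concatenate(samples).mean())        # pooled mean = mean of means
print(s2_j.mean(), ((n - 1) * s2_j).sum() / (m * n - m))    # Equation (10) = Equation (9)
```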
References
[1] J.W. Müller: “Moments d’échantillons superposés”, BIPM WPN-215 (1980).