Census this, sample that

Written by B Narayanaswamy

October 21, 2010 22:42 IST

Censuses are rare. The only estimates that we have generally come from sample surveys. Rare though a census may be, let us assume for a moment that we do a census of all pupils in classes 10, 11 and 12 in a school. We ask them how many hours a day they spend on Facebook. Let us say the results of this hypothetical census are as follows. First, all the kids have a Facebook account. Second, the time spent ranges from 0-6 hours. There is just one kid who says he spends 6 hours. The next highest is 4 hours, at and around which there is a cluster of kids. Third, the average time spent is two and a half hours. This is from a census. Would we still have got the same results had we only drawn a sample instead?

Intuition tells us that the estimate from the sample should hover within some range around the "truth" as yielded by a census. A theorem called the Central Limit Theorem tells us that there is actually a pattern to this hovering of sample estimates around the unknown population value.

The Central Limit Theorem tells us, specifically, that if we were to estimate the average repeatedly by taking different samples each time, then, provided the samples are reasonably large, the estimates of the average from these different samples will approximately follow a normal distribution. They will form a bell curve centered on the population average.

Note that we are talking about all possible samples of a given size drawn from a population of a much larger size. There would be a large number of samples whose average is somewhere in the middle. There will be whacko samples whose averages are in the extremes but there will only be a few of these.
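To make the bell curve concrete, here is a minimal simulation sketch in Python. The population is made up for illustration (300 pupils, most clustered around two and a half hours, plus the lone 6-hour outlier), and the sample size of 40 and the 5,000 repeated draws are assumptions, not figures from any real survey.

```python
import random
import statistics


def build_population(size=300, seed=1):
    """Hypothetical census: daily Facebook hours per pupil (made-up data)."""
    rng = random.Random(seed)
    # Most pupils fall between 0 and 4.5 hours, centred near 2.5 hours.
    hours = [min(4.5, max(0.0, rng.gauss(2.5, 1.0))) for _ in range(size - 1)]
    hours.append(6.0)  # the single outlier who reports 6 hours
    return hours


def sample_means(population, sample_size, draws, seed=2):
    """Draw many samples without replacement and return each sample's average."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(draws)
    ]


population = build_population()
true_mean = statistics.mean(population)
means = sample_means(population, sample_size=40, draws=5000)

print(f"census average:             {true_mean:.2f}")
print(f"average of sample averages: {statistics.mean(means):.2f}")
print(f"spread of sample averages:  {statistics.stdev(means):.2f}")
```

Most of the 5,000 sample averages land close to the census average, and only a few "whacko" ones stray far from it.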

Continuing with the school example, it's intuitively obvious that all the kids in any sample will have an account anyway. After all, the census says all the kids have a Facebook account. But we would expect differences on the "time spent". Maybe the sample includes that outlier kid who spends 6 hours. Maybe it does not. Either way, that will make a difference to the estimate of the average time spent. But in the course of our daily work, we usually have only one sample and its estimate. And we do not know which particular one this sample is, out of the myriad samples that make up the Theorem's bell curve.

How do we then arrive at an estimate of the population average? We qualify the estimate from the sample with a plus or minus figure to arrive at a range. The true population average is expected to lurk somewhere within this range. The plus or minus figure is the sampling error.
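As a rough sketch of how the plus or minus figure can be computed from a single sample, the Python snippet below uses the usual normal approximation: the sample's standard deviation divided by the square root of the sample size, multiplied by 1.96 for the conventional 95% level. The function name `mean_with_margin` and the sample data are hypothetical, chosen only for illustration.

```python
import math
import random
import statistics


def mean_with_margin(sample, z=1.96):
    """Return (sample average, plus-or-minus figure) via the normal approximation.

    z=1.96 corresponds to the conventional 95% level (an assumption here).
    """
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * standard_error


rng = random.Random(7)
# Hypothetical sample of 40 pupils' daily hours (made-up numbers).
sample = [min(6.0, max(0.0, rng.gauss(2.5, 1.0))) for _ in range(40)]

mean, margin = mean_with_margin(sample)
print(f"estimate: {mean:.2f} hours, plus or minus {margin:.2f}")
```

The range from `mean - margin` to `mean + margin` is the band within which the population average is expected to lie.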

Of course, the sample size has a huge role to play in this. Sample size is analogous to the megapixels in a camera. The higher it is, the more one can "zoom" into the picture: analyse the sub-samples and the various slices and dices of the sample. In a real sense, this yields precision. Moreover, as a rule of thumb, it's only when the sample size is 30 or thereabouts that the curve of the sampling distribution of the mean starts to assume a bell-like shape.

There is a second angle to this, just as in a camera there is the sensor size apart from the megapixels. The "range" above can be narrow or wide, depending not only on the sample size but also on a second factor: the confidence level, typically expressed as a percentage, 95% for example.

So, there are two parts to the precision of a sample estimate: the confidence level and the width of the error band. If we need both a high confidence level and a narrow plus or minus band, then yes, we need a large sample size. That's how the threesome links up: the confidence level, the width of the error band and the sample size. So, for any given confidence level (say 95%), as the sample size increases it will yield estimates whose precision band is progressively narrower.
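The link between the three can be sketched in a few lines of Python. This is an illustration under stated assumptions: it uses the normal approximation, ignores any correction for a small population, and takes a standard deviation of 1.0 hour as a made-up figure. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """Two-sided z multiplier for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def margin_of_error(std_dev, sample_size, confidence=0.95):
    """Width of the plus-or-minus band for a sample average."""
    return z_value(confidence) * std_dev / math.sqrt(sample_size)


def required_sample_size(std_dev, margin, confidence=0.95):
    """Smallest sample size whose band is no wider than `margin`."""
    return math.ceil((z_value(confidence) * std_dev / margin) ** 2)


ASSUMED_STD_DEV = 1.0  # hours; an illustrative assumption, not survey data

for n in (30, 100, 400, 1600):
    band = margin_of_error(ASSUMED_STD_DEV, n)
    print(f"sample size {n:>4}: plus or minus {band:.3f} hours at 95%")

for level in (0.90, 0.95, 0.99):
    n = required_sample_size(ASSUMED_STD_DEV, margin=0.1, confidence=level)
    print(f"confidence {level:.0%}: need about {n} pupils for a 0.1-hour band")
```

Two things stand out. Quadrupling the sample size only halves the band, and asking for a higher confidence level with the same band pushes the required sample size up.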

In sum, although we have only a sample survey estimate, and do not even know where exactly this sample resides on the Central Limit Theorem's bell curve, the Theorem still enables us to provide a population estimate, albeit as a range centered on the sample estimate.

As always, common sense is important. We can be 100% certain in the Facebook example above with a not-very-helpful observation that the population average is somewhere within the extremes (0-6 hours). And I'm sure the parents of such teenagers will aver that it's a hypothetical census, or else demand a recount. I mean, like, just one kid who spends 6 hours a day on Facebook!

The author is president, Ipsos Indica Research. Views are personal.

This article was first uploaded on October twenty-one, twenty ten, at forty-two minutes past ten in the night.