Sampling, Bias & Inference
You can’t measure every person, test every product, or survey every voter. Instead, you take a sample — a smaller group — and use it to draw conclusions about the whole population. But how big does the sample need to be? And what can go wrong?
Part 1: The Population Distribution
Imagine a population with some average value (the population mean mu) and some natural variation (the population standard deviation sigma):
This is the true distribution. In real life, we usually don’t know its exact shape — that’s the whole point of sampling. We’re trying to learn about this curve by taking samples.
Part 2: Sampling Distribution of the Mean
When you take a sample of size n and compute its mean, that sample mean is itself a random variable. If you repeated the sampling many times, the sample means would form their own distribution — the sampling distribution.
The Central Limit Theorem (CLT) tells us something remarkable:
The sample means center on the population mean, but their spread shrinks as the sample size increases!
Watch how the sampling distribution tightens as the sample size grows:
- n = 1: The sampling distribution IS the population — no improvement
- n = 4: The spread is halved (sigma/sqrt(4) = sigma/2)
- n = 25: The spread is 1/5 of the original
- n = 100: The spread is 1/10 — sample means cluster tightly around mu
This is the magic of the CLT: larger samples give more precise estimates!
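A quick simulation makes this concrete. The sketch below uses plain Python with an illustrative population of mu = 100 and sigma = 15 (these numbers are made up for the example, not taken from the text): it draws many samples at each size and measures the spread of their means.

```python
import random
import statistics

# Hypothetical population parameters, chosen for illustration.
MU, SIGMA = 100.0, 15.0
TRIALS = 20_000  # number of repeated samples at each size

random.seed(42)

def sample_mean_spread(n):
    """Empirical standard deviation of the mean of n draws,
    measured over many repeated samples."""
    means = [statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))
             for _ in range(TRIALS)]
    return statistics.pstdev(means)

for n in (1, 4, 25, 100):
    print(f"n={n:>3}: empirical spread = {sample_mean_spread(n):6.3f}, "
          f"theory sigma/sqrt(n) = {SIGMA / n ** 0.5:6.3f}")
```

At every n, the empirical spread lands right on sigma/sqrt(n), which is exactly the CLT prediction: n = 4 halves the spread, n = 100 cuts it to a tenth.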
Part 3: The Square Root Law
Notice how increasing n from 1 to 4 helps a lot, but going from 25 to 100 helps much less? That's because the spread shrinks with sqrt(n), not with n itself:
Diminishing returns: To cut the standard error in half, you need to quadruple the sample size. Going from n=100 to n=400 gives the same improvement as going from n=1 to n=4. This is why polling organizations can survey only 1000 people and get accurate results — but surveying 4000 people doesn’t improve accuracy by 4x, only by 2x.
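The arithmetic behind these diminishing returns is easy to check directly. This snippet uses an arbitrary sigma = 10 chosen only for illustration:

```python
# Standard error formula: SE = sigma / sqrt(n).
sigma = 10.0  # illustrative value, not from the text

for n in (1, 4, 100, 400):
    print(f"n={n:>3}: SE = {sigma / n ** 0.5:.2f}")

# Quadrupling n (1 -> 4, or 100 -> 400) halves the standard error;
# a 4x improvement would instead require 16x the sample size.
```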
Part 4: Bias — When Samples Mislead
Even with a large sample, your results can be wrong if the sample is biased — systematically unrepresentative of the population.
A biased sample has its center shifted away from the true population mean:
When bias is 0, the sample is centered on the true population mean — that's a representative sample. As the bias grows, the sample's center drifts further from it.
No amount of increasing sample size fixes bias! A biased sample of 10,000 is still wrong. Random sampling is the key to avoiding bias.
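To see that sample size cannot cure bias, compare a huge biased sample against the truth. The simulation below is a sketch: the population mean (100), spread (15), and a systematic shift of +5 in who gets reached are all made-up illustrative numbers.

```python
import random
import statistics

random.seed(1)

MU, SIGMA = 100.0, 15.0   # hypothetical population parameters
BIAS = 5.0                # systematic shift in who actually gets sampled

def biased_sample_mean(n):
    """Mean of n draws from a population shifted by BIAS, e.g. a survey
    that only reaches respondents whose values run high."""
    return statistics.fmean(random.gauss(MU + BIAS, SIGMA) for _ in range(n))

for n in (100, 10_000):
    print(f"n={n:>6}: sample mean = {biased_sample_mean(n):.2f} "
          f"(true mean = {MU})")
```

Even at n = 10,000 the estimate sits near 105, not 100: more data makes a biased answer more precise, not more correct.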
Part 5: Confidence — Precision vs. Sample Size
As sample size grows, we become more confident about where the population mean is. A confidence interval gets narrower with more data:
Challenge: A polling company wants to estimate voter support with a margin of error of 3 percentage points. For a percentage, sigma is at most about 50 points (the worst case, when support is near 50%).
The margin of error is approximately 1.96 * sigma / sqrt(n). Set up the equation: 3 = 1.96 * 50 / sqrt(n). Solve for n.
How many voters do they need to survey?
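Working through the setup above: rearrange 3 = 1.96 * 50 / sqrt(n) to n = (1.96 * 50 / 3)^2.

```python
import math

sigma = 50.0  # worst-case spread for a percentage (support near 50%)
m = 3.0       # desired margin of error, in percentage points

# Rearranging m = 1.96 * sigma / sqrt(n) gives n = (1.96 * sigma / m) ** 2.
n = (1.96 * sigma / m) ** 2
print(f"n = {n:.1f}, round up to {math.ceil(n)}")  # -> n = 1067.1, round up to 1068
```

So roughly a thousand respondents suffice — which matches the claim in Part 3 that polling organizations can survey about 1000 people and get accurate results.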
Wrapping Up
| Concept | Key Idea |
|---|---|
| Sampling distribution | Distribution of sample means from repeated sampling |
| Central Limit Theorem | Sample means are approximately normal with standard deviation sigma/sqrt(n) |
| Standard error | sigma/sqrt(n) — shrinks with sample size |
| Bias | Systematic error that doesn’t decrease with sample size |
| Confidence interval | Range that likely contains the true population mean |
The Central Limit Theorem is one of the most powerful ideas in statistics. It tells you that no matter what the population looks like, sample means will be approximately normal — and the precision improves predictably with sample size.
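As a closing check of the "no matter what the population looks like" claim, the sketch below samples from a heavily skewed exponential population (mean 1, an arbitrary choice for the example) and shows that the sample means still behave as the CLT predicts:

```python
import random
import statistics

random.seed(0)

# A deliberately skewed population: exponential with mean 1 (sigma = 1).
def draw():
    return random.expovariate(1.0)

n, trials = 50, 20_000
means = [statistics.fmean(draw() for _ in range(n)) for _ in range(trials)]

# Despite the skewed source, the sample means cluster around the
# population mean (1.0) with spread close to sigma/sqrt(n) = 1/sqrt(50).
print(round(statistics.fmean(means), 2), round(statistics.pstdev(means), 3))
```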