Statistics

Sampling, Bias & Inference

You can’t measure every person, test every product, or survey every voter. Instead, you take a sample — a smaller group — and use it to draw conclusions about the whole population. But how big does the sample need to be? And what can go wrong?


Part 1: The Population Distribution

Imagine a population with some average value (the population mean mu) and some natural variation (the population standard deviation sigma):

[Interactive: sliders for the population mean mu (range -3 to 3, default 0) and population sigma (range 0.5 to 3, default 2) reshape the bell curve.]

This is the true distribution. In real life, we usually don’t know its exact shape — that’s the whole point of sampling. We’re trying to learn about this curve by taking samples.


Part 2: Sampling Distribution of the Mean

When you take a sample of size n and compute its mean, that sample mean is itself a random variable. If you repeated the sampling many times, the sample means would form their own distribution — the sampling distribution.

The Central Limit Theorem (CLT) tells us something remarkable:

\text{Sampling distribution of } \bar{x}: \quad \mu_{\bar{x}} = \mu, \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}

The sample means center on the population mean, but their spread shrinks as the sample size increases!

[Interactive: a sample-size slider (n from 1 to 100, default 4) overlays the green sampling distribution, with sigma_xbar = sigma/sqrt(n), on the population curve (sigma = 2).]
Try This

Drag the sample size slider and watch the green curve:

  • n = 1: The sampling distribution IS the population — no improvement
  • n = 4: The spread is halved (sigma/sqrt(4) = sigma/2)
  • n = 25: The spread is 1/5 of the original
  • n = 100: The spread is 1/10 — sample means cluster tightly around mu

This is the magic of the CLT: larger samples give more precise estimates!
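A quick simulation makes the CLT concrete. This sketch (using NumPy, with mu = 0 and sigma = 2 to mirror the example above) draws many samples at each size and compares the empirical spread of the sample means against the theoretical sigma/sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0        # population parameters from the example above
trials = 100_000            # repeated samples per setting

empirical_se = {}
for n in (1, 4, 25, 100):
    # Draw many samples of size n; each row's mean is one draw from the
    # sampling distribution of the mean.
    means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    empirical_se[n] = means.std()
    print(f"n={n:3d}  empirical SE={empirical_se[n]:.3f}  "
          f"theory sigma/sqrt(n)={sigma / n**0.5:.3f}")
```

The empirical standard errors track 2, 1, 0.4, and 0.2 almost exactly, just as the slider demo shows.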


Part 3: The Square Root Law

Notice how increasing n from 1 to 4 helps a lot, but going from 25 to 100 helps less dramatically? That’s because the spread shrinks with sqrt(n), not n itself:

[Interactive: a sample-size slider (n from 1 to 200, default 10) shows the standard error sigma/sqrt(n); the plot compares the population to the distribution of sample means.]
Connection

Diminishing returns: To cut the standard error in half, you need to quadruple the sample size. Going from n=100 to n=400 gives the same improvement as going from n=1 to n=4. This is why polling organizations can survey only 1000 people and get accurate results — but surveying 4000 people doesn’t improve accuracy by 4x, only by 2x.
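The square root law is easy to verify directly. A minimal sketch (assuming sigma = 2 as in the earlier example):

```python
def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / n ** 0.5

sigma = 2.0
for n in (1, 4, 25, 100, 400):
    print(f"n={n:4d}  SE={standard_error(sigma, n):.3f}")

# Quadrupling n from 100 to 400 halves the standard error,
# the same relative gain as going from n=1 to n=4.
```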


Part 4: Bias — When Samples Mislead

Even with a large sample, your results can be wrong if the sample is biased — systematically unrepresentative of the population.

A biased sample has its center shifted away from the true population mean:

[Interactive: a bias slider (range -3 to 3, default 0) shifts the biased sample (n = 10) relative to the true population.]

\text{Bias} = \bar{x}_{\text{biased}} - \mu = 0
Try This

When bias is 0, the sample is centered on the true population mean — that’s a representative sample. As you increase the bias, the sample’s center moves away.

No amount of increasing sample size fixes bias! A biased sample of 10,000 is still wrong. Random sampling is the key to avoiding bias.
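A simulation shows why. In this sketch, the biased sampling process is modeled as drawing from a distribution shifted by +1 (a hypothetical bias; mu = 0 and sigma = 2 as before). No matter how large n gets, the sample mean converges to the wrong value:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, bias = 0.0, 2.0, 1.0   # the sampling process over-represents high values

means = {}
for n in (100, 10_000, 1_000_000):
    # A biased process effectively samples from a shifted distribution
    biased_sample = rng.normal(mu + bias, sigma, size=n)
    means[n] = biased_sample.mean()
    print(f"n={n:>9,}  sample mean={means[n]:+.3f}  (true mu = {mu})")
```

The sample mean settles ever more tightly around mu + bias = 1, not around the true mu = 0: more data just makes you more confidently wrong.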


Part 5: Confidence — Precision vs. Sample Size

As sample size grows, we become more confident about where the population mean is. A confidence interval gets narrower with more data:

[Interactive: a sample-size slider (n from 2 to 100, default 10) shows the 95% confidence region narrowing as n grows.]

\text{95\% CI width} \approx 2 \times 1.96 \times \frac{\sigma}{\sqrt{n}}
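Here is a sketch of computing one such interval, assuming sigma is known (mu = 0, sigma = 2, n = 25 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 0.0, 2.0, 25
sample = rng.normal(mu, sigma, size=n)

se = sigma / np.sqrt(n)                     # standard error of the mean
lo = sample.mean() - 1.96 * se              # lower 95% bound
hi = sample.mean() + 1.96 * se              # upper 95% bound
print(f"95% CI: [{lo:.2f}, {hi:.2f}]  width = {hi - lo:.2f}")
```

With n = 25 the width is 2 × 1.96 × 2/5 ≈ 1.57; quadruple n to 100 and it halves to about 0.78.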
Challenge

Challenge: A polling company wants to estimate voter support with a margin of error of 3 percentage points (for a percentage, the worst-case sigma is about 50, which occurs when support is near 50%).

The margin of error is approximately 1.96 * sigma / sqrt(n). Set up the equation: 3 = 1.96 * 50 / sqrt(n). Solve for n.

How many voters do they need to survey?
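Once you have worked it out on paper, this snippet is one way to check your answer (rearranging the margin-of-error formula to n = (1.96 × sigma / margin)²):

```python
import math

margin, sigma = 3.0, 50.0
# Rearrange margin = 1.96 * sigma / sqrt(n)  =>  n = (1.96 * sigma / margin)^2
n = (1.96 * sigma / margin) ** 2
print(f"required n = {n:.1f}  ->  survey at least {math.ceil(n)} voters")
```

This is why national polls so often report samples of roughly a thousand respondents.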


Wrapping Up

Concept                   Key Idea
Sampling distribution     Distribution of sample means from repeated sampling
Central Limit Theorem     Sample means are approximately normal, with standard deviation sigma/sqrt(n)
Standard error            sigma/sqrt(n); shrinks as sample size grows
Bias                      Systematic error that does not decrease with sample size
Confidence interval       Range that likely contains the true population mean

The Central Limit Theorem is one of the most powerful ideas in statistics. It tells you that no matter what the population looks like, sample means will be approximately normal — and the precision improves predictably with sample size.

Take the Quiz