Statistics

Sampling, Bias & Inference

You can’t measure every person, test every product, or survey every voter. Instead, you take a sample — a smaller group — and use it to draw conclusions about the whole population. But how big does the sample need to be? And what can go wrong?


Part 1: The Population Distribution

Imagine a population with some average value (the population mean mu) and some natural variation (the population standard deviation sigma):

[Interactive: sliders for the population mean mu (range -3 to 3, default 0) and population sigma (range 0.5 to 3, default 2) reshape the bell curve.]

This is the true distribution. In real life, we usually don’t know its exact shape — that’s the whole point of sampling. We’re trying to learn about this curve by taking samples.


Part 2: Sampling Distribution of the Mean

When you take a sample of size n and compute its mean, that sample mean is itself a random variable. If you repeated the sampling many times, the sample means would form their own distribution — the sampling distribution.

The Central Limit Theorem (CLT) tells us something remarkable:

\text{Sampling distribution of } \bar{x}: \quad \mu_{\bar{x}} = \mu, \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}

The sample means center on the population mean, but their spread shrinks as the sample size increases!

[Interactive: a sample-size slider (n from 1 to 100, default 4) overlays the green sampling distribution, with sigma_xbar = sigma/sqrt(n), on the population curve (sigma = 2).]
Try This

Drag the sample size slider and watch the green curve:

  • n = 1: The sampling distribution IS the population — no improvement
  • n = 4: The spread is halved (sigma/sqrt(4) = sigma/2)
  • n = 25: The spread is 1/5 of the original
  • n = 100: The spread is 1/10 — sample means cluster tightly around mu

This is the magic of the CLT: larger samples give more precise estimates!
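A quick simulation makes the CLT concrete. This sketch (using NumPy, with mu = 0 and sigma = 2 to mirror the example above) draws many samples at each size and compares the empirical spread of the sample means against the theoretical sigma/sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0        # population parameters from the example above
trials = 100_000            # repeated samples per setting

empirical_se = {}
for n in (1, 4, 25, 100):
    # Draw many samples of size n; each row's mean is one draw from the
    # sampling distribution of the mean.
    means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    empirical_se[n] = means.std()
    print(f"n={n:3d}  empirical SE={empirical_se[n]:.3f}  "
          f"theory sigma/sqrt(n)={sigma / n**0.5:.3f}")
```

The empirical standard errors track 2, 1, 0.4, and 0.2 almost exactly, just as the slider demo shows.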


Part 3: The Square Root Law

Notice how increasing n from 1 to 4 helps a lot, but going from 25 to 100 helps less dramatically? That’s because the spread shrinks with sqrt(n), not n itself:

[Interactive: a sample-size slider (n from 1 to 200, default 10) shows the standard error sigma/sqrt(n); the plot compares the population to the distribution of sample means.]
Connection

Diminishing returns: To cut the standard error in half, you need to quadruple the sample size. Going from n=100 to n=400 gives the same improvement as going from n=1 to n=4. This is why polling organizations can survey only 1000 people and get accurate results — but surveying 4000 people doesn’t improve accuracy by 4x, only by 2x.
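The square root law is easy to verify directly. A minimal sketch (assuming sigma = 2 as in the earlier example):

```python
def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / n ** 0.5

sigma = 2.0
for n in (1, 4, 25, 100, 400):
    print(f"n={n:4d}  SE={standard_error(sigma, n):.3f}")

# Quadrupling n from 100 to 400 halves the standard error,
# the same relative gain as going from n=1 to n=4.
```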


Part 4: Bias — When Samples Mislead

Even with a large sample, your results can be wrong if the sample is biased — systematically unrepresentative of the population.

A biased sample has its center shifted away from the true population mean:

[Interactive: a bias slider (range -3 to 3, default 0) shifts the biased sample (n = 10) relative to the true population.]

\text{Bias} = \bar{x}_{\text{biased}} - \mu = 0
Try This

When bias is 0, the sample is centered on the true population mean — that’s a representative sample. As you increase the bias, the sample’s center moves away.

No amount of increasing sample size fixes bias! A biased sample of 10,000 is still wrong. Random sampling is the key to avoiding bias.
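A simulation shows why. In this sketch, the biased sampling process is modeled as drawing from a distribution shifted by +1 (a hypothetical bias; mu = 0 and sigma = 2 as before). No matter how large n gets, the sample mean converges to the wrong value:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, bias = 0.0, 2.0, 1.0   # the sampling process over-represents high values

means = {}
for n in (100, 10_000, 1_000_000):
    # A biased process effectively samples from a shifted distribution
    biased_sample = rng.normal(mu + bias, sigma, size=n)
    means[n] = biased_sample.mean()
    print(f"n={n:>9,}  sample mean={means[n]:+.3f}  (true mu = {mu})")
```

The sample mean settles ever more tightly around mu + bias = 1, not around the true mu = 0: more data just makes you more confidently wrong.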


Part 5: Confidence — Precision vs. Sample Size

As sample size grows, we become more confident about where the population mean is. A confidence interval gets narrower with more data:

[Interactive: a sample-size slider (n from 2 to 100, default 10) shows the 95% confidence region narrowing as n grows.]

\text{95\% CI width} \approx 2 \times 1.96 \times \frac{\sigma}{\sqrt{n}}
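Here is a sketch of computing one such interval, assuming sigma is known (mu = 0, sigma = 2, n = 25 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 0.0, 2.0, 25
sample = rng.normal(mu, sigma, size=n)

se = sigma / np.sqrt(n)                     # standard error of the mean
lo = sample.mean() - 1.96 * se              # lower 95% bound
hi = sample.mean() + 1.96 * se              # upper 95% bound
print(f"95% CI: [{lo:.2f}, {hi:.2f}]  width = {hi - lo:.2f}")
```

With n = 25 the width is 2 × 1.96 × 2/5 ≈ 1.57; quadruple n to 100 and it halves to about 0.78.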
Challenge

Challenge: A polling company wants to estimate voter support with a margin of error of 3 percentage points (for a percentage, the worst-case sigma is about 50, which occurs when support is near 50%).

The margin of error is approximately 1.96 * sigma / sqrt(n). Set up the equation: 3 = 1.96 * 50 / sqrt(n). Solve for n.

How many voters do they need to survey?
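Once you have worked it out on paper, this snippet is one way to check your answer (rearranging the margin-of-error formula to n = (1.96 × sigma / margin)²):

```python
import math

margin, sigma = 3.0, 50.0
# Rearrange margin = 1.96 * sigma / sqrt(n)  =>  n = (1.96 * sigma / margin)^2
n = (1.96 * sigma / margin) ** 2
print(f"required n = {n:.1f}  ->  survey at least {math.ceil(n)} voters")
```

This is why national polls so often report samples of roughly a thousand respondents.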


Wrapping Up

Concept                   Key Idea
Sampling distribution     Distribution of sample means from repeated sampling
Central Limit Theorem     Sample means are approximately normal, with standard deviation sigma/sqrt(n)
Standard error            sigma/sqrt(n); shrinks as sample size grows
Bias                      Systematic error that does not decrease with sample size
Confidence interval       Range that likely contains the true population mean

The Central Limit Theorem is one of the most powerful ideas in statistics. It tells you that no matter what the population looks like, sample means will be approximately normal — and the precision improves predictably with sample size.

Take the Quiz