We often work with small datasets and rely on sample variances estimated from them. One thing I have come across multiple times is that we lack the intuition for how unreliable sample variance estimates from small samples can be. So I resort to a quick simulated example to show the effect of sample size on sample variance estimates.
Here is a quick post on the danger of using small samples to estimate variance. We rely on simulation and use a really simple scenario. The basic idea is as follows:
- Simulate a small sample (sample_size = 3) from a known distribution, say a Normal distribution with unit variance and some mean value, and compute the sample variance (and sample mean)
- Repeat the experiment 1000 times and visualize the estimated sample variances as a histogram to see how much they vary
Then we do this whole thing again, but this time with data of a more reasonable sample size (sample_size = 100), and look at the distribution of sample variances.
Let us get started by loading tidyverse.
library(tidyverse)
We will randomly sample data from a Normal distribution with mean 4 and unit variance. For example, a sample of size 3 will look like
rnorm(n=3, mean=4, sd=1)
## [1] 5.427710 2.278588 3.287849
We will repeat this experiment 1000 times, with the same mean and variance and with sample size 3. To do this we first create a tibble with 1000 rows and use the map() function from purrr to generate 1000 samples of size 3. We keep all the samples as a list column in the tibble.
tibble(rep=1:1000) %>%
  mutate(samples_n3 = map(rep,~rnorm(3, mean=4))) %>%
  head()
## # A tibble: 6 × 2
##     rep samples_n3
##   <int> <list>
## 1     1 <dbl [3]>
## 2     2 <dbl [3]>
## 3     3 <dbl [3]>
## 4     4 <dbl [3]>
## 5     5 <dbl [3]>
## 6     6 <dbl [3]>
Let us take a look at the simulated samples/data.
tibble(rep=1:1000) %>%
  mutate(samples_n3=map(rep, ~rnorm(3, mean=4))) %>%
  head() %>%
  pull(samples_n3)
Here are the first six of the 1000 samples. Each one has three data points generated from a normal distribution with mean 4 and unit variance.
## [[1]]
## [1] 2.582321 3.916076 4.397008
##
## [[2]]
## [1] 3.062093 4.222721 3.945142
##
## [[3]]
## [1] 3.547045 5.539474 4.279876
##
## [[4]]
## [1] 4.149406 3.449190 4.609996
##
## [[5]]
## [1] 4.284263 4.145532 2.996633
##
## [[6]]
## [1] 3.770074 5.601238 3.559961
Now let us compute the sample variance for each simulated sample. We use the map_dbl() function from purrr to compute the variance of each sample and add it as a new column.
Here we have named the sample variance column to reflect that it is a variance computed from data with sample size 3.
tibble(rep=1:1000) %>%
  mutate(samples_n3 = map(rep,~rnorm(3, mean=4)),
         variance_n3 = map_dbl(samples_n3, ~var(.))) %>%
  head()
## # A tibble: 6 × 3
##     rep samples_n3 variance_n3
##   <int> <list>           <dbl>
## 1     1 <dbl [3]>        0.238
## 2     2 <dbl [3]>        0.610
## 3     3 <dbl [3]>        0.556
## 4     4 <dbl [3]>        0.730
## 5     5 <dbl [3]>        0.607
## 6     6 <dbl [3]>        0.542
Let us store the samples together with their means and variances in a variable, setting a seed first so the results are reproducible.
set.seed(42)
df <- tibble(rep = 1:1000) %>%
  mutate(samples_n3 = map(rep,~rnorm(3, mean=4)),
         means_n3 = map_dbl(samples_n3, ~mean(.)),
         variance_n3 = map_dbl(samples_n3, ~var(.)))
Distribution of sample variance when sample size is 3
First, let us look at the distribution of the sample variances that we computed from data of sample size 3.
df %>%
  ggplot(aes(x = variance_n3))+
  geom_histogram(bins = 100, color = "white")+
  geom_vline(xintercept = 1, color = "red", size = 2)+
  labs(x= "Sample Variance of N(4,1)", title="Sample Size = 3")
And this is what the distribution looks like. It is strongly right-skewed: most of the estimates pile up near zero, with a long tail stretching to the right. About 60% of the time, a sample variance estimate from a sample of size 3 is an underestimate, i.e. less than 1, the true value. We also see outliers where the sample variance is much higher than the true variance. For some samples it is as large as 8, even though the data was generated from a normal distribution with unit variance.
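The "about 60%" figure can be checked against theory. For normally distributed data, (n - 1) * S^2 / sigma^2 follows a chi-square distribution with n - 1 degrees of freedom, so the chance that a single estimate falls below the true variance is P(chi-square with n - 1 df < n - 1). The sketch below is not part of the original simulation: the variable names are illustrative, and it uses 10000 replicates rather than 1000 to reduce simulation noise.

```r
# Sketch: share of underestimates, theory vs. simulation (illustrative names).
n      <- 3   # sample size
sigma2 <- 1   # true variance of N(4, 1)

# For normal data, (n - 1) * S^2 / sigma^2 ~ chi-square with n - 1 df,
# so P(S^2 < sigma^2) = P(chi-square_{n-1} < n - 1).
p_theory <- pchisq(n - 1, df = n - 1)

# Simulated counterpart: 10000 sample variances from samples of size n.
set.seed(42)
sim_var <- replicate(10000, var(rnorm(n, mean = 4, sd = sqrt(sigma2))))
p_sim   <- mean(sim_var < sigma2)

c(theory = p_theory, simulated = p_sim, mean_of_estimates = mean(sim_var))
```

For n = 3 the theoretical value is 1 - exp(-1), roughly 0.63. The mean of the estimates should still sit close to 1: the sample variance is unbiased on average, even though most individual estimates come out too small.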
Comparing the distribution of sample variance when sample size is 3 with 100
Clearly, our estimates of the sample variance at a small sample size look bad. How about when the sample size is larger? What does the distribution of the sample variance look like then? We resort to the same simulation, but this time the sample size of the data we simulate is much bigger, 100 instead of 3.
set.seed(42)
df <- tibble(rep=1:1000) %>%
  mutate(samples_n3=map(rep,~rnorm(3, mean=4)),
         means_n3=map_dbl(samples_n3, ~mean(.)),
         variance_n3 = map_dbl(samples_n3, ~var(.))) %>%
  mutate(samples_n100 = map(rep,~rnorm(100, mean=4)),
         means_n100 = map_dbl(samples_n100, ~mean(.)),
         variance_n100 = map_dbl(samples_n100, ~var(.)))
df %>% head()
## # A tibble: 6 × 7
##     rep samples_n3 means_n3 variance_n3 samples_n100 means_n100 variance_n100
##   <int> <list>        <dbl>       <dbl> <list>            <dbl>         <dbl>
## 1     1 <dbl [3]>      4.39       0.937 <dbl [100]>        4.02         0.809
## 2     2 <dbl [3]>      4.31       0.143 <dbl [100]>        3.81         1.00
## 3     3 <dbl [3]>      5.15       1.22  <dbl [100]>        4.08         1.20
## 4     4 <dbl [3]>      5.18       1.39  <dbl [100]>        4.04         1.10
## 5     5 <dbl [3]>      3.40       0.472 <dbl [100]>        3.91         0.897
## 6     6 <dbl [3]>      3.23       2.89  <dbl [100]>        3.95         1.11
Here we simulate data with both small sample sizes and large sample sizes. With that we can compare their distributions of sample variances side-by-side. Let us make histograms of sample variances at sample size 3 and 100.
df %>%
  select(rep, starts_with("var")) %>%
  pivot_longer(-rep, names_to="vars", values_to="sample_variance") %>%
  mutate(vars=fct_recode(vars, `3`="variance_n3", `100`="variance_n100")) %>%
  mutate(vars=fct_relevel(vars, c("3", "100"))) %>%
  ggplot(aes(x=sample_variance, fill=vars))+
  geom_histogram(bins=30, color="white")+
  facet_wrap(~vars, scales="free_x")+
  geom_vline(xintercept = 1, color="black")+
  theme(legend.position = "none")+
  labs(subtitle="Effect of Sample Sizes on Sample Variance")
ggsave("Effect_of_Sample_size_on_Sample_Variance.png")
And we can clearly see the benefit of large sample size on the distribution of sample variances. When the sample size is 100, the distribution of sample variances is not (that) skewed. It does look symmetric with pretty tight distribution.
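A numeric summary makes the same point as the histograms. The block below is a minimal, self-contained sketch rather than the exact data frame built above: it regenerates the two sets of sample variances (so individual values differ from the printed `df`) and the names `sim` and `var_summary` are my own. For normal data the standard deviation of the sample variance is sigma^2 * sqrt(2 / (n - 1)), which is about 1 for n = 3 and about 0.14 for n = 100, so the summarised `sd` column should land near those values.

```r
library(tidyverse)

# Regenerate 1000 sample variances for each of the two sample sizes.
set.seed(42)
sim <- tibble(rep = 1:1000) %>%
  mutate(variance_n3   = map_dbl(rep, ~var(rnorm(3, mean = 4))),
         variance_n100 = map_dbl(rep, ~var(rnorm(100, mean = 4))))

# Centre, spread and 5%/95% quantiles of the sample variances per sample size.
var_summary <- sim %>%
  pivot_longer(-rep, names_to = "vars", values_to = "sample_variance") %>%
  group_by(vars) %>%
  summarise(mean   = mean(sample_variance),
            median = median(sample_variance),
            sd     = sd(sample_variance),
            q05    = quantile(sample_variance, 0.05, names = FALSE),
            q95    = quantile(sample_variance, 0.95, names = FALSE))
var_summary
```

At sample size 3 the median sits well below the mean, which is another way of seeing the skew. At sample size 100 the two should nearly coincide.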
Instead of histograms, we can make a sinaplot (optionally layered over a boxplot) to see the distribution of sample variances at small and large sample sizes. Now we can actually see the individual data points and how tight the distribution of sample variances is at sample size 100.
library(ggforce)
df %>%
  select(rep, starts_with("var")) %>%
  pivot_longer(-rep, names_to="vars", values_to="sample_variance") %>%
  mutate(vars=fct_recode(vars, `3`="variance_n3", `100`="variance_n100")) %>%
  mutate(vars=fct_relevel(vars, c("3", "100"))) %>%
  ggplot(aes(y=sample_variance, x=vars, color=vars))+
  #geom_boxplot(outlier.shape = NA)+
  geom_sina(alpha=0.2)+
  #geom_jitter(width=0.1, alpha=0.1)+
  #facet_wrap(~vars, scales="free_x")+
  geom_hline(yintercept = 1, color="black")+
  theme(legend.position = "none")+
  labs(x= "Sample Size", subtitle="Effect of Sample Sizes on Sample Variance")
ggsave("Effect_of_Sample_size_on_Sample_Variance_boxplot.png")
Effect of Sample size on Sample Variance
So far, we have seen the behaviour of two extreme cases: sample sizes 3 and 100. The natural question that arises is how the distribution changes as we increase the size of the dataset. Is there a sweet spot where the sample size is large enough for practical purposes? In other words, what is the sample size above which the distribution of the sample variance is no longer noticeably skewed?
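Theory gives a rough answer before simulating anything. For normal data the sample variance is a scaled chi-square variable with n - 1 degrees of freedom, and a chi-square distribution with k degrees of freedom has skewness sqrt(8 / k). The short sketch below tabulates that value for a handful of sample sizes. The names are illustrative, and the result only holds under the normality assumption used throughout this post.

```r
# Theoretical skewness of the sample variance for normal data:
# S^2 is a scaled chi-square with n - 1 df, and scaling leaves skewness unchanged.
sample_sizes <- c(3, 6, 10, 30, 100)
skew_theory  <- sqrt(8 / (sample_sizes - 1))

data.frame(n = sample_sizes, skewness = round(skew_theory, 2))
```

The skewness is 2 at n = 3 and drops to about 0.28 at n = 100. It never reaches zero, so there is no sharp cutoff, only a gradual improvement as the sample size grows.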
Let us simulate more datasets, this time with sample sizes n = 3, 6, 10, 30 and 100, to identify the sample sizes needed for approximately symmetric sample variance distributions. And here is how the sample variance distribution changes across sample sizes.
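The code for this step is not shown in the post, so here is one way it could be written, following the same pattern as before. Treat it as a sketch: `sim_sizes`, `p_var` and the use of `expand_grid()` are my own choices, not necessarily how the original figure was made.

```r
library(tidyverse)

set.seed(42)
sample_sizes <- c(3, 6, 10, 30, 100)

# One row per (sample size, replicate); each row holds one simulated sample.
sim_sizes <- expand_grid(sample_size = sample_sizes, rep = 1:1000) %>%
  mutate(samples         = map(sample_size, ~rnorm(.x, mean = 4)),
         sample_mean     = map_dbl(samples, mean),
         sample_variance = map_dbl(samples, var))

# Histogram of sample variances, one panel per sample size.
p_var <- sim_sizes %>%
  mutate(sample_size = factor(sample_size)) %>%
  ggplot(aes(x = sample_variance, fill = sample_size)) +
  geom_histogram(bins = 30, color = "white") +
  facet_wrap(~sample_size, scales = "free") +
  geom_vline(xintercept = 1, color = "black") +
  theme(legend.position = "none") +
  labs(x = "Sample Variance of N(4,1)",
       subtitle = "Effect of Sample Sizes on Sample Variance")
p_var
```

Free scales are used in `facet_wrap()` so that each panel shows the shape of its own distribution. With a shared x-axis the narrow distributions at larger sample sizes would be squeezed against the long tail at n = 3.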
Effect of Sample size on Sample Mean
We just learned that estimating the sample variance is hard and that we need a large sample size to get reliable variance estimates. You might wonder, what about the sample mean? We can use the same data we generated above to estimate the sample mean and see how it varies with sample size.
Immediately we can see that the sample mean doesn't suffer from the same problem as the sample variance at small sample sizes. Its distribution is not skewed at all, and our estimate of the mean improves with sample size as expected.
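As with the variance comparison, the code behind this figure is not included, so the following is a hedged sketch of how the sample means could be simulated and plotted. The names `sim_means`, `mean_summary` and `p_mean` are illustrative. For normal data the sample mean is itself normally distributed with standard deviation sigma / sqrt(n), which the `theory_sd` column uses as a reference.

```r
library(tidyverse)

set.seed(42)
sample_sizes <- c(3, 6, 10, 30, 100)

# 1000 sample means for each sample size, drawn from N(4, 1).
sim_means <- expand_grid(sample_size = sample_sizes, rep = 1:1000) %>%
  mutate(sample_mean = map_dbl(sample_size, ~mean(rnorm(.x, mean = 4))))

# Compare the simulated spread with the theoretical value 1 / sqrt(n).
mean_summary <- sim_means %>%
  group_by(sample_size) %>%
  summarise(mean = mean(sample_mean),
            sd   = sd(sample_mean)) %>%
  mutate(theory_sd = 1 / sqrt(sample_size))
mean_summary

# Histogram of sample means, one panel per sample size, on a shared x-axis.
p_mean <- sim_means %>%
  mutate(sample_size = factor(sample_size)) %>%
  ggplot(aes(x = sample_mean, fill = sample_size)) +
  geom_histogram(bins = 30, color = "white") +
  facet_wrap(~sample_size) +
  geom_vline(xintercept = 4, color = "black") +
  theme(legend.position = "none") +
  labs(x = "Sample Mean of N(4,1)",
       subtitle = "Effect of Sample Sizes on Sample Mean")
p_mean
```

Here the x-axis is shared across panels on purpose: the histograms stay centred on 4 and symmetric at every sample size, and only their width shrinks as n grows.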