On the effect of sample sizes on Sample Variance

June 3, 2022 by cmdlinetips

Often we work with datasets of small sample size and rely on sample variances estimated from them. One thing I have come across multiple times is that we lack intuition for how unreliable sample variance estimates from small datasets can be. So I resort to a quick simulated example to show the effect of sample size on our sample variance estimates.

This is a quick post on the danger of estimating variance from data with a small sample size. We rely on simulation and use a really simple scenario. The basic idea is as follows:

  • Simulate data with a small sample size (sample_size = 3) from a known distribution, say a Normal distribution with unit variance and some mean value, and compute the sample variance (and sample mean).
  • Repeat the experiment 1000 times and visualize the resulting sample variances as a histogram to see how much the estimate varies from one small dataset to the next.

Then we do the whole thing again, but this time with data of a reasonable sample size (sample_size = 100), and look at the distribution of sample variances.

Let us get started by loading tidyverse.

library(tidyverse)

We will randomly sample data from a Normal distribution with mean 4 and unit variance. For example, a sample of size 3 looks like this:

rnorm(n=3, mean=4, sd=1)

## [1] 5.427710 2.278588 3.287849

We will repeat this experiment 1000 times with the same mean and variance and with sample size 3. To do this, we first create a tibble with 1000 rows and use the map() function from purrr to generate 1000 samples of size 3. We keep all the samples as a list column in the tibble.

tibble(rep=1:1000) %>% 
  mutate(samples_n3 = map(rep,~rnorm(3, mean=4))) %>%
  head() 

## # A tibble: 6 × 2
##     rep samples_n3
##   <int> <list>    
## 1     1 <dbl [3]> 
## 2     2 <dbl [3]> 
## 3     3 <dbl [3]> 
## 4     4 <dbl [3]> 
## 5     5 <dbl [3]> 
## 6     6 <dbl [3]>

Let us take a look at the simulated samples/data.

tibble(rep=1:1000) %>% 
  mutate(samples_n3=map(rep, ~rnorm(3, mean=4))) %>%
  head() %>%
  pull(samples_n3)

Here are the first six of the 1000 samples, each with three data points generated from a Normal distribution with mean 4 and unit variance.

## [[1]]
## [1] 2.582321 3.916076 4.397008
## 
## [[2]]
## [1] 3.062093 4.222721 3.945142
## 
## [[3]]
## [1] 3.547045 5.539474 4.279876
## 
## [[4]]
## [1] 4.149406 3.449190 4.609996
## 
## [[5]]
## [1] 4.284263 4.145532 2.996633
## 
## [[6]]
## [1] 3.770074 5.601238 3.559961

Now let us compute the sample variance for each simulated sample. We use the map_dbl() function from purrr to compute the variance of each sample and add it as a new column.

Here we have named the sample variance column to reflect that it is a variance computed from data with sample size 3.

tibble(rep=1:1000) %>% 
  mutate(samples_n3 = map(rep,~rnorm(3, mean=4)),
         variance_n3 = map_dbl(samples_n3, ~var(.))) %>%
  head()

## # A tibble: 6 × 3
##     rep samples_n3 variance_n3
##   <int> <list>           <dbl>
## 1     1 <dbl [3]>        0.238
## 2     2 <dbl [3]>        0.610
## 3     3 <dbl [3]>        0.556
## 4     4 <dbl [3]>        0.730
## 5     5 <dbl [3]>        0.607
## 6     6 <dbl [3]>        0.542

To keep the samples and the computed means and variances, let us save the result in a variable.

set.seed(42)
df <- tibble(rep = 1:1000) %>% 
  mutate(samples_n3 = map(rep,~rnorm(3, mean=4)),
         means_n3 = map_dbl(samples_n3, ~mean(.)),
         variance_n3 = map_dbl(samples_n3, ~var(.)))

Distribution of sample variance when sample size is 3

First, let us look at the distribution of the sample variances that we computed from data of sample size 3.

df %>%
 ggplot(aes(x = variance_n3))+
  geom_histogram(bins = 100, color = "white")+
  geom_vline(xintercept = 1, color = "red", size = 2)+
  labs(x= "Sample Variance of N(4,1)", title="Sample Size = 3")

This is what the distribution looks like. It is strongly right-skewed: most of the estimates are piled up near zero, with a long tail of large values. About 60% of the time, the sample variance estimated from a sample of size 3 is an underestimate, i.e. less than 1, the true value. We also see outliers where the sample variance is much higher than the true variance. For some samples it is as large as 8, even though the data was generated from a Normal distribution with unit variance.

Skewed Distribution of Sample Variance at small sample size
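
The "about 60%" figure can be checked against theory. For Normal data, (n - 1) * S^2 / sigma^2 follows a chi-squared distribution with n - 1 degrees of freedom, so with n = 3 and unit variance the probability that the sample variance falls below 1 is 1 - exp(-1), roughly 0.63. Here is a minimal sketch in base R; the names `sim_var`, `empirical_below` and `theoretical_below` are just illustrative choices and are not part of the code above.

```r
# Base R sketch: how often does the sample variance underestimate sigma^2?
set.seed(42)
n <- 3         # sample size
sigma2 <- 1    # true variance
n_reps <- 1000 # number of simulated datasets

# Sample variances of n_reps samples of size n drawn from N(4, sigma2)
sim_var <- replicate(n_reps, var(rnorm(n, mean = 4, sd = sqrt(sigma2))))

# Simulated versus theoretical share of estimates below the true variance,
# using (n - 1) * S^2 / sigma2 ~ chi-squared with n - 1 degrees of freedom
empirical_below <- mean(sim_var < sigma2)
theoretical_below <- pchisq(n - 1, df = n - 1)

# The estimator is unbiased (mean near 1) even though its median is below 1
c(empirical_below = empirical_below,
  theoretical_below = theoretical_below,
  mean_of_estimates = mean(sim_var),
  median_of_estimates = median(sim_var))
```

The point of printing the mean and median side by side is that the sample variance is right on average, yet a typical single estimate from three data points is too small.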

Comparing the distribution of sample variance when sample size is 3 with 100

Clearly, our estimates of variance at the small sample size look bad. How about when the sample size is larger? What does the distribution of sample variances look like then? We run the same simulation, but this time we also simulate data with a much bigger sample size, 100 instead of 3.

set.seed(42)
df <- tibble(rep=1:1000) %>% 
  mutate(samples_n3=map(rep,~rnorm(3, mean=4)),
         means_n3=map_dbl(samples_n3, ~mean(.)),
         variance_n3 = map_dbl(samples_n3, ~var(.))) %>%
  mutate(samples_n100 = map(rep,~rnorm(100, mean=4)),
         means_n100 = map_dbl(samples_n100, ~mean(.)),
         variance_n100 = map_dbl(samples_n100, ~var(.))) 
df %>% head()

## # A tibble: 6 × 7
##     rep samples_n3 means_n3 variance_n3 samples_n100 means_n100 variance_n100
##   <int> <list>        <dbl>       <dbl> <list>            <dbl>         <dbl>
## 1     1 <dbl [3]>      4.39       0.937 <dbl [100]>        4.02         0.809
## 2     2 <dbl [3]>      4.31       0.143 <dbl [100]>        3.81         1.00 
## 3     3 <dbl [3]>      5.15       1.22  <dbl [100]>        4.08         1.20 
## 4     4 <dbl [3]>      5.18       1.39  <dbl [100]>        4.04         1.10 
## 5     5 <dbl [3]>      3.40       0.472 <dbl [100]>        3.91         0.897
## 6     6 <dbl [3]>      3.23       2.89  <dbl [100]>        3.95         1.11

Here we simulate data with both small sample sizes and large sample sizes. With that we can compare their distributions of sample variances side-by-side. Let us make histograms of sample variances at sample size 3 and 100.

df %>% 
  select(rep, starts_with("var")) %>%
  pivot_longer(-rep, names_to="vars", values_to="sample_variance") %>% 
  mutate(vars=fct_recode(vars, 
                        `3`="variance_n3",
                        `100`="variance_n100")) %>%
  mutate(vars=fct_relevel(vars, c("3",
                                "100"))) %>%
  ggplot(aes(x=sample_variance, fill=vars))+
  geom_histogram(bins=30, color="white")+
  facet_wrap(~vars, scales="free_x")+
  geom_vline(xintercept = 1, color="black")+
  theme(legend.position = "none")+
  labs(subtitle="Effect of Sample Sizes on Sample Variance")
ggsave("Effect_of_Sample_size_on_Sample_Variance.png")

And we can clearly see the benefit of a large sample size on the distribution of sample variances. When the sample size is 100, the distribution of sample variances is not (that) skewed. It looks roughly symmetric and is tightly concentrated around the true value of 1.

Effect of Sample Size on Sample Variance
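
The difference in spread can also be put into numbers. For Normal data the variance of the sample variance is 2 * sigma^4 / (n - 1), so with unit variance its standard deviation is 1 for n = 3 and about 0.14 for n = 100. The sketch below, in base R with illustrative names (`simulate_var`, `simulated_sd`, `theoretical_sd`) that are assumptions of this example, compares that formula with simulated values.

```r
# Base R sketch: spread of the sample variance at n = 3 versus n = 100
set.seed(42)
n_reps <- 1000

# Sample variances of n_reps samples of size n drawn from N(4, 1)
simulate_var <- function(n, n_reps) {
  replicate(n_reps, var(rnorm(n, mean = 4, sd = 1)))
}

sample_sizes <- c(3, 100)

# Standard deviation of the simulated sample variances at each sample size
simulated_sd <- sapply(sample_sizes, function(n) sd(simulate_var(n, n_reps)))

# Theoretical value for Normal data with unit variance: sqrt(2 / (n - 1))
theoretical_sd <- sqrt(2 / (sample_sizes - 1))

data.frame(sample_size = sample_sizes,
           simulated_sd = simulated_sd,
           theoretical_sd = theoretical_sd)
```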

Instead of histograms, we can make a sina plot with ggforce to see the distribution of sample variances at small and large sample sizes. Now we can see the individual data points and how tight the distribution of sample variances is at sample size 100.

library(ggforce)
df %>% 
  select(rep, starts_with("var")) %>%
  pivot_longer(-rep, names_to="vars", values_to="sample_variance") %>% 
  mutate(vars=fct_recode(vars, 
                        `3`="variance_n3",
                        `100`="variance_n100")) %>%
  mutate(vars=fct_relevel(vars, c("3",
                                "100"))) %>%
  ggplot(aes(y=sample_variance, x=vars, color=vars))+
  #geom_boxplot(outlier.shape = NA)+
  geom_sina(alpha=0.2)+
  #geom_jitter(width=0.1, alpha=0.1)+
  #facet_wrap(~vars, scales="free_x")+
  geom_hline(yintercept = 1, color="black")+
  theme(legend.position = "none")+
  labs(x= "Sample Size", 
       subtitle="Effect of Sample Sizes on Sample Variance")
ggsave("Effect_of_Sample_size_on_Sample_Variance_boxplot.png")

Effect of Sample Size on Sample Variance

Effect of Sample size on Sample Variance

So far, we have seen the behaviour of two extreme cases: sample sizes 3 and 100. The natural question that arises is how the distribution changes as we increase the size of the datasets. Is there a sweet spot where the sample size is large enough for practical purposes? In other words, what is the sample size above which the distribution of sample variance is no longer noticeably skewed?

Let us simulate more datasets, this time with sample sizes n = 3, 6, 10, 30 and 100, to identify a reasonable sample size for approximately symmetric sample variances. Here is how the distribution of sample variance changes across sample sizes.

What is the optimal sample size for reliable sample variance
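
The code for this figure is not shown in the post, so here is one way it could be done, following the same tidyverse pattern as the earlier examples. Treat it as a sketch: the names `sample_sizes`, `df_sizes` and `p_sizes` and the use of crossing() from tidyr are assumptions of this example, not necessarily what produced the plot above.

```r
library(tidyverse)

set.seed(42)
sample_sizes <- c(3, 6, 10, 30, 100)

# One row per (replicate, sample size) combination: 1000 x 5 = 5000 rows
df_sizes <- crossing(rep = 1:1000, sample_size = sample_sizes) %>%
  mutate(samples = map(sample_size, ~rnorm(.x, mean = 4)),
         variance = map_dbl(samples, var),
         # factor keeps the facets in increasing order of sample size
         sample_size = factor(sample_size, levels = sample_sizes))

# One histogram of sample variances per sample size, true variance marked at 1
p_sizes <- df_sizes %>%
  ggplot(aes(x = variance, fill = sample_size)) +
  geom_histogram(bins = 30, color = "white") +
  facet_wrap(~sample_size, scales = "free_x") +
  geom_vline(xintercept = 1, color = "black") +
  theme(legend.position = "none") +
  labs(x = "Sample Variance",
       subtitle = "Effect of Sample Sizes on Sample Variance")
p_sizes
```

Using scales = "free_x" lets each panel zoom in on its own range, which makes the shrinking skew easier to see but means the x-axes are not directly comparable across panels.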

Effect of Sample size on Sample Mean

We just learned that estimating variance is hard and that we need a large sample size to get reliable variance estimates. You might wonder, what about the sample mean? We can use the same data we generated above to compute sample means and see how the sample mean varies with sample size.

Effect of Sample Size on Sample Mean

Immediately we can see that the sample mean doesn’t suffer from the same problem as the sample variance at small sample sizes. The distribution of the sample mean is not skewed at all, and our estimate of the mean improves with sample size as expected.
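
As with the variance figure, the code behind the sample mean plot is not shown. A possible sketch, again with tidyverse, is below. The names `df_means`, `mean_summary` and `p_means` are hypothetical, and the summary table compares the simulated spread with the standard result that the sample mean has standard deviation sigma / sqrt(n).

```r
library(tidyverse)

set.seed(42)
sample_sizes <- c(3, 6, 10, 30, 100)

# Sample mean of one N(4, 1) sample per (replicate, sample size) combination
df_means <- crossing(rep = 1:1000, sample_size = sample_sizes) %>%
  mutate(sample_mean = map_dbl(sample_size, ~mean(rnorm(.x, mean = 4))))

# Center and spread of the sample mean at each sample size,
# next to the theoretical standard deviation 1 / sqrt(n)
mean_summary <- df_means %>%
  group_by(sample_size) %>%
  summarise(average = mean(sample_mean),
            simulated_sd = sd(sample_mean)) %>%
  mutate(theoretical_sd = 1 / sqrt(sample_size))
mean_summary

# Histograms on a shared x-axis, true mean marked at 4
p_means <- df_means %>%
  mutate(sample_size = factor(sample_size)) %>%
  ggplot(aes(x = sample_mean, fill = sample_size)) +
  geom_histogram(bins = 30, color = "white") +
  facet_wrap(~sample_size) +
  geom_vline(xintercept = 4, color = "black") +
  theme(legend.position = "none") +
  labs(x = "Sample Mean",
       subtitle = "Effect of Sample Sizes on Sample Mean")
p_means
```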
