2 Ways to Randomly Sample Rows from a large CSV file

In this post, we will be learning how to randomly sample/select rows from a large CSV file that is either taking too long to load as a Pandas dataframe or can’t load at all.

The key idea is to not to load the whole file as a Pandas dataframe. Instead, we use skiprows argument in Pandas read_csv() function to read randomly selected rows.

We will learn two different ways to read randomly sampled rows as a Pandas dataframe. Thanks to Mike Kearny‘s tweet, where he shared an elegant approach to randomly sampling a file by using a probability to select a row or not. This approach is akin to tossing a coin for each row and deciding to select based Head or Tail.

Next we will see how to randomly sample n rows directly while loading the big CSV file. Note this is different from randomly sampling from a data that is already loaded as a dataframe.

Let us get started by loading the modules needed and we will use the Palmer Penguin data stored locally as a CSV file to illustrate randomly selecting rows and reading them as Pandas dataframe.

import random
import string
import pandas as pd

Just for this blog post, let us use just a few columns from Palmer penguins data. Also we save the subsetted data as CSV file to mimic our large CSV file.

import seaborn as sns
df = sns.load_dataset("penguins")
df[['species','island', 'body_mass_g','sex']].to_csv("penguins_subset_data.csv",index=False)

Let us load the csv file just to quickly see its content

filename = "penguins_subset_data.csv"
df = pd.read_csv(filename)

Our file has four columns about 350 rows.

df.head()

species	island	body_mass_g	sex
0	Adelie	Torgersen	3750.0	Male
1	Adelie	Torgersen	3800.0	Female
2	Adelie	Torgersen	3250.0	Female
3	Adelie	Torgersen	NaN	NaN
4	Adelie	Torgersen	3450.0	Female

Randomly sampling rows based on a probability

We can randomly select rows from a big CSV file and load it as Pandas dataframe by using skiprows argument with a probability of selecting a row or not. In this example we use 5% probability to get selected for loading. This means we would be loading 5% of rows randomly as a dataframe.

random.seed(4321)
pd.read_csv(filename, 
           skiprows=lambda x: x > 0 and random.random() >=0.05)

We can quickly see that we have selected 15 rows randomly from 345 rows about 4.3%.



       species	island	body_mass_g	sex
0	Adelie	Torgersen	3625.0	Female
1	Adelie	Biscoe	3500.0	Female
2	Adelie	Dream	3500.0	Female
3	Adelie	Dream	4450.0	Male
4	Adelie	Dream	4100.0	Male
5	Adelie	Dream	3900.0	Male
6	Chinstrap	Dream	3500.0	Female
7	Chinstrap	Dream	3725.0	Male
8	Chinstrap	Dream	2900.0	Female
9	Chinstrap	Dream	3850.0	Female
10	Gentoo	Biscoe	5550.0	Male
11	Gentoo	Biscoe	4350.0	Female
12	Gentoo	Biscoe	4900.0	Female
13	Gentoo	Biscoe	4700.0	Female
14	Gentoo	Biscoe	5250.0	Male

One of the biggest advantage with this method is that we don’t need to know how many rows of data we have.

Randomly sampling n rows

Another approach to randomly sample rows from a big CSV file is to preselect n rows randomly and use skiprows argument to skip the remaining lines. For this we need total number of lines in the big CSV file.

Let us first compute the number of rows in the file and randomly selecting n rows using random.sample() function.


with open(filename, "r") as f:
    # total number of rows in the file
    total_rows = sum(1 for row in f)
    # rows to be skipped
    skip_rows = random.sample(range(1,total_rows+1),total_rows-100)

We can wrap this into a small function and use the rows to be skipped as skiprows argument to Pandas’ read_csv() file.

import sys
def sample_n_from_csv(filename:str, n:int=100, total_rows:int=None) -> pd.DataFrame:
    if total_rows==None:
        with open(filename,"r") as fh:
            total_rows = sum(1 for row in fh)
    if(n>total_rows):
        print("Error: n > total_rows", file=sys.stderr) 
    skip_rows =  random.sample(range(1,total_rows+1), total_rows-n)
    return pd.read_csv(filename, skiprows=skip_rows)
            
    

Here is an example of randomly select 5 rows and loading them as a Pandas dataframe.

import random
sample_n_from_csv(filename, n=5)


	species	island	body_mass_g	sex
0	Adelie	Torgersen	3600.0	Female
1	Adelie	Dream	3400.0	Female
2	Gentoo	Biscoe	4400.0	Female
3	Gentoo	Biscoe	5500.0	Male
4	Gentoo	Biscoe	4925.0	Female