In this post, we will be learning how to randomly sample/select rows from a large CSV file that is either taking too long to load as a Pandas dataframe or can’t load at all.
The key idea is to not to load the whole file as a Pandas dataframe. Instead, we use skiprows argument in Pandas read_csv() function to read randomly selected rows.
We will learn two different ways to read randomly sampled rows as a Pandas dataframe. Thanks to Mike Kearny‘s tweet, where he shared an elegant approach to randomly sampling a file by using a probability to select a row or not. This approach is akin to tossing a coin for each row and deciding to select based Head or Tail.
Next we will see how to randomly sample n rows directly while loading the big CSV file. Note this is different from randomly sampling from a data that is already loaded as a dataframe.
Is a large CSV file taking too long to #pd_read_csv() into #python? Here are two different ways to take a random sample from it. pic.twitter.com/QDHGBof3QY
— Mike Kearney? (@kearneymw) January 14, 2022
Let us get started by loading the modules needed and we will use the Palmer Penguin data stored locally as a CSV file to illustrate randomly selecting rows and reading them as Pandas dataframe.
import random import string import pandas as pd
Just for this blog post, let us use just a few columns from Palmer penguins data. Also we save the subsetted data as CSV file to mimic our large CSV file.
import seaborn as sns df = sns.load_dataset("penguins") df[['species','island', 'body_mass_g','sex']].to_csv("penguins_subset_data.csv",index=False)
Let us load the csv file just to quickly see its content
filename = "penguins_subset_data.csv" df = pd.read_csv(filename)
Our file has four columns about 350 rows.
df.head() species island body_mass_g sex 0 Adelie Torgersen 3750.0 Male 1 Adelie Torgersen 3800.0 Female 2 Adelie Torgersen 3250.0 Female 3 Adelie Torgersen NaN NaN 4 Adelie Torgersen 3450.0 Female
Randomly sampling rows based on a probability
We can randomly select rows from a big CSV file and load it as Pandas dataframe by using skiprows argument with a probability of selecting a row or not. In this example we use 5% probability to get selected for loading. This means we would be loading 5% of rows randomly as a dataframe.
random.seed(4321) pd.read_csv(filename, skiprows=lambda x: x > 0 and random.random() >=0.05)
We can quickly see that we have selected 15 rows randomly from 345 rows about 4.3%.
species island body_mass_g sex 0 Adelie Torgersen 3625.0 Female 1 Adelie Biscoe 3500.0 Female 2 Adelie Dream 3500.0 Female 3 Adelie Dream 4450.0 Male 4 Adelie Dream 4100.0 Male 5 Adelie Dream 3900.0 Male 6 Chinstrap Dream 3500.0 Female 7 Chinstrap Dream 3725.0 Male 8 Chinstrap Dream 2900.0 Female 9 Chinstrap Dream 3850.0 Female 10 Gentoo Biscoe 5550.0 Male 11 Gentoo Biscoe 4350.0 Female 12 Gentoo Biscoe 4900.0 Female 13 Gentoo Biscoe 4700.0 Female 14 Gentoo Biscoe 5250.0 Male
One of the biggest advantage with this method is that we don’t need to know how many rows of data we have.
Randomly sampling n rows
Another approach to randomly sample rows from a big CSV file is to preselect n rows randomly and use skiprows argument to skip the remaining lines. For this we need total number of lines in the big CSV file.
Let us first compute the number of rows in the file and randomly selecting n rows using random.sample() function.
with open(filename, "r") as f: # total number of rows in the file total_rows = sum(1 for row in f) # rows to be skipped skip_rows = random.sample(range(1,total_rows+1),total_rows-100)
We can wrap this into a small function and use the rows to be skipped as skiprows argument to Pandas’ read_csv() file.
import sys def sample_n_from_csv(filename:str, n:int=100, total_rows:int=None) -> pd.DataFrame: if total_rows==None: with open(filename,"r") as fh: total_rows = sum(1 for row in fh) if(n>total_rows): print("Error: n > total_rows", file=sys.stderr) skip_rows = random.sample(range(1,total_rows+1), total_rows-n) return pd.read_csv(filename, skiprows=skip_rows)
Here is an example of randomly select 5 rows and loading them as a Pandas dataframe.
import random sample_n_from_csv(filename, n=5) species island body_mass_g sex 0 Adelie Torgersen 3600.0 Female 1 Adelie Dream 3400.0 Female 2 Gentoo Biscoe 4400.0 Female 3 Gentoo Biscoe 5500.0 Male 4 Gentoo Biscoe 4925.0 Female