• Skip to secondary menu
  • Skip to main content
  • Skip to primary sidebar

Python and R Tips

Learn Data Science with Python and R

  • Home
  • Python
  • Pandas
    • Pandas 101
  • tidyverse
    • tidyverse 101
  • R
  • Linux
  • Conferences
  • Python Books
  • About
    • Privacy Policy
You are here: Home / Pandas 101 / 2 Ways to Randomly Sample Rows from a large CSV file

2 Ways to Randomly Sample Rows from a large CSV file

July 10, 2022 by cmdlinetips

In this post, we will be learning how to randomly sample/select rows from a large CSV file that is either taking too long to load as a Pandas dataframe or can’t load at all.

The key idea is to not to load the whole file as a Pandas dataframe. Instead, we use skiprows argument in Pandas read_csv() function to read randomly selected rows.

We will learn two different ways to read randomly sampled rows as a Pandas dataframe. Thanks to Mike Kearny‘s tweet, where he shared an elegant approach to randomly sampling a file by using a probability to select a row or not. This approach is akin to tossing a coin for each row and deciding to select based Head or Tail.

Next we will see how to randomly sample n rows directly while loading the big CSV file. Note this is different from randomly sampling from a data that is already loaded as a dataframe.

Is a large CSV file taking too long to #pd_read_csv() into #python? Here are two different ways to take a random sample from it. pic.twitter.com/QDHGBof3QY

— Mike Kearney? (@kearneymw) January 14, 2022

Let us get started by loading the modules needed and we will use the Palmer Penguin data stored locally as a CSV file to illustrate randomly selecting rows and reading them as Pandas dataframe.

import random
import string
import pandas as pd

Just for this blog post, let us use just a few columns from Palmer penguins data. Also we save the subsetted data as CSV file to mimic our large CSV file.

import seaborn as sns
df = sns.load_dataset("penguins")
df[['species','island', 'body_mass_g','sex']].to_csv("penguins_subset_data.csv",index=False)

Let us load the csv file just to quickly see its content

filename = "penguins_subset_data.csv"
df = pd.read_csv(filename)

Our file has four columns about 350 rows.

df.head()

species	island	body_mass_g	sex
0	Adelie	Torgersen	3750.0	Male
1	Adelie	Torgersen	3800.0	Female
2	Adelie	Torgersen	3250.0	Female
3	Adelie	Torgersen	NaN	NaN
4	Adelie	Torgersen	3450.0	Female

Randomly sampling rows based on a probability

We can randomly select rows from a big CSV file and load it as Pandas dataframe by using skiprows argument with a probability of selecting a row or not. In this example we use 5% probability to get selected for loading. This means we would be loading 5% of rows randomly as a dataframe.

random.seed(4321)
pd.read_csv(filename, 
           skiprows=lambda x: x > 0 and random.random() >=0.05)

We can quickly see that we have selected 15 rows randomly from 345 rows about 4.3%.



       species	island	body_mass_g	sex
0	Adelie	Torgersen	3625.0	Female
1	Adelie	Biscoe	3500.0	Female
2	Adelie	Dream	3500.0	Female
3	Adelie	Dream	4450.0	Male
4	Adelie	Dream	4100.0	Male
5	Adelie	Dream	3900.0	Male
6	Chinstrap	Dream	3500.0	Female
7	Chinstrap	Dream	3725.0	Male
8	Chinstrap	Dream	2900.0	Female
9	Chinstrap	Dream	3850.0	Female
10	Gentoo	Biscoe	5550.0	Male
11	Gentoo	Biscoe	4350.0	Female
12	Gentoo	Biscoe	4900.0	Female
13	Gentoo	Biscoe	4700.0	Female
14	Gentoo	Biscoe	5250.0	Male

One of the biggest advantage with this method is that we don’t need to know how many rows of data we have.

Randomly sampling n rows

Another approach to randomly sample rows from a big CSV file is to preselect n rows randomly and use skiprows argument to skip the remaining lines. For this we need total number of lines in the big CSV file.

Let us first compute the number of rows in the file and randomly selecting n rows using random.sample() function.


with open(filename, "r") as f:
    # total number of rows in the file
    total_rows = sum(1 for row in f)
    # rows to be skipped
    skip_rows = random.sample(range(1,total_rows+1),total_rows-100)

We can wrap this into a small function and use the rows to be skipped as skiprows argument to Pandas’ read_csv() file.

import sys
def sample_n_from_csv(filename:str, n:int=100, total_rows:int=None) -> pd.DataFrame:
    if total_rows==None:
        with open(filename,"r") as fh:
            total_rows = sum(1 for row in fh)
    if(n>total_rows):
        print("Error: n > total_rows", file=sys.stderr) 
    skip_rows =  random.sample(range(1,total_rows+1), total_rows-n)
    return pd.read_csv(filename, skiprows=skip_rows)
            
    

Here is an example of randomly select 5 rows and loading them as a Pandas dataframe.

import random
sample_n_from_csv(filename, n=5)


	species	island	body_mass_g	sex
0	Adelie	Torgersen	3600.0	Female
1	Adelie	Dream	3400.0	Female
2	Gentoo	Biscoe	4400.0	Female
3	Gentoo	Biscoe	5500.0	Male
4	Gentoo	Biscoe	4925.0	Female

Share this:

  • Click to share on Facebook (Opens in new window) Facebook
  • Click to share on X (Opens in new window) X

Related posts:

Default ThumbnailHow to Select Columns/Rows by substring match in Pandas Default Thumbnail13 Tips to Randomly Select Rows with tidyverse Default ThumbnailHow to randomly sample letters in Python List of plots using map() in purrrHow to create a list of plot objects and save them as files

Filed Under: Pandas 101, Python Tips Tagged With: Python sample a large CSV file

Primary Sidebar

Subscribe to Python and R Tips and Learn Data Science

Learn Pandas in Python and Tidyverse in R

Tags

Altair Basic NumPy Book Review Data Science Data Science Books Data Science Resources Data Science Roundup Data Visualization Dimensionality Reduction Dropbox Dropbox Free Space Dropbox Tips Emacs Emacs Tips ggplot2 Linux Commands Linux Tips Mac Os X Tips Maximum Likelihood Estimation in R MLE in R NumPy Pandas Pandas 101 Pandas Dataframe Pandas Data Frame pandas groupby() Pandas select columns Pandas select_dtypes Python Python 3 Python Boxplot Python Tips R rstats R Tips Seaborn Seaborn Boxplot Seaborn Catplot Shell Scripting Sparse Matrix in Python tidy evaluation tidyverse tidyverse 101 Vim Vim Tips

RSS RSS

  • How to convert row names to a column in Pandas
  • How to resize an image with PyTorch
  • Fashion-MNIST data from PyTorch
  • Pandas case_when() with multiple examples
  • An Introduction to Statistical Learning: with Applications in Python Is Here
  • 10 Tips to customize ggplot2 title text
  • 8 Plot types with Matplotlib in Python
  • PCA on S&P 500 Stock Return Data
  • Linear Regression with Matrix Decomposition Methods
  • Numpy’s random choice() function

Copyright © 2025 · Lifestyle Pro on Genesis Framework · WordPress · Log in

Go to mobile version