Python and R Tips

Learn Data Science with Python and R


How to Load a Massive File as small chunks in Pandas?

January 22, 2018 by cmdlinetips

The longer you work in data science, the higher the chance that you will have to work with a really big file with thousands or millions of lines. Trying to load all the data into memory at once will not work, as you will end up using all of your RAM and may crash your computer. Pandas has a really nice option to load a massive data file in smaller pieces. The solution to working with a massive file is to load it in smaller chunks and analyze each of the smaller chunks.

Let us first load the pandas package.

# load pandas 
import pandas as pd

How to analyze a big file in smaller chunks with pandas chunksize?

Let us see an example of loading a big csv file in smaller chunks. We will use the gapminder data as an example, with a chunk size of 500. Here the chunk size 500 means we will be reading 500 lines at a time.

# link to gapminder data as csv file
# from software carpentry website
csv_url = 'http://bit.ly/2cLzoxH'
# use chunk size 500
c_size = 500

Let us use pd.read_csv to read the csv file in chunks of 500 lines, with the chunksize=500 option. The code below prints the shape of each smaller chunk data frame. Note that the first three chunks are of 500 lines each. Pandas is clever enough to know that the last chunk is smaller than 500 and load only the remaining lines into the data frame, in this case 204 lines.

# load the big file in smaller chunks
for gm_chunk in pd.read_csv(csv_url,chunksize=c_size):
    print(gm_chunk.shape)
(500, 6)
(500, 6)
(500, 6)
(204, 6)
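A common next step is to filter each chunk down to the rows you need and stitch the filtered pieces back together with pd.concat(). Here is a minimal self-contained sketch of that pattern, using a tiny in-memory CSV (via io.StringIO) as a stand-in for the gapminder file:

```python
import io
import pandas as pd

# a tiny in-memory CSV standing in for the gapminder data
csv_data = io.StringIO(
    "country,continent,year,lifeExp,pop,gdpPercap\n"
    "Afghanistan,Asia,2007,43.8,31889923,974.6\n"
    "Albania,Europe,2007,76.4,3600523,5937.0\n"
    "Algeria,Africa,2007,72.3,33333216,6223.4\n"
    "Australia,Oceania,2007,81.2,20434176,34435.4\n"
)

# read 2 lines at a time and keep only the rows we need from each chunk
filtered_chunks = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    filtered_chunks.append(chunk[chunk["lifeExp"] > 70])

# stitch the filtered pieces back into one small data frame
df_small = pd.concat(filtered_chunks, ignore_index=True)
print(df_small.shape)  # (3, 6)
```

This way only one chunk plus the filtered rows live in memory at a time, instead of the whole file.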

Let us see another example of reading/loading a big csv file and doing some analysis. Here, with the gapminder data, let us read the CSV file in chunks of 500 lines and compute the number of entries (or rows) for each continent in the data set.

Let us use defaultdict from collections to keep a count of the number of rows per continent.

from collections import defaultdict
# default value of int is 0 with defaultdict
continent_dict = defaultdict(int) 
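As a quick aside, here is what defaultdict(int) buys us: a missing key starts at 0, so we can increment counts without first checking whether the key exists. (The keys below are just illustrative.)

```python
from collections import defaultdict

counts = defaultdict(int)
# a missing key defaults to 0, so incrementing never raises KeyError
counts["Asia"] += 1
counts["Asia"] += 1
counts["Africa"] += 1
print(dict(counts))  # {'Asia': 2, 'Africa': 1}
```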

Let us load the big CSV file with chunksize=500 and count the number of continent entries in each smaller chunk using the defaultdict.

for gm_chunk in pd.read_csv(csv_url,chunksize=500):
    for c in gm_chunk['continent']:
        continent_dict[c] += 1
# print the continent_dict 
print(continent_dict)
defaultdict(int,
            {'Africa': 624,
             'Americas': 300,
             'Asia': 396,
             'Europe': 360,
             'Oceania': 24})
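If you prefer to stay within pandas, the same per-continent counts can be accumulated by adding up value_counts() from each chunk with Series.add(). A minimal self-contained sketch, again using a tiny in-memory CSV in place of the gapminder file:

```python
import io
import pandas as pd

# a tiny in-memory CSV standing in for the gapminder data
csv_data = io.StringIO(
    "country,continent\n"
    "Afghanistan,Asia\n"
    "Albania,Europe\n"
    "Algeria,Africa\n"
    "Angola,Africa\n"
    "Australia,Oceania\n"
)

# accumulate value_counts() from each chunk; fill_value=0 handles
# continents missing from a given chunk
counts = pd.Series(dtype=int)
for chunk in pd.read_csv(csv_data, chunksize=2):
    counts = counts.add(chunk["continent"].value_counts(), fill_value=0)

print(counts.astype(int).sort_index())
```

Series.add() with fill_value=0 treats a continent absent from one side as a zero count, so the running total stays correct across chunks.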

