The longer you work in data science, the more likely you are to have to work with a really big file with thousands or millions of lines. Trying to load all the data into memory at once will not work, as you will use up all of your RAM and crash your computer. Pandas has a really nice option to load a massive data file and work with it. The solution to working with a massive file is to load it in smaller chunks and analyze the data chunk by chunk.
Let us first load the pandas package.
# load pandas
import pandas as pd
How to analyze a big file in smaller chunks with pandas chunksize?
Let us see an example of loading a big csv file in smaller chunks. We will use the gapminder data as an example, with a chunk size of 500. Here chunk size 500 means we will be reading 500 lines at a time.
# link to gapminder data as csv file
# from software carpentry website
csv_url = 'http://bit.ly/2cLzoxH'
# use chunk size 500
c_size = 500
Let us use pd.read_csv to read the csv file in chunks of 500 lines with the chunksize=500 option. The code below prints the shape of each smaller chunk data frame. Note that the first three chunks are 500 lines each. Pandas is clever enough to know that the last chunk is smaller than 500 and loads only the remaining lines into the data frame, in this case 204 lines.
# load the big file in smaller chunks
for gm_chunk in pd.read_csv(csv_url, chunksize=c_size):
    print(gm_chunk.shape)

(500, 6)
(500, 6)
(500, 6)
(204, 6)
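If the end goal is a single (smaller) data frame rather than just inspecting each chunk, a common pattern is to filter every chunk as it is read and then combine the filtered pieces with pd.concat. The sketch below assumes the gapminder csv has a 'year' column; it is an illustration of that pattern, not part of the original example.

# sketch: filter each chunk as it is read and keep only the filtered rows,
# so the full file is never held in memory at once
filtered_chunks = []
for gm_chunk in pd.read_csv(csv_url, chunksize=c_size):
    # 'year' is an assumed gapminder column, used here only for illustration
    filtered_chunks.append(gm_chunk[gm_chunk['year'] == 2007])
gm_2007 = pd.concat(filtered_chunks, ignore_index=True)
print(gm_2007.shape)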
Let us see another example of reading/loading a big csv file and doing some analysis. Here, with the gapminder data, let us read the CSV file in chunks of 500 lines and compute the number of entries (or rows) per continent in the data set.
Let us use defaultdict from collections to keep a counter of the number of rows per continent.
from collections import defaultdict

# default value of int is 0 with defaultdict
continent_dict = defaultdict(int)
Let us load the big CSV file with chunksize=500 and count the number of continent entries in each smaller chunk using the defaultdict.
for gm_chunk in pd.read_csv(csv_url, chunksize=500):
    for c in gm_chunk['continent']:
        continent_dict[c] += 1
# print the continent_dict
print(continent_dict)

defaultdict(int, {'Africa': 624, 'Americas': 300, 'Asia': 396, 'Europe': 360, 'Oceania': 24})
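As a side note, the same per-continent counts could also be accumulated with pandas itself, by summing value_counts() over the chunks. The snippet below is a sketch of that alternative approach, not what the original code does.

# alternative sketch: accumulate per-continent counts with value_counts
continent_counts = pd.Series(dtype=int)
for gm_chunk in pd.read_csv(csv_url, chunksize=500):
    continent_counts = continent_counts.add(
        gm_chunk['continent'].value_counts(), fill_value=0)
# counts come back as floats because of fill_value; cast back to int
print(continent_counts.astype(int))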