In this post, we will learn how to save a Pandas dataframe as a compressed file, such as a gzip or zip file. Saving a file in compressed form helps when disk space is an issue, which often comes up in data analysis.
Let us load Pandas.
# load pandas
import pandas as pd
First, we will create a toy dataframe from scratch. We create two lists.
education = ["Bachelor's", "Less than Bachelor's", "Master's", "PhD", "Professional"]
salary = [110000, 105000, 126000, 144200, 96000]
Then we use the two lists as input to Pandas' DataFrame() function to create a new dataframe.
# Create dataframe in one step
df = pd.DataFrame({"Education": education, "Salary": salary})
df
              Education  Salary
0            Bachelor's  110000
1  Less than Bachelor's  105000
2              Master's  126000
3                   PhD  144200
4          Professional   96000
Now that we have a Pandas dataframe, we are ready to learn how to save it as a CSV/TSV file.
Pandas' to_csv() function is versatile and can handle a variety of situations when writing a dataframe to a file, including saving it as a compressed file.
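As a quick illustration of that versatility, here is a minimal sketch of two uncompressed variants. It rebuilds the toy dataframe so that it runs on its own, and the variable names csv_text and tsv_text are only illustrative. When no filename is given, to_csv() returns the text instead of writing a file, and sep="\t" switches the output from CSV to TSV.

```python
import pandas as pd

# Rebuild the toy dataframe from the post so this block is self-contained
education = ["Bachelor's", "Less than Bachelor's", "Master's", "PhD", "Professional"]
salary = [110000, 105000, 126000, 144200, 96000]
df = pd.DataFrame({"Education": education, "Salary": salary})

# With no path, to_csv() returns the CSV content as a string
csv_text = df.to_csv(index=False)

# sep="\t" produces tab-separated (TSV) output instead of comma-separated
tsv_text = df.to_csv(index=False, sep="\t")

print(csv_text)
print(tsv_text)
```

The same index and sep arguments can be combined with the compression argument when writing to a file.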
To save a Pandas dataframe as a gzip file, we pass compression="gzip" to the to_csv() function in addition to the filename.
In the example below, we save our dataframe as a gzip-compressed CSV file without the row index.
# write a pandas dataframe to gzipped CSV file
df.to_csv("education_salary.csv.gz", index=False, compression="gzip")
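To check that the gzipped file was written correctly, one option is to read it back with pd.read_csv(), which can decompress gzip files on its own. The sketch below rebuilds the toy dataframe so that it runs by itself, and it writes to a temporary directory, which is an assumption made here to keep the working directory clean.

```python
import os
import tempfile

import pandas as pd

# Rebuild the toy dataframe from the post so this block is self-contained
education = ["Bachelor's", "Less than Bachelor's", "Master's", "PhD", "Professional"]
salary = [110000, 105000, 126000, 144200, 96000]
df = pd.DataFrame({"Education": education, "Salary": salary})

# Write to a temporary directory (an illustrative choice, not from the post)
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "education_salary.csv.gz")
df.to_csv(gz_path, index=False, compression="gzip")

# By default, read_csv infers gzip compression from the ".gz" extension
df_from_gz = pd.read_csv(gz_path)
print(df_from_gz)
```

If the round trip worked, the printed dataframe should have the same two columns and five rows as the original.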
In addition to gzip, we can also compress the file in other formats. For example, to save the dataframe as a zip file, we would pass compression="zip" as one of the arguments to the to_csv() function.
# write a pandas dataframe to zipped CSV file
df.to_csv("education_salary.csv.zip", index=False, compression="zip")
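A zip file is an archive, so the CSV inside it has its own name. As a sketch of one way to control that name, the compression argument can also be given as a dict with a "method" key and, for zip, an "archive_name" key. This assumes a reasonably recent Pandas version (1.0 or later). The temporary directory and the variable names below are illustrative, not from the original example.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Rebuild the toy dataframe from the post so this block is self-contained
education = ["Bachelor's", "Less than Bachelor's", "Master's", "PhD", "Professional"]
salary = [110000, 105000, 126000, 144200, 96000]
df = pd.DataFrame({"Education": education, "Salary": salary})

# Write to a temporary directory (an illustrative choice, not from the post)
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "education_salary.csv.zip")

# Dict form of compression: "archive_name" sets the file name inside the zip
df.to_csv(
    zip_path,
    index=False,
    compression={"method": "zip", "archive_name": "education_salary.csv"},
)

# Inspect the archive with the standard library, then read it back with Pandas
with zipfile.ZipFile(zip_path) as zf:
    names_in_zip = zf.namelist()
df_from_zip = pd.read_csv(zip_path)

print(names_in_zip)
print(df_from_zip)
```

Reading the zip back with pd.read_csv() works here because the archive contains exactly one file.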
This post is part of the series on Byte Size Pandas: Pandas 101, a tutorial covering tips and tricks on using Pandas for data munging and analysis.