Data cleaning is one of the most common and important tasks of any data analysis.
In typical data analysis setting, we would might get our dataset from excel/csv/tsv file and perform a series of operations to make the data cleaner. For example, we would start with cleaning the names of variables to make it consistent, rename some column names, filter our empty rows or empty columns, select one or columns and create new columns in the data. In Python, Pandas has a numerous functionalities, like rename(), filter(), and query(), which enables us to cleanup the data before applying Machine Learning algorithms.
Recently, a new Python package pyjanitor, inspired by R package janitor, has made some of the data cleaning tasks really easier. You can think of pyjanitor as an extension package for Pandas, that lets you work with Pandas dataframe with new data cleaning functionalities that work with chaining multiple functions, but with function names that are verbs describing the action we do.
Pytjanitor’s github page nicely explains its goals
pyjanitor’s etymology has a two-fold relationship to “cleanliness”. Firstly, it’s about extending Pandas with convenient data cleaning routines. Secondly, it’s about providing a cleaner, method-chaining, verb-based API for common pandas routines.
In this post we will see how to get started with using pyjanitor for most common data cleaning steps. We will use a toy dataset to learn the functionalities of data cleaning in pyjanitor.
import pandas as pd import numpy as np
First, let us make sure we have pyjanitor installed. One can install pyjanitor using conda package manager
conda install pyjanitor -c conda-forge
Let us import pyjanitor and check its version
import pyjanitor pyjanitor.__version__ '0.20.10'
Let us create a toy dataframe from scratch using dictionary containing column names as keys and the column values as lists.
stocks = {"CompanyName":["Roku","Google",pd.NA], "DATE":["20202912","20202912",pd.NA], "STOCK Price":["300","1700",pd.NA], "DIvidend":[pd.NA,pd.NA,pd.NA]}
We can convert the dictionary into Pandas dataframe with from_dict() function in Pandas.
stocks_df = pd.DataFrame.from_dict(stocks) stocks_df
Note that our toy dataframe has some of the common problems that we typically need to clean up before data analysis. For example, a column name has two works in camel case style, another one has two words with space bwetween them another one is all upper case and another one is with random mixed case. And it also has a column that is empty and a row that is empty.
CompanyName DATE STOCK Price DIvidend 0 Roku 20202912 300 <NA> 1 Google 20202912 1700 <NA> 2 <NA> <NA> <NA> <NA>
Let us see how can we clean up this toy dataframe using pyjanitor’s functionalites.
Cleaning up column names with clean_names() in pyjanitor
We can use pyjanitor’s clean_names() function to clean up the column names of a Pandas dataframe. In our example, we can see that, pyjanitor’s clean_names() has converted all names to lower case, the column name with space between two words is spearated by underscore. And the column name is all upper case is lower case now and also the one camel case is single word now.
stocks_df.clean_names() companyname date stock_price dividend 0 Roku 20202912 300 <NA> 1 Google 20202912 1700 <NA> 2 <NA> <NA> <NA> <NA>
Removing empty columns and rows with remove_empty() in pyjanitor
One of the common challenges while using data from Excel or the data created manually is that you might often find columns and rows that are completely empty. Our toy dataset contains a row and column that is completely empty. We can use pyjanitor’s remove_empty() function to remove the empty row and column easily. We can also chain with another function to remove the empty rows/columns.
In the example below, we first clean up names and use chain operation to remove the empty row and column. And we use parenthesis notation to chain multiple functions in multiple lines.
(stocks_df .clean_names() .remove_empty()) companyname date stock_price 0 Roku 20202912 300 1 Google 20202912 1700
Renaming column with rename_column() in pyjanitor
We can rename columns in the dataframe with pyjanitor’s rename_column() function. Here we rename the column “companyname” to “company”.
(stocks_df .clean_names() .remove_empty() .rename_column('companyname',"company"))
company date stock_price 0 Roku 20202912 300 1 Google 20202912 1700
Add new column with add_column() in pyjanitor
We can also add new columns to the dataframe with pyjanitor’s add_column() function. Here we add a nee column “size” by providing the column values as a list.
(stocks_df .clean_names() .remove_empty() .rename_column('companyname',"company") .add_column("size",[1000,40000]))
companyname date stock_price size 0 Roku 20202912 300 1000 1 Google 20202912 1700 40000
Chain Pandas function and pyjanitor functions
So far we saw some of the functionalities of pyjanitor and showed how we can chain multiple functions. Since pyjanitor is an extension to Pandas, we can also combine pyjanitor’s functions with Pandas’ functions.
In this example below, we use pandas’ to_datetime() function to convert the date in string format to datetime format.
(stocks_df .clean_names() .remove_empty() .rename_column('companyname',"company") .add_column("size",[1000,40000]) .to_datetime('date',format='%Y%d%m'))
company date stock_price size 0 Roku 2020-12-29 300 1000 1 Google 2020-12-29 1700 40000
We can save the cleaned data
stocks_clean = (stocks_df .clean_names() .remove_empty() .rename_column('companyname',"company") .add_column("size",[1000,40000]) .to_datetime('date',format='%Y%d%m'))
and check the data types
stocks_clean.dtypes company object date datetime64[ns] stock_price object size int64 dtype: object
pyjanitor started as a wrapper to R’s janitor package and slowly acquired new functionalities including data reshaping capabilities like tidyr’s pivot_longer(). Tune in for a post on how to use pyjanitor’s pivot_longer() to reshape wide data to tody data form soon.