Introduction to Data Cleaning with Pyjanitor

Data cleaning is one of the most common and important tasks of any data analysis.

In typical data analysis setting, we would might get our dataset from excel/csv/tsv file and perform a series of operations to make the data cleaner. For example, we would start with cleaning the names of variables to make it consistent, rename some column names, filter our empty rows or empty columns, select one or columns and create new columns in the data. In Python, Pandas has a numerous functionalities, like rename(), filter(), and query(), which enables us to cleanup the data before applying Machine Learning algorithms.

Data cleaning with pyjanitor

Recently, a new Python package pyjanitor, inspired by R package janitor, has made some of the data cleaning tasks really easier. You can think of pyjanitor as an extension package for Pandas, that lets you work with Pandas dataframe with new data cleaning functionalities that work with chaining multiple functions, but with function names that are verbs describing the action we do.

Pytjanitor’s github page nicely explains its goals

pyjanitor’s etymology has a two-fold relationship to “cleanliness”. Firstly, it’s about extending Pandas with convenient data cleaning routines. Secondly, it’s about providing a cleaner, method-chaining, verb-based API for common pandas routines.

In this post we will see how to get started with using pyjanitor for most common data cleaning steps. We will use a toy dataset to learn the functionalities of data cleaning in pyjanitor.

import pandas as pd
import numpy as np

First, let us make sure we have pyjanitor installed. One can install pyjanitor using conda package manager

conda install pyjanitor -c conda-forge

Let us import pyjanitor and check its version

import pyjanitor
pyjanitor.__version__
'0.20.10'

Let us create a toy dataframe from scratch using dictionary containing column names as keys and the column values as lists.

stocks = {"CompanyName":["Roku","Google",pd.NA],
         "DATE":["20202912","20202912",pd.NA],
         "STOCK Price":["300","1700",pd.NA],
          "DIvidend":[pd.NA,pd.NA,pd.NA]}

We can convert the dictionary into Pandas dataframe with from_dict() function in Pandas.

stocks_df = pd.DataFrame.from_dict(stocks)
stocks_df

Note that our toy dataframe has some of the common problems that we typically need to clean up before data analysis. For example, a column name has two works in camel case style, another one has two words with space bwetween them another one is all upper case and another one is with random mixed case. And it also has a column that is empty and a row that is empty.

	CompanyName	DATE	STOCK Price	DIvidend
0	Roku	20202912	300	<NA>
1	Google	20202912	1700	<NA>
2	<NA>	<NA>	<NA>	<NA>

Let us see how can we clean up this toy dataframe using pyjanitor’s functionalites.

Cleaning up column names with clean_names() in pyjanitor

We can use pyjanitor’s clean_names() function to clean up the column names of a Pandas dataframe. In our example, we can see that, pyjanitor’s clean_names() has converted all names to lower case, the column name with space between two words is spearated by underscore. And the column name is all upper case is lower case now and also the one camel case is single word now.

stocks_df.clean_names()

	companyname	date	stock_price	dividend
0	Roku	20202912	300	<NA>
1	Google	20202912	1700	<NA>
2	<NA>	<NA>	<NA>	<NA>

Removing empty columns and rows with remove_empty() in pyjanitor

One of the common challenges while using data from Excel or the data created manually is that you might often find columns and rows that are completely empty. Our toy dataset contains a row and column that is completely empty. We can use pyjanitor’s remove_empty() function to remove the empty row and column easily. We can also chain with another function to remove the empty rows/columns.

In the example below, we first clean up names and use chain operation to remove the empty row and column. And we use parenthesis notation to chain multiple functions in multiple lines.

(stocks_df
 .clean_names()
 .remove_empty())

  companyname	date	stock_price
0	Roku	20202912	300
1	Google	20202912	1700

Renaming column with rename_column() in pyjanitor

We can rename columns in the dataframe with pyjanitor’s rename_column() function. Here we rename the column “companyname” to “company”.

(stocks_df
 .clean_names()
 .remove_empty()
 .rename_column('companyname',"company"))
company	date	stock_price
0	Roku	20202912	300
1	Google	20202912	1700

Add new column with add_column() in pyjanitor

We can also add new columns to the dataframe with pyjanitor’s add_column() function. Here we add a nee column “size” by providing the column values as a list.

(stocks_df
 .clean_names()
 .remove_empty()
 .rename_column('companyname',"company")
 .add_column("size",[1000,40000]))
companyname	date	stock_price	size
0	Roku	20202912	300	1000
1	Google	20202912	1700	40000

Chain Pandas function and pyjanitor functions

So far we saw some of the functionalities of pyjanitor and showed how we can chain multiple functions. Since pyjanitor is an extension to Pandas, we can also combine pyjanitor’s functions with Pandas’ functions.

In this example below, we use pandas’ to_datetime() function to convert the date in string format to datetime format.

(stocks_df
 .clean_names()
 .remove_empty()
 .rename_column('companyname',"company")
 .add_column("size",[1000,40000])
 .to_datetime('date',format='%Y%d%m'))
	company	date	stock_price	size
0	Roku	2020-12-29	300	1000
1	Google	2020-12-29	1700	40000

We can save the cleaned data

stocks_clean = (stocks_df
 .clean_names()
 .remove_empty()
 .rename_column('companyname',"company")
 .add_column("size",[1000,40000])
 .to_datetime('date',format='%Y%d%m'))

and check the data types

stocks_clean.dtypes

company                object
date           datetime64[ns]
stock_price            object
size                    int64
dtype: object

pyjanitor started as a wrapper to R’s janitor package and slowly acquired new functionalities including data reshaping capabilities like tidyr’s pivot_longer(). Tune in for a post on how to use pyjanitor’s pivot_longer() to reshape wide data to tody data form soon.