One of the most common data science tasks – data munge/data cleaning, is to combine data from multiple sources. When you have the data in tabular forms, Python Pandas offers great functions to merge/join data from multiple data frames. Here we will see example scenarios of common merging operations with simple toy data frames.
Let us first load pandas and create simple data frames.
import pandas as pd
Let us create three data frames with common column name. We will use the unique column name to merge the dataframes later.
The first dataframe contains customer ID and the purchased device information.
# create data dictionary >data_1 = {'Customer_ID': ['1', '2', '3', '4'], 'purchased_device': ['iPad', 'MacBook Air', 'Fire HD', 'iPhone 8']} # create pandas dataframe from dictionary >df_1 = pd.DataFrame(data_1, columns = ['Customer_ID', 'purchased_device']) # print dataframe >print(df_1) Customer_ID purchased_device 0 1 iPad 1 2 MacBook Air 2 3 Fire HD 3 4 iPhone 8
Let us create the second data frame. The second data frame contains customer ID and the purchased books on R and Data Science. Here, the customer ID column refers to the same person in the first data frame.
# create data dictionary data_2 = {'Customer_ID': ['1', '3', '5'], 'purchased_book': ['R for Data Science', 'Text Mining with R', 'Advanced R']} # create pandas dataframe from dictionary df_2 = pd.DataFrame(data_2, columns = ['Customer_ID', 'purchased_book']) # print dataframe print(df_2) Customer_ID purchased_book 0 1 R for Data Science 1 3 Text Mining with R 2 5 Advanced R
Let us create third Pandas data frame. The third data frame contains cusytomer ID and the purchased books on Python and Data Science.
# create data dictionary data_3 = {'Customer_ID': ['1', '2', '6'], 'purchased_book': ['Python Data Science Handbook', 'Python for Data Analysis', 'Python Data Science Handbook']} # create pandas dataframe from dictionary df_3 = pd.DataFrame(data_3, columns = ['Customer_ID', 'purchased_book']) # print dataframe print(df_3) Customer_ID purchased_book 0 1 Python Data Science Handbook 1 2 Python for Data Analysis 2 6 Python Data Science Handbook
How to Merge Two Data Frames in Pandas?
Two data frames can be merged in many ways depending on what we need in the merged data frame.
Pandas’ merge function has numerous options to help us merge two data frames. By default, merge performs inner join operation on a common variable/column to merge two data frames.
The simplest way to merge two data frames is to use merge function on first data frame and with the second data frame as argument.
>df_1.merge(df_2) Customer_ID purchased_device purchased_book 0 1 iPad R for Data Science 1 3 Fire HD Text Mining with R
Pandas’ merge function can automatically detect which columns are common between the data frames and use the common column to merge the two data frames. The new merged data frame has the just two items that are common to both the data frame.
Inner Merge Two Data Frames in Pandas
By default, Pandas Merge function does inner join. When using inner join, only the rows corresponding common customer_id, present in both the data frames, are kept.
We can see that, in merged data frame, only the rows corresponding to intersection of Customer_ID are present, i.e. the customer IDs 1 and 3.
We can specify how to merge two data frames using the “how” argument. Here we want to perform “inner” join, so we specify how=”inner”.
df_1.merge(df_2, how ='inner') Customer_ID purchased_device purchased_book 0 1 iPad R for Data Science 1 3 Fire HD Text Mining with R
Outer Merge Two Data Frames in Pandas
Another way to merge two data frames is to keep all the data in the two data frames. Pandas’ outer join keeps all the Customer_ID present in both data frames, union of Customer_ID in both the data frames. If any of the data frame is missing an ID, outer join gives NA value for the corresponding row.
For example, here the second data frame did not have any data for Customer_ID =2. Still, the outer join kept Customer_ID =2, but has NaN for column “puchased_book”.
# outer merge >df_1.merge(df_2, how='outer') Customer_ID purchased_device purchased_book 0 1 iPad R for Data Science 1 2 MacBook Air NaN 2 3 Fire HD Text Mining with R 3 4 iPhone 8 NaN 4 5 NaN Advanced R
Left Merge Two Data Frames in Pandas?
Sometimes you may want to keep all the for IDs present in one data frame irrespective whether the second data frame has any data or not. Pandas Merge’s left join keep all the IDs present in the first data frame.
# left merge >df_1.merge(df_2, how='left') Customer_ID purchased_device purchased_book 0 1 iPad R for Data Science 1 2 MacBook Air NaN 2 3 Fire HD Text Mining with R 3 4 iPhone 8 NaN
We can see that the above left join example has kept all Customer_ID present in the first data frame.
Right Merge Two Data Frames in Pandas
Pandas also offer right join, where the merged data frame contains all rows from the second data frame.
# right merge >df_1.merge(df_2, how='right') Customer_ID purchased_device purchased_book 0 1 iPad R for Data Science 1 3 Fire HD Text Mining with R 2 5 NaN Advanced R
We can see that the merged data frame has all rows from the second data frame.