In this post, we will learn how to drop duplicate rows in a Pandas dataframe. We will use Pandas' drop_duplicates() function to delete duplicated rows, with multiple examples.
One of the common data cleaning tasks is deciding how to deal with duplicate rows in a data frame. If a whole row is duplicated exactly, the decision is simple: we can drop the duplicated row from any downstream analysis. Sometimes, though, you may have to make a decision when only part of a row is duplicated.
Let us load Pandas
import pandas as pd
We will use the gapminder dataset from Carpentries. We can see that it has 1704 rows and 6 columns.
data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
print(gapminder.shape)
(1704, 6)
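To get a feel for the data before deduplicating, we can peek at the first few rows with head() (output omitted here for brevity):

# take a quick look at the first few rows and column names
gapminder.head()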
How to Drop/remove Completely Duplicated Rows?
First, let us create a dataframe with duplicated rows. The gapminder data set is a well curated one, so there is no row that is completely duplicated. To illustrate how to drop rows that are duplicated completely, let us concatenate the gapminder dataframe with a copy of itself. After concatenating, each row will appear exactly two times.
We can join two dataframes using Pandas' concat() function. Here we specify axis=0 so that concat() stacks the two dataframes by rows.
# concatenate two dataframes with concat() function in Pandas
gapminder_duplicated = pd.concat([gapminder, gapminder], axis=0)
gapminder_duplicated.shape
(3408, 6)
Our new Pandas dataframe with duplicated rows has twice as many rows as the original gapminder dataframe. Basically, every row in the original data frame is duplicated.
Pandas' drop_duplicates() function can delete duplicated rows. By default, drop_duplicates() removes completely duplicated rows, i.e. rows where every column element is identical.
# remove duplicated rows using drop_duplicates()
gapminder_duplicated.drop_duplicates()
We can verify that we have dropped the duplicate rows by checking the shape of the data frame.
# verify if all duplicated rows are dropped
gapminder_duplicated.drop_duplicates().shape
(1704, 6)
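Note that drop_duplicates() returns a new dataframe and leaves gapminder_duplicated itself unchanged. To keep the deduplicated result, assign it to a variable (a minimal sketch; the variable name gapminder_deduped is ours):

# drop_duplicates() returns a new dataframe; assign it to keep the result
gapminder_deduped = gapminder_duplicated.drop_duplicates()
print(gapminder_deduped.shape)

Alternatively, drop_duplicates() accepts inplace=True to modify the dataframe in place.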
How to Drop/remove Partially Duplicated Rows based on Select Columns?
By default, the drop_duplicates() function uses all the columns to detect if a row is a duplicate or not. Often, you might want to remove rows based on duplicate values in one or more columns. Pandas' drop_duplicates() function has an argument to specify which columns we need to use to identify duplicates.
For example, to remove duplicate rows using the column "continent", we can use the argument "subset" and specify the column name we want to use to identify duplicates.
Let us drop duplicate rows using the original gapminder data frame with the subset argument set to "continent".
# drop duplicates based on value of a column
gapminder.drop_duplicates(subset="continent")
We would expect to get just one row per continent value; by default, drop_duplicates() keeps the first row it sees with a given continent value and drops all other rows as duplicates.
Note that all the country values start with “A”s.
        country  year         pop continent  lifeExp     gdpPercap
0   Afghanistan  1952   8425333.0      Asia   28.801    779.445314
12      Albania  1952   1282697.0    Europe   55.230   1601.056136
24      Algeria  1952   9279525.0    Africa   43.077   2449.008185
48    Argentina  1952  17876956.0  Americas   62.485   5911.315053
60    Australia  1952   8691212.0   Oceania   69.120  10039.595640
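The keep argument of drop_duplicates() defaults to "first", so the call above is equivalent to spelling the default out explicitly (a small sketch):

# keep="first" is the default, so this is equivalent to the call above
gapminder.drop_duplicates(subset="continent", keep="first")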
We can also keep the last occurrence of a column value by using the argument keep="last".
gapminder.drop_duplicates(subset="continent", keep="last")
Here too we would see one row per unique continent value, but now all rows except the last occurrence of each continent are dropped.
             country  year          pop continent  lifeExp     gdpPercap
1103     New Zealand  2007    4115771.0   Oceania   80.204  25185.009110
1607  United Kingdom  2007   60776238.0    Europe   79.425  33203.261280
1643       Venezuela  2007   26084662.0  Americas   73.747  11415.805690
1679      Yemen Rep.  2007   22211743.0      Asia   62.698   2280.769906
1703        Zimbabwe  2007   12311143.0    Africa   43.487    469.709298
Note that all the country values now come from the end of the alphabet.
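Besides "first" and "last", the keep argument also accepts False, which drops every occurrence of a duplicated value. Since each continent appears many times in gapminder, this leaves no rows at all (a quick sketch to illustrate):

# keep=False drops all occurrences of duplicated values
gapminder.drop_duplicates(subset="continent", keep=False).shape
(0, 6)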
We can also use the subset argument with more than one column name. In such cases, the drop_duplicates() function considers a row a duplicate if the values of all the specified columns are identical.
To drop rows that have the same continent and year values, we can use the subset argument with the column names as a list.
gapminder.drop_duplicates(subset=["continent","year"])
Here we have dropped rows with identical continent and year values, keeping the first occurrence of each combination.
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
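As a sanity check, gapminder has 5 continents and 12 distinct years, so we would expect 5 x 12 = 60 unique combinations (a small sketch):

# verify: 5 continents x 12 years = 60 unique combinations
gapminder.drop_duplicates(subset=["continent", "year"]).shape
(60, 6)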
How to Find if One or More Columns are Duplicated in Pandas?
Another common task in data munging is finding out whether a specific column value is duplicated or not. In this case, the goal is not to remove duplicated rows, but to find which rows have duplicate values for a specific column in a data frame.
Pandas duplicated() function:
Pandas has another useful function called duplicated() that tells you whether the values of a column are duplicated or not. We can apply the duplicated() function to an Index, a Series, or a DataFrame.
For example, to find if the continent column values are duplicated or not, we can do
gapminder.continent.duplicated()
This outputs a boolean Series, where a value is True if the same continent value has already appeared earlier in the column.
0       False
1        True
2        True
3        True
4        True
        ...
1699     True
1700     True
1701     True
1702     True
1703     True
Name: continent, Length: 1704, dtype: bool
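The boolean Series can be used to count or filter duplicates; here the count would be 1704 - 5 = 1699, since only the first occurrence of each of the five continents is not flagged (a minimal sketch):

# count how many continent values are flagged as duplicates
gapminder.continent.duplicated().sum()
# keep only the rows whose continent value was seen before
gapminder[gapminder.continent.duplicated()]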
We can also apply the duplicated() function to the dataframe directly and specify which columns we want to check for duplicates with the subset argument, as before. For example, to find which rows have the same continent and year values, we can use
gapminder.duplicated(subset=["continent", "year"])
0       False
1       False
2       False
3       False
4       False
        ...
1699     True
1700     True
1701     True
1702     True
1703     True
Length: 1704, dtype: bool
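Negating this boolean Series with ~ and using it to index the dataframe keeps only the first occurrence of each continent/year pair, which matches what drop_duplicates() gave us above (a minimal sketch; the variable name first_seen is ours):

# True for rows where the continent/year pair appears for the first time
first_seen = ~gapminder.duplicated(subset=["continent", "year"])
gapminder[first_seen].shape
(60, 6)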