How To Drop Duplicate Rows in Pandas?

In this post, we will learn how to drop duplicate rows in a Pandas dataframe. We will use Pandas’ drop_duplicates() function to delete duplicated rows, with multiple examples.

Pandas drop_duplicates(): remove duplicated rows from a dataframe

One of the common data cleaning tasks is deciding how to deal with duplicate rows in a data frame. If a whole row is duplicated exactly, the decision is simple: we can drop the duplicated row for any downstream analysis. Sometimes, you may have to make a decision when only part of a row is duplicated.

Let us load Pandas

import pandas as pd

We will use the gapminder dataset from Carpentries. We can see that it has 1704 rows and 6 columns.

data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
print(gapminder.shape)
(1704, 6)
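
Let us take a quick peek at the first few rows with Pandas’ head() function to get a feel for the data.

# preview the first few rows of the gapminder dataframe
gapminder.head()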

How to Drop/remove Completely Duplicated Rows?

First, let us create a dataframe with duplicated rows. The gapminder data set is a well-curated one, so there is no row that is completely duplicated. To illustrate how to drop completely duplicated rows, let us concatenate the gapminder dataframe with a copy of itself. After concatenating, every row will appear exactly two times.

We can join the two dataframes using Pandas’ concat() function. Here we specify axis=0 so that concat() stacks the two dataframes by rows.

# concatenate two dataframes with concat() function in Pandas
gapminder_duplicated = pd.concat([gapminder,gapminder],axis=0)
gapminder_duplicated.shape
(3408, 6)

Our new Pandas dataframe with duplicated rows has twice as many rows as the original gapminder dataframe. Basically, every row in the original data frame is duplicated.

Pandas’ drop_duplicates() function can delete duplicated rows. By default, drop_duplicates() removes only completely duplicated rows, i.e. rows where every column element is identical.

# remove duplicated rows using drop_duplicates()
gapminder_duplicated.drop_duplicates()

We can verify that we have dropped the duplicate rows by checking the shape of the data frame.

# verify that all duplicated rows are dropped
gapminder_duplicated.drop_duplicates().shape
(1704, 6)
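
In addition to keeping the first occurrence of each duplicated row (the default), drop_duplicates() accepts the argument keep=False, which drops every copy of a duplicated row, originals included. Since each row in our concatenated dataframe appears exactly twice, this leaves no rows at all.

# drop all copies of duplicated rows with keep=False
gapminder_duplicated.drop_duplicates(keep=False).shape
(0, 6)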

How to Drop/remove Partially Duplicated Rows based on Select Columns?

By default, the drop_duplicates() function uses all the columns to detect whether a row is a duplicate. Often you might want to remove rows based on duplicate values in one or more columns. Pandas’ drop_duplicates() function has an argument to specify which columns to use to identify duplicates.

For example, to remove duplicate rows using the column “continent”, we can use the argument subset and specify the column name we want to use to identify duplicates.

Let us drop duplicate rows from the original gapminder data frame, using the subset argument with “continent”.

# drop duplicates based on value of a column
gapminder.drop_duplicates(subset = "continent")

We expect to get just one row per continent value, since by default drop_duplicates() keeps the first row it sees with each continent value and drops all other rows as duplicates.

Note that all the country values start with “A”s.

        country  year         pop continent  lifeExp     gdpPercap
0   Afghanistan  1952   8425333.0      Asia   28.801    779.445314
12      Albania  1952   1282697.0    Europe   55.230   1601.056136
24      Algeria  1952   9279525.0    Africa   43.077   2449.008185
48    Argentina  1952  17876956.0  Americas   62.485   5911.315053
60    Australia  1952   8691212.0   Oceania   69.120  10039.595640

We can also keep the last occurrence of a column value instead, by using the argument keep="last".

gapminder.drop_duplicates(subset="continent", keep="last")

Here we again see one row for each unique continent value, but this time keeping the last occurrence instead of the first.

             country  year         pop continent  lifeExp     gdpPercap
1103     New Zealand  2007   4115771.0   Oceania   80.204  25185.009110
1607  United Kingdom  2007  60776238.0    Europe   79.425  33203.261280
1643       Venezuela  2007  26084662.0  Americas   73.747  11415.805690
1679      Yemen Rep.  2007  22211743.0      Asia   62.698   2280.769906
1703        Zimbabwe  2007  12311143.0    Africa   43.487    469.709298

Note that all the country values now start with letters from the end of the alphabet.

We can also use the subset argument with more than one column name. In that case, drop_duplicates() will consider a row a duplicate if the values in all the specified columns are identical.

To drop rows that have the same continent and year values, we can use the subset argument with the column names as a list.

gapminder.drop_duplicates(subset=["continent","year"])

Here we have dropped the rows with identical continent and year values.

       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
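
Note from the outputs above that drop_duplicates() keeps the original row labels, so the resulting index can have gaps. If we want a fresh 0-based index instead, we can pass ignore_index=True (available since Pandas 1.0) or chain reset_index().

# drop duplicates and reset the index (Pandas >= 1.0)
gapminder.drop_duplicates(subset=["continent", "year"], ignore_index=True)

# equivalent approach using reset_index()
gapminder.drop_duplicates(subset=["continent", "year"]).reset_index(drop=True)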

How to Find if One or More Columns are Duplicated in Pandas?

Another common task in data munging is finding out whether a specific column value is duplicated or not. In this case the goal is not to remove duplicated rows, but to find which rows have duplicate values for a specific column in a data frame.

Pandas duplicated() function:

Pandas has another useful function called duplicated() that tells you whether the values of a column are duplicated. We can apply the duplicated() function to an Index, a Series or a DataFrame.

For example, to find if the continent column values are duplicated or not, we can do

gapminder.continent.duplicated()

This will output a boolean series

0       False
1        True
2        True
3        True
4        True
        ...  
1699     True
1700     True
1701     True
1702     True
1703     True
Name: continent, Length: 1704, dtype: bool
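
The boolean Series that duplicated() returns can be used as a mask to subset the dataframe. For example, negating the mask with ~ keeps only the first row for each continent, the same result we got earlier with drop_duplicates(subset="continent"). We can also sum the boolean Series to count the duplicates.

# select rows where the continent value occurs for the first time
gapminder[~gapminder.continent.duplicated()]

# count how many rows have a previously seen continent value
gapminder.continent.duplicated().sum()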

We can also apply the duplicated() function to the dataframe directly and specify which columns we want to check for duplicates with the subset argument, as before. For example, to find which rows have the same continent and year values, we can use

gapminder.duplicated(subset=["continent", "year"])
0       False
1       False
2       False
3       False
4       False
        ...  
1699     True
1700     True
1701     True
1702     True
1703     True
Length: 1704, dtype: bool
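
As before, we can use this boolean Series as a mask. Using it directly, without negating, pulls out the rows that are duplicates with respect to continent and year.

# select the rows flagged as duplicates for continent and year
gapminder[gapminder.duplicated(subset=["continent", "year"])]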