Sometimes, while doing data wrangling, we might need to get a quick look at the top rows with the largest or smallest values in a column. This kind of quick glance at the data reveal interesting information in a dataframe. Pandas dataframe easily enables one to have a quick look at the top rows either with largest or smallest values in a column.
Pandas library has function called nlargest makes it really easy to look at the top or bottom rows. Let us first load Pandas library.
import pandas as pd
Let us use gapminder data. Let us load the data from Carpentry’s github page and look at the data corresponding to the year 2007 alone.
# Carpentry url containing data data_url = 'http://bit.ly/2cLzoxH' # Load the data from Carpentry url gapminder = pd.read_csv(data_url) # filter the data to contain just year=2007 gapminder_2007 = gapminder[gapminder.year==2007]
Pandas nlargest function can take the number of rows we need as argument and the column name for which we are looking for largest values. Pandas nlargest function
Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.
Let us look at the top 3 rows of the dataframe with the largest population values using the column variable “pop”.
gapminder_2007.nlargest(3,'pop')
We just get two rows and see that China, India, and US are the top 3 countries with largest population.
country year pop continent lifeExp gdpPercap 299 China 2007 1.318683e+09 Asia 72.961 4959.114854 707 India 2007 1.110396e+09 Asia 64.698 2452.210407 1619 United States 2007 3.011399e+08 Americas 78.242 42951.653090
The function nlargest also has an argument keep that allows one to deal with duplicate values. keep can take {‘first’, ‘last’, ‘all’}, where
first : prioritize the first occurrence(s)
last : prioritize the last occurrence(s)
all : does not drop any duplicates
How to Get Top N Rows Based on Largest Values in Multiple Columns in Pandas?
In the above example we saw getting top rows ordered by values of a single column. Pandas nlargest function can take more than one variable to order the top rows.
We can give a list of variables as input to nlargest and get first n rows ordered by the list of columns in descending order.
# top n rows ordered by multiple columns gapminder_2007.nlargest(3,['lifeExp','gdpPercap'])
Here we get top 3 rows with largest values in column “lifeExp” and then “gdpPercap”.
country year pop continent lifeExp gdpPercap 803 Japan 2007 127467972.0 Asia 82.603 31656.06806 671 Hong Kong China 2007 6980412.0 Asia 82.208 39724.97867 695 Iceland 2007 301931.0 Europe 81.757 36180.78919
How to Get Top N Rows Based on Smallest Values of a Column in Pandas?
Just as you guessed, Pandas has the function nsmallest to select top rows of smallest values in one or more column, in descending order.
Let use see an example of using nsmallest on gapminder data. Here is how to get top 3 countries with smallest lifeExp.
gapminder_2007.nsmallest(3,'liefExp')
country year pop continent lifeExp gdpPercap 1463 Swaziland 2007 1133066.0 Africa 39.613 4513.480643 1043 Mozambique 2007 19951656.0 Africa 42.082 823.685621 1691 Zambia 2007 11746035.0 Africa 42.384 1271.211593