• Skip to secondary menu
  • Skip to main content
  • Skip to primary sidebar

Python and R Tips

Learn Data Science with Python and R

  • Home
  • Python
  • Pandas
    • Pandas 101
  • tidyverse
    • tidyverse 101
  • R
  • Linux
  • Conferences
  • Python Books
  • About
    • Privacy Policy
You are here: Home / Python / Python Tips / How to Get Top N Rows with in Each Group in Pandas?

How to Get Top N Rows with in Each Group in Pandas?

March 29, 2019 by cmdlinetips

In this post we will see how to get top N rows from a data frame such that the top values of a specific variable in each group defined by another variable. Note this is not the same as top N rows according to one variable in the whole dataframe.

Let us say we have gapminder data frame that has life expectancy values for countries in five continents. And we would like to see top 3 countries (rows) with large life expectancies (descending order) in each continent. Basically, we need top N rows in each group.

We earlier wrote a post on getting top N rows in a data frame, but this one has a slight twist 🙂 See the blogpost,”How to Select Top N Rows with the Largest Values in a Column(s) in Pandas?”

top N rows in each group: Step by Step

Getting top N rows with in each group involves multiple steps. First, let us see how to get top N rows within each group step by step and later we can combine some of the steps.

Let us first load gapminder data frame from Carpentries site and filter the data frame to contain data for the year 2007.

data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
gapminder_2007 = gapminder[gapminder.year==2007]

Let us load Pandas.

# load pandas
import pandas as pd

Since we want to find top N countries with highest life expectancy in each continent group, let us group our dataframe by “continent” using Pandas’s groupby function. We save the resulting grouped dataframe into a new variable.

df1 = gapminder_2007.groupby(["continent"])

Remember, the resulting grouped dataframe has all the data, but for each group (here continent) separately. Next, we take the grouped dataframe and use the function apply in Pandas to sort each group within the grouped data frame. Since we want top countries with highest life expectancy, we sort by the variable “lifeExp”. Inside apply function, we use lambda function to perform sorting by “lifeExp”.

df2= df1.apply(lambda x: x.sort_values(["lifeExp"]

We have saved the resulting grouped and sorted dataframe into another variable. If you take a look at the content of this grouped and sorted dataframe, we can see that it has multi-index one for continent and the other index for row numbers.

df2

                country  year         pop continent  lifeExp     gdpPercap
continent                                                                 
Africa    1271  Reunion  2007    798094.0    Africa   76.442   7670.122558
          911     Libya  2007   6036914.0    Africa   73.952  12057.499280
          1571  Tunisia  2007  10276158.0    Africa   73.923   7092.923025

We can reset the index using “reset_index” function. Here we use “drop=True” option to avoid index being added as column to the resulting data frame.

>df3=df2.reset_index(drop=True)
>df3.head()

   country  year         pop continent  lifeExp     gdpPercap
0  Reunion  2007    798094.0    Africa   76.442   7670.122558
1    Libya  2007   6036914.0    Africa   73.952  12057.499280
2  Tunisia  2007  10276158.0    Africa   73.923   7092.923025

So, now we have a sorted dataframe. If you examine the rows, we can see that first we have countries from Afria with lifeExp in descending order and the next are other continents sorted by lifeExp in descending order.
Now we are ready to select N rows from each group, in this example “continent”.

We can use groupby function with “continent” as argument and use head() function to select the first N rows. Since the rows within each continent is sorted by lifeExp, we will get top N rows with high lifeExp for each continent.

Let us see the top most country with high lifeExp in each continent.

df3.groupby('continent').head(1)

Here is the topmost row in each group.

       country  year          pop continent  lifeExp     gdpPercap
0      Reunion  2007     798094.0    Africa   76.442   7670.122558
52      Canada  2007   33390141.0  Americas   80.653  36319.235010
77       Japan  2007  127467972.0      Asia   82.603  31656.068060
110    Iceland  2007     301931.0    Europe   81.757  36180.789190
140  Australia  2007   20434176.0   Oceania   81.235  34435.367440

We got the top N rows within each group in multiple steps. We can combine these steps by chaining the commands. Here we have chained the steps so that there are just two steps to get top N rows within each group. And we will get the same answer as above.

# get dataframe sorted by life Expectancy in each continent 
g = gapminder_2007.groupby(["continent"]).apply(lambda x: x.sort_values(["lifeExp"], ascending = False)).reset_index(drop=True)
# select top N rows within each continent
g.groupby('continent').head(1)

Share this:

  • Click to share on Facebook (Opens in new window) Facebook
  • Click to share on X (Opens in new window) X

Related posts:

Default ThumbnailHow to Select Top N Rows with the Largest Values in a Column(s) in Pandas? Default ThumbnailHow to Add Group-Level Summary Statistic as a New Column in Pandas? Default ThumbnailHow to Implement Pandas Groupby operation with NumPy? Default ThumbnailPandas Groupby and Sum

Filed Under: Python Tips Tagged With: pandas groupby(), Pandas top N rows in each group, Pandas top rows

Primary Sidebar

Subscribe to Python and R Tips and Learn Data Science

Learn Pandas in Python and Tidyverse in R

Tags

Altair Basic NumPy Book Review Data Science Data Science Books Data Science Resources Data Science Roundup Data Visualization Dimensionality Reduction Dropbox Dropbox Free Space Dropbox Tips Emacs Emacs Tips ggplot2 Linux Commands Linux Tips Mac Os X Tips Maximum Likelihood Estimation in R MLE in R NumPy Pandas Pandas 101 Pandas Dataframe Pandas Data Frame pandas groupby() Pandas select columns Pandas select_dtypes Python Python 3 Python Boxplot Python Tips R rstats R Tips Seaborn Seaborn Boxplot Seaborn Catplot Shell Scripting Sparse Matrix in Python tidy evaluation tidyverse tidyverse 101 Vim Vim Tips

RSS RSS

  • How to convert row names to a column in Pandas
  • How to resize an image with PyTorch
  • Fashion-MNIST data from PyTorch
  • Pandas case_when() with multiple examples
  • An Introduction to Statistical Learning: with Applications in Python Is Here
  • 10 Tips to customize ggplot2 title text
  • 8 Plot types with Matplotlib in Python
  • PCA on S&P 500 Stock Return Data
  • Linear Regression with Matrix Decomposition Methods
  • Numpy’s random choice() function

Copyright © 2025 · Lifestyle Pro on Genesis Framework · WordPress · Log in

Go to mobile version