How To Select Columns Using Prefix/Suffix of Column Names in Pandas?

How to Select Columns with Prefix in Pandas Python
Selecting one or more columns from a data frame is straightforward in Pandas. For example, if we want to select multiple columns with names of the columns as a list, we can one of the methods illustrated in

Sometimes you may be working with a larger dataframe with many columns and you don’t exactly know the column names you want. However, you may know that the column names start with some prefix or end with some suffix and interested in some of those columns.

In such a scenario, basically we are interested in how to select columns using prefix or suffix of columns names in Pandas. Basically, we need to do some kind of pattern matching to identify the columns of interest.

In this post we will see examples of

  1. How to select multiple columns that start with a prefix?
  2. How to selecting columns that end with a suffix?

We will use two different ways to select columns with prefix/suffix.

We will first use Pandas filter function with some simple regular expression for pattern matching to select the columns of interest. And then, we will use Pandas’ loc function to do the same.

Let us first load Pandas package

import pandas as pd

We will use gapminder data from Carpentries website.

data_url = "https://goo.gl/ioc2Td"
gapminder = pd.read_csv(data_url)

Note that this file contains gapminder data in wider form. It has 142 rows and 38 columns.

print(gapminder.head(3))

  continent  country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Africa  Algeria     2449.008185     3013.976023     2550.816880   
1    Africa   Angola     3520.610273     3827.940465     4269.276742   
2    Africa    Benin     1062.752200      959.601080      949.499064   

And the column names have some variable as prefixes, like gdpPercap, lifeExp, and so on. Also the column names end with numerical suffix.


Index(['continent', 'country', 'gdpPercap_1952', 'gdpPercap_1957',
       'gdpPercap_1962', 'gdpPercap_1967', 'gdpPercap_1972', 'gdpPercap_1977'],
      dtype='object')

Let us first use Pandas’ filter function and regular expression pattern to select columns starting with a prefix. Pandas’ filter function takes two main arguments and one of them is regex, where we need to specify the pattern we are interested in as regular expression. And we also need to specify axis=1 to select columns.

Select Columns with a Prefix using Pandas filter

For example, if we are interested in selecting columns starting with “lifeExp”, the regular expression for the pattern is “^lifeExp”. In the regular expression “^” represents we are interested in patterns that starts with. So our argument for “regexp” will be regexp=’^lifeExp’.

gapminder.filter(regex='^lifeExp',axis=1).head()

We will get smaller dataframe with the column names starting with “lifeExp”.

lifeExp_1952	lifeExp_1957	lifeExp_1962	lifeExp_1967	lifeExp_1972 \
0	43.077	45.685	48.303	51.407	54.518
1	30.015	31.999	34.000	35.985	37.928
2	38.223	40.358	42.618	44.885	47.014

Select Columns with a suffix using Pandas filter

Let us select columns with names ending with a suffix in Pandas dataframe using filter function. As before, we need to come up with regular expression for the pattern we are interested in. Here our pattern is column names ending with a suffix.

Let us select columns ending with “1957” and the regular expression pattern is ‘1957$’, where the dollar symbol at the end represents the pattern ending with “1957”.

We use regex=’1957$’ as argument to the Pandas’ filter function and addition to axis=1. We get a data frame with three columns that have names ending with 1957.

>gapminder.filter(regex='1957$',axis=1).head()

gdpPercap_1957	lifeExp_1957	pop_1957
0	3013.976023	45.685	10270856.0
1	3827.940465	31.999	4561361.0
2	959.601080	40.358	1925173.0

We can also combine both prefix and suffix, using appropriate regular expression, to select columns starting and ending with some prefix and suffix.

How to Select Columns Using Prefix/Suffix of Column Names with Pandas loc function?

Another way to select columns starting/ending with some prefix/suffix is to use Pandas loc function together with Pandas’ str function.

Basic idea is that Pandas str function can be used get a numpy boolean array to select column names containing or starting with or ending with some pattern. Then we can use the boolean array to select the columns using Pandas loc function.

Let us first select columns starting with prefix “lifeExp” using Pandas loc function. Let us get the boolean array from the column names to find which column names start with “lifeExp.

>gapminder.columns.str.startswith("life")

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False])

Now we can use the boolean array as argument to Pandas loc function to select the columns starting with “life”

gapminder.loc[:, gapminder.columns.str.startswith("life")]

	lifeExp_1952	lifeExp_1957	lifeExp_1962	lifeExp_1967	lifeExp_1972 .. ..
0	43.077	45.685	48.303	51.407	54.518 . . .
1	30.015	31.999	34.000	35.985	37.928 . . .
2	38.223	40.358	42.618	44.885	47.014 . . .
gapminder.loc[:, gapminder.columns.str.endswith("1957")]

     gdpPercap_1957	lifeExp_1957	pop_1957
0	3013.976023	45.685	10270856.0
1	3827.940465	31.999	4561361.0
2	959.601080	40.358	1925173.0