Selecting one or more columns from a data frame is straightforward in Pandas. For example, if we want to select multiple columns with names of the columns as a list, we can one of the methods illustrated in
Sometimes you may be working with a larger dataframe with many columns and you don’t exactly know the column names you want. However, you may know that the column names start with some prefix or end with some suffix and interested in some of those columns.
In such a scenario, basically we are interested in how to select columns using prefix or suffix of columns names in Pandas. Basically, we need to do some kind of pattern matching to identify the columns of interest.
In this post we will see examples of
- How to select multiple columns that start with a prefix?
- How to selecting columns that end with a suffix?
We will use two different ways to select columns with prefix/suffix.
We will first use Pandas filter function with some simple regular expression for pattern matching to select the columns of interest. And then, we will use Pandas’ loc function to do the same.
Let us first load Pandas package
import pandas as pd
We will use gapminder data from Carpentries website.
data_url = "https://goo.gl/ioc2Td" gapminder = pd.read_csv(data_url)
Note that this file contains gapminder data in wider form. It has 142 rows and 38 columns.
print(gapminder.head(3)) continent country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 \ 0 Africa Algeria 2449.008185 3013.976023 2550.816880 1 Africa Angola 3520.610273 3827.940465 4269.276742 2 Africa Benin 1062.752200 959.601080 949.499064
And the column names have some variable as prefixes, like gdpPercap, lifeExp, and so on. Also the column names end with numerical suffix.
Index(['continent', 'country', 'gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967', 'gdpPercap_1972', 'gdpPercap_1977'], dtype='object')
Let us first use Pandas’ filter function and regular expression pattern to select columns starting with a prefix. Pandas’ filter function takes two main arguments and one of them is regex, where we need to specify the pattern we are interested in as regular expression. And we also need to specify axis=1 to select columns.
Select Columns with a Prefix using Pandas filter
For example, if we are interested in selecting columns starting with “lifeExp”, the regular expression for the pattern is “^lifeExp”. In the regular expression “^” represents we are interested in patterns that starts with. So our argument for “regexp” will be regexp=’^lifeExp’.
gapminder.filter(regex='^lifeExp',axis=1).head()
We will get smaller dataframe with the column names starting with “lifeExp”.
lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967 lifeExp_1972 \ 0 43.077 45.685 48.303 51.407 54.518 1 30.015 31.999 34.000 35.985 37.928 2 38.223 40.358 42.618 44.885 47.014
Select Columns with a suffix using Pandas filter
Let us select columns with names ending with a suffix in Pandas dataframe using filter function. As before, we need to come up with regular expression for the pattern we are interested in. Here our pattern is column names ending with a suffix.
Let us select columns ending with “1957” and the regular expression pattern is ‘1957$’, where the dollar symbol at the end represents the pattern ending with “1957”.
We use regex=’1957$’ as argument to the Pandas’ filter function and addition to axis=1. We get a data frame with three columns that have names ending with 1957.
>gapminder.filter(regex='1957$',axis=1).head() gdpPercap_1957 lifeExp_1957 pop_1957 0 3013.976023 45.685 10270856.0 1 3827.940465 31.999 4561361.0 2 959.601080 40.358 1925173.0
We can also combine both prefix and suffix, using appropriate regular expression, to select columns starting and ending with some prefix and suffix.
How to Select Columns Using Prefix/Suffix of Column Names with Pandas loc function?
Another way to select columns starting/ending with some prefix/suffix is to use Pandas loc function together with Pandas’ str function.
Basic idea is that Pandas str function can be used get a numpy boolean array to select column names containing or starting with or ending with some pattern. Then we can use the boolean array to select the columns using Pandas loc function.
Let us first select columns starting with prefix “lifeExp” using Pandas loc function. Let us get the boolean array from the column names to find which column names start with “lifeExp.
>gapminder.columns.str.startswith("life") array([False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, False, False, False])
Now we can use the boolean array as argument to Pandas loc function to select the columns starting with “life”
gapminder.loc[:, gapminder.columns.str.startswith("life")] lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967 lifeExp_1972 .. .. 0 43.077 45.685 48.303 51.407 54.518 . . . 1 30.015 31.999 34.000 35.985 37.928 . . . 2 38.223 40.358 42.618 44.885 47.014 . . .
gapminder.loc[:, gapminder.columns.str.endswith("1957")] gdpPercap_1957 lifeExp_1957 pop_1957 0 3013.976023 45.685 10270856.0 1 3827.940465 31.999 4561361.0 2 959.601080 40.358 1925173.0