In this post, we will learn how to use Pandas filter() function to subset a dataframe based on its column names and row indexes. Pandas has a number of ways to subset a dataframe, but Pandas filter() function differ from others in a key way.
Pandas filter() function does not filter a dataframe on its content. It primarily use labels of dataframe to subset a dataframe.
Here we will see examples of how to is Pandas filter() function to select one or more columns using the column names and select one or more rows using row indices.
Let us first get started with loading the modules necessary.
import seaborn as sns import pandas as pd
We will use a subset of Penguins dataset available from Seaborn’s built-in dataset. Here we load the dataset and use head() function to get a few rows.
df = (sns.load_dataset("penguins"). head())
And our toy dataset look like this.
df species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male 1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female 2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female 3 Adelie Torgersen NaN NaN NaN NaN NaN 4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
Our toy dataframe’ row index is integer. For the sake of easiness, let us change the row index to be string instead of numbers in order. Let us assign new row index names using “index” method in Pandas.
# assign index names to dataframe df.index=["one","two","three","four","five"]
After changing the row indices our dataframe looks like this
df species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex one Adelie Torgersen 39.1 18.7 181.0 3750.0 Male two Adelie Torgersen 39.5 17.4 186.0 3800.0 Female three Adelie Torgersen 40.3 18.0 195.0 3250.0 Female four Adelie Torgersen NaN NaN NaN NaN NaN five Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
How to Select a Single Column using Pandas filter() function?
We can select a column from a dataframe by using the column name we want to select as a list to filter() function. In this example, we select species column from the dataframe.
df.filter(["species"]) species one Adelie two Adelie three Adelie four Adelie five Adelie
By default, filter() function selects a column when we provide the column label as a list. We can also specify axis=1 to specify we are filtering columns.
df.filter(["species"], axis=1) species one Adelie two Adelie three Adelie four Adelie five Adelie
Another way to specify we are selecting columns is to use axis=”columns” and this argument makes more sense as it clearly specifies what we are selecting.
df.filter(["species"], axis="columns") species one Adelie two Adelie three Adelie four Adelie five Adelie
How to Select a Multiple Columns using Pandas filter() function?
To select multiple columns by their column names, we should provide the list of column names as list to Pandas filter() function.
df.filter(["species", "bill_length_mm"])
species bill_length_mm one Adelie 39.1 two Adelie 39.5 three Adelie 40.3 four Adelie NaN five Adelie 36.7
How to Use Regular Expression to Select a Multiple Columns using Pandas filter() function?
We can also use regular expression to match the patterns of interest on column names and select multiple columns using Pandas filter() function.
In this example below, we select columns that ends with “mm” in the dataframe using “regex=’mm$'” as argument. We see that we have three columns that ends with “mm”.
df.filter(regex='mm$', axis="columns") bill_length_mm bill_depth_mm flipper_length_mm one 39.1 18.7 181.0 two 39.5 17.4 186.0 three 40.3 18.0 195.0 four NaN NaN NaN five 36.7 19.3 193.0
How to Use Regular Expression to Select a Multiple Rows using Pandas filter() function?
We can also use regular expression to select rows from a dataframe based on row names. In this example below, we select rows that ends with “e” using regx=’e$’ argument. Here, we also specify axis=”index” to specify we are filtering rows not columns.
df.filter(regex='e$', axis="index")
Now we have filtered rows whose row names end with letter “e”; one, three and five.
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex one Adelie Torgersen 39.1 18.7 181.0 3750.0 Male three Adelie Torgersen 40.3 18.0 195.0 3250.0 Female five Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
Sometimes you might want to select some columns and rows. We can use Pandas chain operation with multiple filter() functions to select rows and columns or columns and rows of interest.
(df. filter(regex='^f', axis="index"). filter(["species","bill_length_mm"]))
In this example, we first select some rows using regular expression and the select two columns using filter() function.
species bill_length_mm four Adelie NaN five Adelie 36.7