Often you may want to create a new variable either from column names of a pandas data frame or from one of the columns of the data frame. Just like Python, Pandas has great string manipulation abilities that lets you manipulate strings easily.
Let us see an example of using Pandas to manipulate column names and a column. Let us first load Pandas and NumPy to create a Pandas data frame.
import pandas as pd import numpy as np
Let us also create a new small pandas data frame with five columns to work with. We can create the pandas data frame from multiple lists. Here one of the columns is sample IDs with two-part strings separated by underscore “_”. Two columns are integers and other two columns are random numbers generated by NumPy’s random module.
df = pd.DataFrame({'s_ID':['a_id','b_id','c_id','d_id'], 'A_1': [1, 1, 3, 2], 'A_2': [1, 2, 3, 4], 'B_1': np.random.randn(4), 'B_2': np.random.randn(4)})
The column names of the data frame also has two parts separated by underscore “_”.
A_1 A_2 B_1 B_2 s_ID 0 1 1 0.310886 -1.105775 a_id 1 1 2 -0.580864 -2.012918 b_id 2 3 3 -0.153627 -0.013282 c_id 3 2 4 -1.532841 -0.757318 d_id
We can get the names of the columns as a list from pandas dataframe using
>df.columns.tolist() ['A_1', 'A_2', 'B_1', 'B_2', 's_ID']
To split the column names and get part of it, we can use Pandas “str” function. Str function in Pandas offer fast vectorized string operations for Series and Pandas. Str returns a string object.
>df.columns.str pandas.core.strings.StringMethods at 0x113ad2780
How to Get Part of a Column Names in Pandas Data Frame?
Pandas str accessor has numerous useful methods and one of them is “split”. We can use str with split to get the first, second or nth part of the string. For example, to get the first part of the string, we will first split the string with a delimiter. Here each part of the string is separated by ““, so we can split by ““.
>df.columns.str.split('_') Index([['A', '1'], ['A', '2'], ['B', '1'], ['B', '2'], ['s', 'ID']], dtype='object')
We can see that df.columns.str.split(“_”) returns a list of lists as Pandas series object. We can easily get the first part of it using the str function again.
>df.columns.str.split('_').str[0] Index(['A', 'A', 'B', 'B', 's'], dtype='object')
If one is interested in other parts of a string, we can specify the index as argument to str function. For example, to get the second part of the string after the underscore in the column names.
>df.columns.str.split('_').str[1] Index(['1', '2', '1', '2', 'ID'], dtype='object')
Note that the result using str function to get the part of a string in vectorized fashion in Pandas results in Pandas Series object. We can easily convert to a list using tolist() function.
>df.columns.str.split('_').str[0].tolist() ['A', 'A', 'B', 'B', 's']
How to Get Part of a Column in Pandas Data Frame?
Here we just saw an example on how to get part of column names of a data frame. We can also easily get a part of any column in a Pandas data frame. For example, to get the first part of the column ‘s_ID’ in our pandas data frame, we can use
df.s_ID.str.split("_").str[0].tolist() ['a', 'b', 'c', 'd']