When working with a Pandas dataframe that contains variables of different datatypes, one often wants to convert a specific character/string/Categorical variable into a numerical variable. One use of such a conversion is that it lets us quickly run correlation analysis on the variable.
In this post, we will see multiple examples of converting a character variable into an integer variable in Pandas. For example, we will convert a character variable with three different values, i.e. Adelie, Gentoo, and Chinstrap, into 0/1/2. Note that this is different from converting integer values stored as a character variable, like “1”, “2”, and “3”, to the integers 1/2/3. For that type of conversion, we can use Pandas’ to_numeric() or astype(int).
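To make the distinction concrete, here is a minimal sketch on two small hand-made Series (hypothetical toy data, not the penguins dataset). It contrasts parsing digit strings as numbers with assigning integer codes to labels.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
digits = pd.Series(["1", "2", "3"])
species = pd.Series(["Gentoo", "Adelie", "Chinstrap", "Adelie"])

# Digit strings: the text itself is parsed as a number.
as_numbers = pd.to_numeric(digits)
as_ints = digits.astype(int)

# Labels: each distinct label is assigned an integer code.
codes = species.astype("category").cat.codes

print(as_numbers.tolist())  # [1, 2, 3]
print(codes.tolist())       # [2, 0, 1, 0]
```

The codes follow the sorted order of the labels (Adelie, Chinstrap, Gentoo), not the order in which they first appear.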
Let us load the packages needed to illustrate this.
import pandas as pd
import seaborn as sns
We will use the Palmer Penguins dataset, which is available as one of Seaborn’s built-in datasets, and drop the rows with missing values.
penguins = sns.load_dataset("penguins")
penguins = penguins.dropna()
You can see that the character variables have the data type object by default in Pandas.
penguins.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object
1. Coding Character Variable to Integers Using Pandas Series
One way to convert the character variable into integer values is to work with the variable as a Pandas Series. We can get the variable of interest as a Series with
penguins.species
0      Adelie
1      Adelie
2      Adelie
4      Adelie
5      Adelie
        ...
338    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 333, dtype: object
And then convert the character variable into a Categorical variable using Pandas astype() function.
penguins.species.astype("category")
0      Adelie
1      Adelie
2      Adelie
4      Adelie
5      Adelie
        ...
338    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 333, dtype: category
Categories (3, object): ['Adelie', 'Chinstrap', 'Gentoo']
Then get the integers using cat.codes on the categorical variable.
penguins.species.astype("category").cat.codes
0      0
1      0
2      0
4      0
5      0
      ..
338    2
340    2
341    2
342    2
343    2
Length: 333, dtype: int8
To save the converted variable as part of the original dataframe, we can assign it back to the column:
penguins.species = penguins.species.astype("category").cat.codes
And now our updated dataframe looks like this
penguins.head()

   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0        0  Torgersen            39.1           18.7              181.0       3750.0    Male
1        0  Torgersen            39.5           17.4              186.0       3800.0  Female
2        0  Torgersen            40.3           18.0              195.0       3250.0  Female
4        0  Torgersen            36.7           19.3              193.0       3450.0  Female
5        0  Torgersen            39.3           20.6              190.0       3650.0    Male
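Overwriting the column discards the original labels, so it can help to record which code stands for which label before reassigning. The sketch below does this on a small hand-made Series (hypothetical toy data); the name code_to_label is just an illustrative choice.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
species = pd.Series(["Gentoo", "Adelie", "Chinstrap", "Adelie"])
cat = species.astype("category")

# The position of each label in cat.cat.categories is its integer code.
code_to_label = dict(enumerate(cat.cat.categories))
print(code_to_label)  # {0: 'Adelie', 1: 'Chinstrap', 2: 'Gentoo'}

# Round trip: codes back to the original labels.
codes = cat.cat.codes
recovered = codes.map(code_to_label)
print(recovered.tolist())  # ['Gentoo', 'Adelie', 'Chinstrap', 'Adelie']
```

Missing values receive the code -1. That does not come up here because rows with missing values were dropped earlier.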
2. Coding Character Variable to Integers Using Pandas DataFrame
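Another way to code a character variable as an integer variable is to work with the variable as a dataframe object. Selecting with double square brackets returns a one-column dataframe rather than a Series. The output below shows the original string labels, so this example assumes the species column has not already been overwritten as in the previous section.
penguins[['species']]

    species
0    Adelie
1    Adelie
2    Adelie
4    Adelie
5    Adelie
..      ...
338  Gentoo
340  Gentoo
341  Gentoo
342  Gentoo
343  Gentoo

333 rows × 1 columns
Then use the apply() function, which passes each column of the dataframe to the given function, to convert the column into integer codes as shown below
penguins[['species']].apply(lambda col: pd.Categorical(col).codes)

     species
0          0
1          0
2          0
4          0
5          0
..       ...
338        2
340        2
341        2
342        2
343        2

333 rows × 1 columns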
To save the converted variable back into the dataframe, we use
penguins[['species']] = penguins[['species']].apply(lambda col: pd.Categorical(col).codes)
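Because apply() works column by column, the same pattern should extend to several character columns at once. Here is a minimal sketch on a small hand-made dataframe (hypothetical toy data with penguin-like column names), where each column gets its own independent set of codes.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "sex": ["Male", "Female", "Female"],
    "body_mass_g": [3750.0, 5000.0, 3700.0],
})

# Columns to encode; numeric columns are left untouched.
cols = ["species", "sex"]
df[cols] = df[cols].apply(lambda col: pd.Categorical(col).codes)

print(df["species"].tolist())  # [0, 2, 1]
print(df["sex"].tolist())      # [1, 0, 0]
```

Codes are assigned per column, so a 0 in species (Adelie) and a 0 in sex (Female) are unrelated.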