How To Code a Character Variable into Integer in Pandas

How to Code Character Variable as Integers with Pandas?

When working with a Pandas dataframe that contains variables of different data types, we often want to convert a specific character/string/categorical variable into a numerical variable. One use of such a conversion is that it lets us quickly include the variable in a correlation analysis.

In this post, we will see multiple examples of converting a character variable into an integer variable in Pandas. For example, we will convert a character variable with three different values, i.e. Adelie, Chinstrap, and Gentoo, into 0/1/2. Note that this is different from converting integer values stored as a character variable, like “1”, “2”, and “3”, to the integers 1/2/3. For that type of conversion, we can use Pandas’ to_numeric() or astype(int).
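To make the distinction concrete, here is a minimal sketch on two small hand-made Series (the toy values and the variable names are assumptions for illustration, not part of the penguins data). It contrasts parsing digit strings with pd.to_numeric() / astype(int) against coding category labels with cat.codes.

```python
import pandas as pd

# Toy data (assumed for illustration): digits stored as strings
digits = pd.Series(["1", "2", "3"])

# Parsing: the strings already spell out numbers
as_int = digits.astype(int)      # raises ValueError on non-numeric text
as_num = pd.to_numeric(digits)   # infers an integer dtype here

# Coding: the labels have no numeric meaning, so each distinct
# label is assigned an integer code instead
labels = pd.Series(["Adelie", "Gentoo", "Chinstrap", "Adelie"])
codes = labels.astype("category").cat.codes

print(as_num.tolist())
print(codes.tolist())
```

The codes follow the sorted order of the labels, so Adelie, Chinstrap, and Gentoo become 0, 1, and 2 regardless of the order in which they appear.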

Let us load the packages needed to illustrate this.

import pandas as pd
import seaborn as sns

We will use the Palmer Penguins dataset, which is available as one of Seaborn’s built-in datasets. We also drop the rows with missing values.

penguins = sns.load_dataset("penguins")
penguins = penguins.dropna()

You can see that Pandas stores the character variables with the data type object by default.

penguins.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

1. Coding Character Variable to Integers Using Pandas Series

One way to convert a character variable into integer values is to work with it as a Pandas Series. We can get the variable of interest as a Series with:

penguins.species
0      Adelie
1      Adelie
2      Adelie
4      Adelie
5      Adelie
        ...  
338    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 333, dtype: object

We then convert the character variable into a categorical variable using the Pandas astype() function.

penguins.species.astype("category")
0      Adelie
1      Adelie
2      Adelie
4      Adelie
5      Adelie
        ...  
338    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 333, dtype: category
Categories (3, object): ['Adelie', 'Chinstrap', 'Gentoo']
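The order of the categories listed above determines which integer each label receives. As a small sketch on a hand-made Series (the toy values and the custom order are assumptions chosen for illustration), the default order is the sorted unique values, and pd.CategoricalDtype can be used to specify a different order explicitly.

```python
import pandas as pd

# Toy Series (assumed for illustration)
species = pd.Series(["Gentoo", "Adelie", "Chinstrap", "Adelie"])

# Default: categories are the sorted unique values
default_cat = species.astype("category")
print(list(default_cat.cat.categories))
print(default_cat.cat.codes.tolist())

# Explicit order (an arbitrary choice, just to show the effect)
custom_dtype = pd.CategoricalDtype(categories=["Gentoo", "Chinstrap", "Adelie"])
custom_cat = species.astype(custom_dtype)
print(custom_cat.cat.codes.tolist())
```

Fixing the category order this way can help keep the codes stable when the same conversion is applied to different subsets of the data.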

Then get the integers using cat.codes on the categorical variable.

penguins.species.astype("category").cat.codes
0      0
1      0
2      0
4      0
5      0
      ..
338    2
340    2
341    2
342    2
343    2
Length: 333, dtype: int8
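Once only the integers are left, it is easy to lose track of which code stands for which label. The sketch below, on a hand-made Series (toy values and variable names are assumptions), builds a code-to-label lookup from cat.categories. It also includes one missing value to show that cat.codes marks missing entries with -1, which is one reason for dropping or handling them beforehand.

```python
import pandas as pd

# Toy Series with one missing value (assumed for illustration)
species = pd.Series(["Adelie", "Gentoo", None, "Chinstrap"])

cat = species.astype("category")
codes = cat.cat.codes

# Lookup table from integer code back to the original label
code_to_label = dict(enumerate(cat.cat.categories))

print(codes.tolist())    # the missing entry is coded as -1
print(code_to_label)
```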

To save the converted variable as part of the original dataframe, we can re-assign it:

penguins.species = penguins.species.astype("category").cat.codes

Our updated dataframe now looks like this:

penguins.head()
	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	0	Torgersen	39.1	18.7	181.0	3750.0	Male
1	0	Torgersen	39.5	17.4	186.0	3800.0	Female
2	0	Torgersen	40.3	18.0	195.0	3250.0	Female
4	0	Torgersen	36.7	19.3	193.0	3450.0	Female
5	0	Torgersen	39.3	20.6	190.0	3650.0	Male
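With the variable coded as integers, it can take part in a correlation computation alongside the numeric columns. The following is a rough sketch on a tiny made-up dataframe (the body mass values are invented for illustration and are not taken from the penguins data). Keep in mind that the codes follow the alphabetical order of the labels, so the sign and size of the correlation depend on that arbitrary ordering.

```python
import pandas as pd

# Tiny made-up dataframe (values are invented for illustration)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3700.0, 3800.0, 5000.0, 5100.0],
})

# Code the character variable as integers, as in the steps above
toy["species_code"] = toy["species"].astype("category").cat.codes

# Pearson correlation between the coded variable and a numeric one
corr = toy[["species_code", "body_mass_g"]].corr()
r = corr.loc["species_code", "body_mass_g"]
print(corr)
```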

2. Coding Character Variable to Integers Using Pandas DataFrame

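Another way to code a character variable as an integer variable is to work with it as a one-column dataframe. Starting again from the original dataframe, where species still holds the character values, we can select the column as a dataframe using double brackets: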

penguins[['species']]


species
0	Adelie
1	Adelie
2	Adelie
4	Adelie
5	Adelie
...	...
338	Gentoo
340	Gentoo
341	Gentoo
342	Gentoo
343	Gentoo
333 rows × 1 columns

We then use the apply() function, which passes each column of the dataframe to the lambda, to convert the column into integer codes as shown below.

penguins[['species']].apply(lambda col:pd.Categorical(col).codes)
	species
0	0
1	0
2	0
4	0
5	0
...	...
338	2
340	2
341	2
342	2
343	2
333 rows × 1 columns

To save the converted variable as a variable in the dataframe, we use


penguins[['species']]=penguins[['species']].apply(lambda col:pd.Categorical(col).codes)
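Because apply() works column by column, the same pattern can code several character variables in one step. Here is a minimal sketch on a small hand-made dataframe (the rows, the body mass values, and the char_cols list are assumptions for illustration); with the penguins data, the list would name species, island, and sex.

```python
import pandas as pd

# Small hand-made dataframe (values are assumed for illustration)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "sex": ["Male", "Female", "Female"],
    "body_mass_g": [3750.0, 5000.0, 3500.0],
})

# Character columns to code; numeric columns are left untouched
char_cols = ["species", "island", "sex"]

# Each column gets its own independent set of integer codes
toy[char_cols] = toy[char_cols].apply(lambda col: pd.Categorical(col).codes)
print(toy)
```

Each column is coded independently, so a 0 in species and a 0 in island refer to unrelated labels.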