How to Convert to Best Data Types Automatically in Pandas?

When you load your data as a Pandas dataframe, Pandas automatically assigns a data type to each variable/column in the dataframe. Typically these are the int, float and object data types. Starting with Pandas 1.0.0, we can make Pandas infer the best data types for the variables in a dataframe.

We will use Pandas’ convert_dtypes() function to convert the columns to the best data types automatically. Another big advantage of using convert_dtypes() is that it supports pd.NA, Pandas’ new value for representing missing data.

Let us load Pandas and check its version.

import pandas as pd
pd.__version__
'1.0.0'

We will use the gapminder data set hosted on cmdlinetips.com’s GitHub page.

data_url = "https://raw.githubusercontent.com/cmdlinetips/data/master/gapminder-FiveYearData.csv"
df = pd.read_csv(data_url)
df.head()

The gapminder dataframe looks like this.

       country  year         pop  continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0       Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0       Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0       Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0       Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0       Asia   36.088  739.981106

Let us check the data types of the gapminder dataframe.

df.dtypes

We can see that the columns are of type float64, int64 and object. The string variables, country and continent, are stored as the “object” data type.

country       object
year           int64
pop          float64
continent     object
lifeExp      float64
gdpPercap    float64
dtype: object

Let us use the convert_dtypes() function, available in Pandas starting from version 1.0.0.

By default, convert_dtypes will attempt to convert a Series (or each Series in a DataFrame) to dtypes that support pd.NA. By using the options convert_string, convert_integer, and convert_boolean, it is possible to turn off individual conversions to StringDtype, the integer extension types or BooleanDtype, respectively.
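As a minimal sketch of these options, the example below builds a small made-up frame and compares the default conversion with one where convert_string=False. The frame and the names toy, converted and no_strings are illustrative assumptions, not part of the gapminder data.

```python
import pandas as pd

# Small illustrative frame (hypothetical values). The dtypes are set
# explicitly so the columns start out as object and int64.
toy = pd.DataFrame({
    "country": pd.Series(["Afghanistan", "Albania"], dtype=object),
    "year": pd.Series([1952, 1957], dtype="int64"),
})

# Default: every supported conversion is attempted.
converted = toy.convert_dtypes()
print(converted.dtypes)

# Turn off only the string conversion; the integer column is still converted.
no_strings = toy.convert_dtypes(convert_string=False)
print(no_strings.dtypes)
```

With convert_string=False the country column should stay as object, while year is still converted to the nullable Int64 type.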

Let us check the results from convert_dtypes().

df.convert_dtypes().dtypes

We can see that convert_dtypes() has recognised the variables of data type “object” and converted them to the string data type. It has also converted year from int64 to the nullable Int64 type.

country       string
year           Int64
pop          float64
continent     string
lifeExp      float64
gdpPercap    float64
dtype: object
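The gapminder columns shown here have no missing values, so the pd.NA support mentioned earlier does not show up in this output. A rough sketch of it, using a made-up Series (the values and the names s and converted are assumptions for illustration), might look like this:

```python
import pandas as pd

# Integers with a gap are stored as float64, with NaN marking the gap.
s = pd.Series([1952, 1957, None])
print(s.dtype)

# convert_dtypes() moves the column to the nullable Int64 type,
# where the missing entry is represented by pd.NA.
converted = s.convert_dtypes()
print(converted.dtype)
print(converted.isna().tolist())
```

The values stay integers after conversion instead of being forced to floats just because one entry is missing.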

This post is part of the series on Pandas 101, a tutorial covering tips and tricks on using Pandas for data munging and analysis.