When you load your data as a Pandas dataframe, Pandas automatically assigns a datatype to each variable/column in the dataframe. Typically, these would be int, float, and object datatypes. Starting with Pandas 1.0.0, we can make Pandas infer the best datatypes for the variables in a dataframe.
We will use Pandas’ convert_dtypes() function to convert the columns to the best data types automatically. Another big advantage of using convert_dtypes() is that it supports pd.NA, Pandas’ new type for missing values.
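To make the pd.NA point concrete, here is a minimal sketch on a small made-up dataframe (the name toy and its values are hypothetical, not part of the gapminder data). It assumes Pandas 1.0.0 or later and shows how a column with a missing integer is first read as float64, then becomes a nullable integer column after convert_dtypes().

```python
import pandas as pd

# Hypothetical toy data with one missing country and one missing year.
toy = pd.DataFrame({
    "country": ["Afghanistan", None, "Algeria"],
    "year": [1952, None, 1962],
})

# The missing value forces "year" to be stored as float64.
print(toy.dtypes)

# convert_dtypes() returns a new dataframe with dtypes that support pd.NA.
converted = toy.convert_dtypes()
print(converted.dtypes)

# The missing entries are now represented by pd.NA.
missing_year = converted.loc[1, "year"]
missing_country = converted.loc[1, "country"]
print(missing_year is pd.NA, missing_country is pd.NA)
```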
Let us load Pandas and check its version.
import pandas as pd
pd.__version__
1.0.0
We will use the gapminder data set hosted on cmdlinetips.com’s GitHub page.
data_url = "https://raw.githubusercontent.com/cmdlinetips/data/master/gapminder-FiveYearData.csv"
df = pd.read_csv(data_url)
df.head()
The gapminder dataframe looks like this.
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0      Asia   36.088  739.981106
Let us check the data types of the gapminder dataframe.
df.dtypes
We can see that the columns are of float64, int64, and object data types, and that the string variables are stored as the “object” data type.
country       object
year           int64
pop          float64
continent     object
lifeExp      float64
gdpPercap    float64
dtype: object
Let us use the convert_dtypes() function, available in Pandas starting from version 1.0.0.
By default, convert_dtypes will attempt to convert a Series (or each Series in a DataFrame) to dtypes that support pd.NA. By using the options convert_string, convert_integer, and convert_boolean, it is possible to turn off individual conversions to StringDtype, the integer extension types or BooleanDtype, respectively.
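As a rough illustration of these options, the sketch below turns off only the string conversion. It uses a small hypothetical dataframe (df_small, with made-up rows) rather than the full gapminder data so that it runs without a download. Note that Pandas versions newer than the 1.0.0 used in this post may also convert float columns to a nullable float type, so the printed dtypes can differ slightly from the output shown here.

```python
import pandas as pd

# Hypothetical small dataframe standing in for the gapminder data.
df_small = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "year": [1952, 1957, 1962],
    "lifeExp": [28.801, 55.23, 43.077],
})

# Default behaviour: every supported conversion is attempted.
default_dtypes = df_small.convert_dtypes().dtypes

# Leave text columns untouched, but still convert integers.
no_string_dtypes = df_small.convert_dtypes(convert_string=False).dtypes

print(default_dtypes)
print(no_string_dtypes)
```

The same pattern applies to convert_integer=False and convert_boolean=False when you want to keep the original integer or boolean dtypes.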
Let us check the results from convert_dtypes().
df.convert_dtypes().dtypes
We can see that the convert_dtypes() function has recognised the variables that were of datatype “object” and converted them to the string data type. It has also converted the year variable from int64 to the nullable Int64 type.
country       string
year           Int64
pop          float64
continent     string
lifeExp      float64
gdpPercap    float64
dtype: object
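One practical detail: convert_dtypes() returns a new dataframe and does not modify the original, so the result has to be assigned to keep the new data types. A minimal sketch, using a hypothetical two-row dataframe named df_demo:

```python
import pandas as pd

# Hypothetical two-row dataframe for illustration.
df_demo = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "year": [1952, 1957],
})

# Calling convert_dtypes() without assignment leaves df_demo unchanged.
df_demo.convert_dtypes()
dtype_before = str(df_demo["year"].dtype)

# Assign the result back to keep the converted data types.
df_demo = df_demo.convert_dtypes()
dtype_after = str(df_demo["year"].dtype)

print(dtype_before, dtype_after)
```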
This post is part of the series on Pandas 101, a tutorial covering tips and tricks on using Pandas for data munging and analysis.