Pandas 1.0.0 is Here: Top New Features of Pandas You Should Know

Pandas 1.0.0 is Out
Pandas 1.0.0 is ready for prime time now. The Pandas project has come a long way since the early release of Pandas version 0.4 in 2011. Back then it had contributions from just two developers, including Wes McKinney; now Pandas has over 300 contributors.

The latest version of Pandas can be installed through standard channels like Anaconda, Miniconda, and PyPI.

The Pandas team recommends first upgrading to Pandas 0.25 if you are not already on it, and making sure your existing code does not break, before upgrading to Pandas 1.0.
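A sketch of that upgrade path with pip (conda works equally well; the exact channel and environment setup are up to you):

```shell
# step 1: move to the 0.25 series first and run your existing code/tests
pip install "pandas==0.25.*"

# step 2: once everything passes, upgrade to 1.0.0
pip install --upgrade "pandas==1.0.0"

# or, with conda:
# conda install pandas=1.0.0
```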

# load pandas
import pandas as pd

# check pandas version
print(pd.__version__)

1.0.0

Let us see the top features of the new Pandas version 1.0.0.

1. Pandas rolling function gets faster with Numba

With Pandas 1.0, Pandas’ apply() function on rolling windows can make use of Numba (if installed) instead of Cython and run faster. To use Numba inside apply(), one needs to specify the engine='numba' and engine_kwargs arguments. With Numba, apply() is much faster on larger datasets (like a rolling computation over a million data points).

Let us try an example using a rolling window function on a large dataset from the Pandas documentation.

data = pd.Series(range(1_000_000))
data.head()

0    0
1    1
2    2
3    3
4    4
dtype: int64

Let us apply a rolling function on the data with a window length of 10.

roll = data.rolling(10)

Let us write a custom function to apply with rolling. Note that the function uses NumPy, so we need to import it.

import numpy as np

def f(x):
    return np.sum(x) + 5

The apply() function for rolling windows in Pandas can make use of Numba instead of Cython, if Numba is already installed, and make the computation faster. We can use Numba by specifying engine='numba' inside apply(). When you call apply() with the Numba option for the first time, it will be slightly slow due to compilation overhead.

# Run the first time, compilation time will affect performance
%timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True)  
3.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Subsequent use of Numba will be faster as the function is cached.

# Function is cached and performance will improve
%timeit roll.apply(f, engine='numba', raw=True)
220 ms ± 7.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For comparison, here is the timing with the engine='cython' option.

%timeit roll.apply(f, engine='cython', raw=True)
4.64 s ± 86.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2. Convert a Dataframe to Markdown

Pandas 1.0 has a new function to_markdown() that helps convert a Pandas data frame to a Markdown table. For me, to_markdown() did not work initially and complained that “tabulate” was missing. After installing tabulate with conda install tabulate, to_markdown() worked fine.

Let us try an example using Pandas’ to_markdown() function.

from vega_datasets import data
seattle_temps = data.seattle_temps()
print(seattle_temps.head())

With Pandas 1.0.0, we get a nicely tabulated Markdown table as the result.

print(seattle_temps.head().to_markdown())

|    | date                |   temp |
|---:|:--------------------|-------:|
|  0 | 2010-01-01 00:00:00 |   39.4 |
|  1 | 2010-01-01 01:00:00 |   39.2 |
|  2 | 2010-01-01 02:00:00 |   39   |
|  3 | 2010-01-01 03:00:00 |   38.9 |
|  4 | 2010-01-01 04:00:00 |   38.8 |

3. Dedicated String Type

With Pandas 1.0.0, we get a dedicated string type for strings. Before this, string variables were all dumped under “object”. Now string variables get a dedicated type.

Let us try an example data frame with a string variable.

df = pd.DataFrame({'a': [1, 2] * 2,
                   'b': [True, False] * 2,
                   'c': [1.0, 2.0] * 2,
                   'd': ["abc","def"]*2})
df

We can check that Pandas assigns “object” as the data type for the string variable “d” in our example.

df.dtypes

a      int64
b       bool
c    float64
d     object
dtype: object

Pandas 1.0.0 offers a new function, convert_dtypes(); when applied to a data frame, it gives string variables the dedicated string data type.

df_new = df.convert_dtypes()

df_new.dtypes
a      Int64
b    boolean
c      Int64
d     string
dtype: object

One of the biggest advantages of having a dedicated string data type is that we can easily select variables of string type.

Here is an example of using the string type to select all string variables in a data frame.

df_new.select_dtypes(include='string')

d
0	abc
1	def
2	abc
3	def
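Worth noting: the new dtype can also be requested explicitly when constructing a Series, rather than going through convert_dtypes(). A minimal sketch, assuming pandas 1.0 or later is installed:

```python
import pandas as pd

# request the dedicated string dtype explicitly (pandas >= 1.0)
s = pd.Series(["abc", "def"], dtype="string")
print(s.dtype)  # string
```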

4. Pandas NA: A new way to deal with missing values

Pandas 1.0.0 also offers a new unified framework for dealing with missing values, as an experimental feature. Pandas 1.0.0 introduces a new pd.NA value to represent scalar missing values. pd.NA offers a single way to represent a missing value across data types. Until now, Pandas had different values for representing a missing value depending on the data type: NumPy’s np.nan for missing values in float data, np.nan or None for object data types, and pd.NaT for datetime-like data.

Let us see an example of missing data in Pandas. We create a data frame of different data types with missing values.

import numpy as np

df = pd.DataFrame({'a': [None, 1] * 2,
                   'b': [True, None] * 2,
                   'c': [np.nan, 1.0] * 2,
                   'd': ["abc", None] * 2})
df

We can see that the missing values are coded as NaN or None depending on the variable’s data type.


       a     b      c     d
0    NaN  True    NaN   abc
1    1.0  None    1.0  None
2    NaN  True    NaN   abc
3    1.0  None    1.0  None

df.dtypes

a    float64
b     object
c    float64
d     object
dtype: object

Let us print the missing value in the float column.

print(df.a[0])
nan

Let us print the missing value in the column containing boolean values.

print(df.b[1])
None

Starting from Pandas 1.0.0, we can convert the missing data to pd.NA using Pandas’ convenient function convert_dtypes(). This function converts missing data from the different data types to Pandas’ unified NA missing value.

Let us use the convert_dtypes() function on our data frame. It automatically infers the data types and converts the missing values to pd.NA.

df_new = df.convert_dtypes()
df_new.dtypes

We can see that in the new data frame all missing values from the different data types are represented as <NA>.

df_new

	a	b	c	d
0	<NA>	True	<NA>	abc
1	1	<NA>	1	<NA>
2	<NA>	True	<NA>	abc
3	1	<NA>	1	<NA>

With Pandas 1.0.0, we also get a dedicated boolean data type, in addition to the string data type described before, as the output of df_new.dtypes shows.

a      Int64
b    boolean
c      Int64
d     string
dtype: object

We can check that by printing the missing values from columns of different data types.

print(df_new.a[0])
<NA>
print(df_new.b[1])
<NA>

We can also verify that the missing value is the pd.NA singleton.

df_new.b[1] is pd.NA
True

An important feature to note is that, compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations.

For example, if you check “np.nan > 1”, you get “False”. With the new missing value, if you check “pd.NA > 1”, you get “<NA>”.
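A minimal sketch of this difference, assuming pandas 1.0 or later and NumPy are installed:

```python
import numpy as np
import pandas as pd

# np.nan silently compares as False
print(np.nan > 1)   # False

# pd.NA propagates as "unknown" in comparisons...
print(pd.NA > 1)    # <NA>

# ...and in arithmetic operations
print(pd.NA + 1)    # <NA>
```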

Learn more about the behavior of pd.NA here.

5. Extended verbose info output for DataFrame

Pandas’ info() function has extended verbose output now. When you use info(verbose=True), you get an index number for each column, i.e. a line number for each variable in the data frame.

seattle_temps.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8759 entries, 0 to 8758
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    8759 non-null   datetime64[ns]
 1   temp    8759 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB

In earlier Pandas versions, the info(verbose=True) output looked like this, without any line numbers.

RangeIndex: 8759 entries, 0 to 8758
Data columns (total 2 columns):
date    8759 non-null   datetime64[ns]
temp    8759 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB

6. New Enhancements with Pandas 1.0.0

Pandas 1.0.0 has added a number of enhancements to already existing Pandas functions. One of the most useful with Pandas 1.0.0 is the ignore_index keyword argument, which resets the index of the resulting data frame, for the following functions:

  • DataFrame.sort_values() and Series.sort_values()
  • DataFrame.sort_index() and Series.sort_index()
  • DataFrame.drop_duplicates()
What this means is that when you use sort_values() or drop_duplicates(), by default the index numbers of the result are jumbled up and not in order. With the new argument ignore_index=True, you get a data frame with a reset index.

Let us consider an example with drop_duplicates().

df = pd.DataFrame({'a': [2, 2, 3, 4],
                   'b': [2, 2, 3, 4],
                   'c': [2, 2, 3, 4]})
df
df

	a	b	c
0	2	2	2
1	2	2	2
2	3	3	3
3	4	4	4

Let us drop duplicate rows using the drop_duplicates() function in Pandas. Note that the index of the data frame after dropping the duplicates is 0, 2, 3, as row 1 was the duplicate.

df.drop_duplicates()

       a	b	c
0	2	2	2
2	3	3	3
3	4	4	4

Let us use the ignore_index=True argument with drop_duplicates(). We can see that the index is automatically reset.

df.drop_duplicates(ignore_index=True)

a	b	c
0	2	2	2
1	3	3	3
2	4	4	4
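The same keyword works for sorting as well. A short sketch, assuming pandas 1.0 or later:

```python
import pandas as pd

df2 = pd.DataFrame({'a': [4, 2, 3], 'b': [1, 2, 3]})

# without ignore_index, sorting keeps the original index: 1, 2, 0
print(df2.sort_values('a'))

# with ignore_index=True, the result gets a fresh 0, 1, 2 index
print(df2.sort_values('a', ignore_index=True))
```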

7. Pandas’ New Deprecation Policy

Pandas has a new “deprecation policy”. Starting with Pandas 1.0.0, the Pandas team will introduce deprecations in minor releases like 1.1.0 and 1.2.0, and those deprecations will be “enforced” in major releases like 1.0.0 and 2.0.0.

For example, some of the features deprecated with Pandas 1.0.0 are:

  • The pandas.util.testing module has been deprecated; use pandas.testing instead.
  • pandas.SparseArray has been deprecated; use pandas.arrays.SparseArray (arrays.SparseArray) instead.
  • The pandas.np submodule is now deprecated; use NumPy directly.
  • The pandas.datetime class is now deprecated; import from datetime instead.
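For instance, code that relied on pandas.util.testing can switch to the public pandas.testing module. A minimal sketch:

```python
import pandas as pd
from pandas import testing as tm  # replaces the deprecated pandas.util.testing

left = pd.DataFrame({'a': [1, 2]})
right = pd.DataFrame({'a': [1, 2]})

# raises an AssertionError if the two frames differ
tm.assert_frame_equal(left, right)
print("frames are equal")
```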