If you are like me, you might have missed that the fantastic Pandas team has released the new version Pandas 0.25.0.
As one would expect, there are quite a few new things in Pandas 0.25.0. A couple of new enhancements are around pandas’ groupby aggregation. Here are a few new things that look really interesting.
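To get started with pandas 0.25.0, upgrade your installation. Note that --upgrade fetches the latest release, so pin the version with pandas==0.25.0 if you want exactly this one.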
python3 -m pip install --upgrade pandas
Then load pandas and check the version.
import pandas as pd
# make sure the version is pandas 0.25.0
pd.__version__
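If a script relies on the features below, it can help to fail early on an older install. Here is a minimal sketch, assuming the version string starts with numeric major and minor parts (true for regular releases). The helper name has_min_version is made up for illustration and is not part of pandas.

```python
import pandas as pd

def has_min_version(version, minimum=(0, 25)):
    """Return True if a 'major.minor...' version string is at least `minimum`."""
    # Hypothetical helper: only the first two dot-separated parts are compared.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= minimum

if not has_min_version(pd.__version__):
    raise RuntimeError("pandas 0.25.0 or newer is required, found " + pd.__version__)
```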
Named Aggregation with groupby
One of the interesting updates is a new groupby behavior known as “named aggregation”. It lets you name the output columns when applying multiple aggregation functions to specific columns. Let us use a small toy dataframe to see how it works.
animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})
For example, if we want to compute both the minimum and maximum height for each animal kind and keep them as named columns in the result, we can use pd.NamedAgg as follows.
animals.groupby("kind").agg(
    min_height=pd.NamedAgg(column='height', aggfunc='min'),
    max_height=pd.NamedAgg(column='height', aggfunc='max'))
And we would get
      min_height  max_height
kind
cat          9.1         9.5
dog          6.0        34.0
In addition to using pd.NamedAgg explicitly, we can also provide the desired column names as the **kwargs to .agg. In that case, each value in **kwargs should be a tuple where the first element is the column selection and the second element is the aggregation function to apply.
We will get the same result as above using the following code
animals.groupby("kind").agg(
    min_height=('height', 'min'),
    max_height=('height', 'max'))
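To see why the two spellings behave the same, and how named aggregation extends beyond a single column, here is a small sketch on the same animals data. The average_weight output and the single-column variant at the end are additions for illustration and are not part of the example above.

```python
import pandas as pd

animals = pd.DataFrame({
    'kind': ['cat', 'dog', 'cat', 'dog'],
    'height': [9.1, 6.0, 9.5, 34.0],
    'weight': [7.9, 7.5, 9.9, 198.0],
})

# pd.NamedAgg is a namedtuple, so it compares equal to the plain tuple form.
same = pd.NamedAgg(column='height', aggfunc='min') == ('height', 'min')

# One .agg call can name outputs drawn from different columns.
summary = animals.groupby("kind").agg(
    min_height=('height', 'min'),
    max_height=('height', 'max'),
    average_weight=('weight', 'mean'),
)

# On a single selected column (SeriesGroupBy), the keyword values are just
# the aggregation functions, since there is no column left to choose.
heights = animals.groupby("kind")['height'].agg(min_height='min', max_height='max')
```

The keyword names become the output column names, so they have to be valid Python identifiers.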
Explode function to split list-like values to separate rows
Another interesting addition in Pandas 0.25.0 is the explode() method, available for both Series and DataFrame objects.
For example, you might have a dataframe with a column whose values contain multiple items separated by a delimiter. In effect, each value of the column is a list, and sometimes you want each element of that list on its own row.
The new explode() method is sort of like the separate_rows() function from tidyr in the tidyverse.
Here is an example of a dataframe with a comma-separated string in a column, and how explode() can be useful for splitting the values into separate rows.
df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
                   {'var1': 'd,e,f', 'var2': 2}])

    var1  var2
0  a,b,c     1
1  d,e,f     2
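We can split the comma-separated values into rows in two steps: str.split() turns each string into a list, and explode() gives each list element its own row. The other columns and the index label are repeated for every new row.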
df.assign(var1=df.var1.str.split(',')).explode('var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2
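Two details of explode() are easy to miss in the output above: the index labels repeat, and values that are not list-like are left alone. The sketch below illustrates both on a made-up Series, then shows one way (an optional extra step, not something explode() does for you) to get a fresh 0..n-1 index with reset_index.

```python
import pandas as pd

# A Series mixing lists, a scalar string, and an empty list.
s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])
exploded = s.explode()
# Rows that came from the same list share an index label; the scalar
# 'foo' is kept as is, and the empty list becomes a single NaN row.

df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
                   {'var1': 'd,e,f', 'var2': 2}])

# reset_index(drop=True) replaces the repeated labels with 0..n-1.
tidy = (df.assign(var1=df['var1'].str.split(','))
          .explode('var1')
          .reset_index(drop=True))
```

Keeping the repeated index can be useful too, since it records which original row each exploded row came from.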
SparseDataFrame is Deprecated
Another interesting change is that Pandas’ SparseDataFrame subclass (and SparseSeries) is deprecated. Instead, a regular DataFrame (or Series) can hold sparse values directly.
Instead of using SparseDataFrame to create a sparse dataframe like
# Old Way
pd.SparseDataFrame({"A": [0, 1]})
in the new version of pandas, one would use
# New Way
pd.DataFrame({"A": pd.SparseArray([0, 1])})
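With the new approach, the sparse-specific functionality lives on the .sparse accessor of an ordinary DataFrame. The sketch below is an illustration with made-up values. It reaches the array class through pd.arrays.SparseArray, which is assumed here to be the more portable spelling across pandas versions.

```python
import pandas as pd

# Four values, only one of which differs from the fill value 0.
df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 1, 0])})

# The column keeps a sparse dtype inside a regular DataFrame.
is_sparse = isinstance(df["A"].dtype, pd.SparseDtype)

density = df.sparse.density    # fraction of values actually stored
dense = df.sparse.to_dense()   # back to an ordinary dense DataFrame
```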
Similarly, there is a new way to build a dataframe from a SciPy sparse matrix in Pandas.
Instead of the old approach
# Old way
from scipy import sparse
mat = sparse.eye(3)
df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])
the new version of Pandas offers
# New way
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])
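The same accessor also goes the other way. As a rough sketch, assuming SciPy is installed, the dataframe built with from_spmatrix can be converted back to a SciPy matrix with df.sparse.to_coo(). The round-trip check is only there for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

mat = sparse.eye(3)
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])

# Convert back to a SciPy COO matrix through the same accessor.
back = df.sparse.to_coo()

# Compare the dense forms to confirm nothing was lost on the way.
round_trip_ok = np.array_equal(back.toarray(), mat.toarray())
```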