Pandas 1.0.0 is ready for prime time. The Pandas project has come a long way since the early release of Pandas version 0.4 in 2011. Back then it had contributions from 2 developers, including Wes McKinney; now Pandas has over 300 contributors.
The latest version of Pandas can be installed with standard package managers such as conda (Anaconda or Miniconda) or with pip from PyPI.
The Pandas team recommends that users first upgrade to pandas 0.25, if they are not already on it, and make sure their existing code does not break before upgrading to pandas 1.0.
# load pandas
import pandas as pd
# check pandas version
print(pd.__version__)
1.0.0
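One way to check that existing code is ready for the upgrade is to run it with warnings escalated to errors, so that deprecated calls fail loudly instead of scrolling past. The following is a minimal sketch, not an official pandas tool: `run_strict` and `legacy_step` are hypothetical names, and `legacy_step` only simulates a deprecated call by emitting a FutureWarning itself.

```python
import warnings


def run_strict(func, *args, **kwargs):
    """Run func with FutureWarning and DeprecationWarning raised as errors."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", FutureWarning)
        warnings.simplefilter("error", DeprecationWarning)
        return func(*args, **kwargs)


def legacy_step():
    # Hypothetical stand-in for code that calls a deprecated API.
    warnings.warn("this call is deprecated", FutureWarning)
    return "done"


try:
    run_strict(legacy_step)
    status = "clean"
except FutureWarning as exc:
    status = f"deprecated usage found: {exc}"

print(status)
```

In a real project you would pass your own functions to such a helper, or run your test suite with an equivalent warning filter.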
Let us see the top features of the new Pandas version 1.0.0.
1. Pandas rolling function gets faster with Numba
With Pandas 1.0, the apply() method of a rolling window can make use of Numba (if installed) instead of Cython and be faster. To use Numba inside apply(), specify engine='numba' together with raw=True and, optionally, the engine_kwargs argument. With Numba, apply() is much faster on larger data sets (for example, a rolling function over a million data points).
Let us try an example using a window function on a large dataset, taken from the Pandas documentation.
data = pd.Series(range(1_000_000))
data.head()
0    0
1    1
2    2
3    3
4    4
dtype: int64
Let us create a rolling window of length 10 on the data.
roll = data.rolling(10)
Let us write a custom function to apply to each rolling window.
import numpy as np

def f(x):
    return np.sum(x) + 5
The apply() function for rolling windows can make use of Numba instead of Cython, if Numba is already installed, and make the computation faster. We can use Numba by specifying engine='numba' inside apply(). When you call apply() with the Numba option for the first time, it will be slightly slow because the function has to be compiled first.
# Run the first time, compilation time will affect performance
%timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True)
3.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Subsequent use of Numba will be faster as the function is cached.
# Function is cached and performance will improve
%timeit roll.apply(f, engine='numba', raw=True)
220 ms ± 7.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For comparison, here is the timing with the engine='cython' option.
%timeit roll.apply(f, engine='cython', raw=True)
4.64 s ± 86.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
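Since Numba is an optional dependency, code that has to run on machines without it may want a fallback. The sketch below is one possible way to do that, assuming pandas raises an ImportError when Numba is unavailable; the helper names `rolling_apply_fast` and `add_five_to_sum` are made up for this illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def add_five_to_sum(x):
    # x arrives as a NumPy array because raw=True is used below.
    return np.sum(x) + 5


def rolling_apply_fast(series, func, window):
    """Apply func over a rolling window, preferring Numba when it is usable."""
    roll = series.rolling(window)
    try:
        return roll.apply(func, engine="numba", raw=True)
    except ImportError:
        # Numba is missing (or too old for this pandas); use the default engine.
        return roll.apply(func, engine="cython", raw=True)


small = pd.Series(range(1_000), dtype="float64")
result = rolling_apply_fast(small, add_five_to_sum, window=10)
print(result.tail(3))
```

Both engines should give the same values; only the run time differs.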
2. Convert a Dataframe to Markdown
Pandas 1.0 has a new function, to_markdown(), that converts a Pandas data frame to a Markdown table. For me, to_markdown() did not work initially and complained that "tabulate" was missing. After installing tabulate with conda install tabulate, to_markdown() worked fine.
Let us try an example using Pandas' to_markdown() function.
from vega_datasets import data
seattle_temps = data.seattle_temps()
print(seattle_temps.head())
We get a nicely formatted table as the result with Pandas 1.0.0.
print(seattle_temps.head().to_markdown())
|    | date                |   temp |
|---:|:--------------------|-------:|
|  0 | 2010-01-01 00:00:00 |   39.4 |
|  1 | 2010-01-01 01:00:00 |   39.2 |
|  2 | 2010-01-01 02:00:00 |   39   |
|  3 | 2010-01-01 03:00:00 |   38.9 |
|  4 | 2010-01-01 04:00:00 |   38.8 |
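Because to_markdown() depends on the optional tabulate package, a script can guard the call instead of failing. This is only a sketch: the `temps` frame is made-up data so that the example does not need vega_datasets.

```python
import pandas as pd

# Small made-up frame so the example does not depend on vega_datasets.
temps = pd.DataFrame({"hour": [0, 1, 2], "temp": [39.4, 39.2, 39.0]})

try:
    table = temps.to_markdown()
except ImportError:
    # to_markdown() relies on the optional "tabulate" package.
    table = None
    print("tabulate is not installed; try `pip install tabulate`")
else:
    print(table)
```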
3. Dedicated String Type
With Pandas 1.0.0, we get a dedicated StringDtype for strings. Before this, string variables were all stored under the "object" dtype. Now string variables get a dedicated type.
Let us try an example data frame with a string variable.
df = pd.DataFrame({'a': [1, 2] * 2,
                   'b': [True, False] * 2,
                   'c': [1.0, 2.0] * 2,
                   'd': ["abc", "def"] * 2})
df
We can check that Pandas assigns "object" as the data type for the string variable "d" in our example.
df.dtypes
a      int64
b       bool
c    float64
d     object
dtype: object
Pandas 1.0.0 offers a new function, convert_dtypes(). When applied to the data frame, it gives the dedicated string data type to string variables.
df_new = df.convert_dtypes()
df_new.dtypes
a      Int64
b    boolean
c      Int64
d     string
dtype: object
One of the biggest advantages of having dedicated string data type is that we can select variables that are string type easily.
Here is an example of using String type to select all string variables in a data frame.
df_new.select_dtypes(include='string')
     d
0  abc
1  def
2  abc
3  def
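The string type can also be requested directly with dtype="string", without going through convert_dtypes(). The following sketch uses made-up data (`names`, `frame`) to show that the usual .str methods work and that missing entries stay missing.

```python
import pandas as pd

names = pd.Series(["abc", None, "def"], dtype="string")

upper = names.str.upper()      # the missing entry stays missing
missing_mask = names.isna()    # True only for the missing entry

frame = pd.DataFrame({"label": names, "count": [1, 2, 3]})
string_columns = frame.select_dtypes(include="string").columns.tolist()

print(upper)
print(string_columns)
```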
4. Pandas NA: A new way to deal with missing values
Pandas 1.0.0 also offers a new unified framework for dealing with missing values as an experimental feature. Pandas 1.0.0 introduces a new pd.NA value to represent scalar missing values. pd.NA offers a single way to represent a missing value across data types. Until now, Pandas had different values for representing a missing value depending on the data type. For example, Pandas used NumPy's np.nan for missing values in float data, np.nan or None for object data types, and pd.NaT for datetime-like data.
Let us see an example of missing data in Pandas and create a data frame with different data types, each with missing values.
df = pd.DataFrame({'a': [None, 1] * 2,
                   'b': [True, None] * 2,
                   'c': [np.nan, 1.0] * 2,
                   'd': ["abc", None] * 2})
df
We can see that the missing values are coded as NaN or None depending on the variable’s data type.
     a     b    c     d
0  NaN  True  NaN   abc
1  1.0  None  1.0  None
2  NaN  True  NaN   abc
3  1.0  None  1.0  None
df.dtypes
a    float64
b     object
c    float64
d     object
dtype: object
Let us print the missing value from the float column.
print(df.a[0])
nan
Let us print the missing value from the boolean column.
print(df.b[1])
None
Starting from Pandas 1.0.0, we can convert the missing data to pd.NA using Pandas' convenient function convert_dtypes(). This function converts missing data from different data types to Pandas' unified NA missing value.
Let us use the convert_dtypes() function on our data frame. This automatically infers the data types and converts the missing values to pd.NA.
df_new = df.convert_dtypes()
df_new.dtypes
We can see that in the new dataframe all missing values from different data types are represented as <NA>.
df_new
      a     b     c     d
0  <NA>  True  <NA>   abc
1     1  <NA>     1  <NA>
2  <NA>  True  <NA>   abc
3     1  <NA>     1  <NA>
With Pandas 1.0.0, we also get a dedicated boolean data type, in addition to the string data type described before.
a      Int64
b    boolean
c      Int64
d     string
dtype: object
We can check that by printing missing values from a specific data type.
print(df_new.a[0])
<NA>
print(df_new.b[1])
<NA>
We can also verify that the missing value is the pd.NA object itself.
df_new.b[1] is pd.NA
True
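The nullable types can also be requested explicitly when creating a Series, which makes it easy to see that an integer column stays an integer column even when it has missing values. This is a small sketch with made-up data; note that the nullable integer type is spelled "Int64" with a capital "I".

```python
import pandas as pd

ints = pd.Series([1, None, 3], dtype="Int64")            # nullable integer
flags = pd.Series([True, None, False], dtype="boolean")  # nullable boolean

print(ints.dtype, flags.dtype)
print(ints[1] is pd.NA, flags[1] is pd.NA)

# Reductions skip missing values by default.
total = ints.sum()
print(total)
```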
An important point to note is that pd.NA behaves differently from np.nan in certain operations. In addition to arithmetic operations, pd.NA also propagates as "missing" or "unknown" in comparison operations.
For example, if you check "np.nan > 1", you get "False". With the new missing value, if you check "pd.NA > 1", you get "<NA>".
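The difference is easy to try out interactively. The sketch below compares np.nan and pd.NA in a comparison and in arithmetic, and also shows the logical operators, where pd.NA gives a definite answer only when the result does not depend on the missing operand. The variable names are just for illustration.

```python
import numpy as np
import pandas as pd

# np.nan compares as False; pd.NA propagates as "unknown".
nan_cmp = np.nan > 1
na_cmp = pd.NA > 1
na_sum = pd.NA + 1

# Logical operations use three-valued logic: the result is known only
# when it does not depend on the missing operand.
or_true = pd.NA | True
and_false = pd.NA & False
or_false = pd.NA | False

print(nan_cmp, na_cmp, na_sum, or_true, and_false, or_false)
```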
You can learn more about the behaviour of pd.NA in the missing-data section of the Pandas user guide.
5. Extended verbose info output for DataFrame
Pandas' info() function now has extended verbose output. When you use info(verbose=True), you now get an index number for each column, i.e. a line number for each variable in the data frame.
seattle_temps.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8759 entries, 0 to 8758
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   date    8759 non-null   datetime64[ns]
 1   temp    8759 non-null   float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB
In earlier Pandas versions, the output of info(verbose=True) looked like this, without any line numbers.
RangeIndex: 8759 entries, 0 to 8758
Data columns (total 2 columns):
date    8759 non-null datetime64[ns]
temp    8759 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB
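info() prints its summary rather than returning it, so if you want the text (for a log file or a report, say) you can pass a buffer through the buf argument. A minimal sketch with a small made-up data frame:

```python
import io
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2010-01-01", periods=3, freq="D"),
                   "temp": [39.4, 39.2, 39.0]})

buffer = io.StringIO()
df.info(verbose=True, buf=buffer)   # write the summary into the buffer
summary = buffer.getvalue()

print(summary)
```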
6. New Enhancements with Pandas 1.0.0
Pandas 1.0.0 adds a number of enhancements to existing Pandas functions. One commonly useful addition is the ignore_index keyword, which resets the index of the resulting data frame, for the following functions:
- DataFrame.sort_values() and Series.sort_values()
- DataFrame.sort_index() and Series.sort_index()
- DataFrame.drop_duplicates()
What this means is that when you use DataFrame.sort_values() or DataFrame.drop_duplicates(), by default the rows keep their original index labels, so the index ends up out of order or with gaps. With the new argument ignore_index=True, you get a dataframe whose index is reset to 0, 1, 2, and so on.
Let us consider an example with drop_duplicates()
df = pd.DataFrame({'a': [2, 2, 3, 4],
                   'b': [2, 2, 3, 4],
                   'c': [2, 2, 3, 4]})
df
   a  b  c
0  2  2  2
1  2  2  2
2  3  3  3
3  4  4  4
Let us drop duplicate rows using the drop_duplicates() function in Pandas. Note that the index of the dataframe after dropping the duplicates is 0, 2, 3, as row 1 was the duplicate.
df.drop_duplicates()
   a  b  c
0  2  2  2
2  3  3  3
3  4  4  4
Let us use ignore_index=True argument with drop_duplicates(). We can see that we automatically get our index reset.
df.drop_duplicates(ignore_index=True)
   a  b  c
0  2  2  2
1  3  3  3
2  4  4  4
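The same keyword works with sort_values(). Here is a short sketch with made-up data, comparing the default behaviour with ignore_index=True; the result should match what you would previously get by chaining reset_index(drop=True).

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": ["x", "y", "z"]})

kept = df.sort_values("a")                      # original labels travel with the rows
reset = df.sort_values("a", ignore_index=True)  # labels renumbered from 0

print(kept.index.tolist())   # [1, 2, 0]
print(reset.index.tolist())  # [0, 1, 2]
```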
7. Pandas’ New Deprecation Policy
Pandas has a new "Deprecation Policy". Starting with Pandas 1.0.0, the Pandas team will introduce deprecations in minor releases like 1.1.0 and 1.2.0, and the deprecations will be "enforced" in major releases like 1.0.0 and 2.0.0.
For example some of the features that are deprecated with Pandas 1.0.0 are