How To Randomly Add NaN to Pandas Dataframe?

In this post we will see an example of how to introduce missing values, i.e. NaNs, randomly in a data frame using Pandas.

Sometimes while testing a method, you might want to create a Pandas dataframe with NaNs randomly distributed. Here we show how to do it.

Let us load the packages we need.

import numpy as np
import pandas as pd

Let us use the gapminder data in wide form to introduce NaNs randomly.

data_url = "https://goo.gl/ioc2Td"
gapminder = pd.read_csv(data_url)
print(gapminder.iloc[0:5,0:4])

  continent       country  gdpPercap_1952  gdpPercap_1957
0    Africa       Algeria     2449.008185     3013.976023
1    Africa        Angola     3520.610273     3827.940465
2    Africa         Benin     1062.752200      959.601080
3    Africa      Botswana      851.241141      918.232535
4    Africa  Burkina Faso      543.255241      617.183465

Let us drop the two non-numeric columns, continent and country, from the dataframe using Pandas “drop” function. The resulting dataframe now contains only numeric data.

gapminder = gapminder.drop(['continent','country'], axis=1)

We can see that there are 5112 data points in total.

gapminder.count().sum()

Create a logical Pandas dataframe with fixed percent of TRUE/FALSEs

Let us create a boolean NumPy array of the same shape as our Pandas dataframe. We build the 2D array so that about 50% of its elements are True and the rest are False. We use NumPy’s random module to draw uniform random numbers between 0 and 1 and compare them with 0.5 to get the boolean array.

nan_mat = np.random.random(gapminder.shape)<0.5
nan_mat

We can take a peek at the boolean array.

array([[False,  True,  True, ...,  True,  True,  True],
       [False,  True,  True, ...,  True, False,  True],
       [ True, False,  True, ..., False,  True, False],
       ...,
       [ True, False,  True, ..., False,  True,  True],
       [ True,  True, False, ..., False,  True, False],
       [ True,  True,  True, ..., False, False,  True]])

We can get the total number of True elements, i.e. the total number of NaNs we will be adding to the dataframe, using NumPy’s sum function.

nan_mat.sum()
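
Because the boolean array is random, the count above changes on every run. As a minimal sketch, assuming a reproducible result is wanted, the same array can be drawn from a seeded NumPy generator. The shape, the seed value and the name frac below are illustrative assumptions, not part of the original example.

```python
import numpy as np

# Hypothetical stand-in for gapminder.shape; any (rows, cols) tuple works.
shape = (142, 36)
frac = 0.5  # target share of True values (assumed name)

# A seeded generator gives the same boolean array on every run.
rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily
nan_mat = rng.random(shape) < frac

# Number and share of True values; the share is close to frac, not exactly frac.
print(nan_mat.sum(), nan_mat.mean())
```

Changing frac changes the expected share of NaNs that will be introduced later.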

Introduce random NAs using Pandas mask() function

Let us use Pandas’ mask() function to replace values with NaNs. The Pandas documentation describes mask() as follows:

Where cond is False, keep the original value. Where True, replace with corresponding value from other.

Pandas’ mask function checks each element in the dataframe: it keeps the element if the condition is False and changes it to NaN if the condition is True. We provide the condition using the boolean array we created.

gapminder_NaN = gapminder.mask(nan_mat)
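
To make the behaviour of mask() easier to see, here is a small sketch on a hand-made dataframe with a fixed condition instead of a random one. The names df, cond and masked are made up for illustration.

```python
import numpy as np
import pandas as pd

# A tiny dataframe with float values.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Boolean array of the same shape: True marks the cells to replace.
cond = np.array([[True, False],
                 [False, False],
                 [False, True]])

# Cells where cond is True become NaN; all others keep their value.
masked = df.mask(cond)
print(masked)
```

Note that mask() returns a new dataframe here, so df itself keeps all of its original values.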

We could also have created the boolean matrix directly inside the mask() call.

gapminder_NaN = gapminder.mask(np.random.random(gapminder.shape)<0.5)
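
If this is needed in several tests, the one-liner could be wrapped in a small helper. The following is only a sketch under assumptions: the function name add_random_nans and its parameters frac and seed are hypothetical, and the toy dataframe stands in for the gapminder data.

```python
import numpy as np
import pandas as pd

def add_random_nans(df, frac=0.5, seed=None):
    """Return a copy of df where each cell is set to NaN with probability frac."""
    rng = np.random.default_rng(seed)
    return df.mask(rng.random(df.shape) < frac)

# Toy float dataframe with 200 rows and 10 columns.
toy = pd.DataFrame(np.arange(2000, dtype=float).reshape(200, 10))
toy_nan = add_random_nans(toy, frac=0.2, seed=0)

# Observed share of NaNs; roughly 0.2, but not exactly.
print(toy_nan.isna().to_numpy().mean())
```

Each cell is masked independently, so the share of NaNs only approximates frac.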

We can verify that the dataframe has NaNs introduced randomly as we intended.

gapminder_NaN.iloc[0:3,0:5]

   gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
0     2449.008185             NaN             NaN     3246.991771     4182.663766
1     3520.610273             NaN             NaN             NaN             NaN
2             NaN       959.60108             NaN     1035.831411             NaN

We can count the total number of nulls or NaNs and see that it is about 50% of the data points.

gapminder_NaN.isnull().sum(axis = 0).sum()
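
The approach above gives only an approximate share of NaNs. If an exact number is required, one possible variant (not the method used in this post) is to pick that many cell positions without replacement and mask exactly those. The function name add_exact_nans and the parameter n_nan are hypothetical.

```python
import numpy as np
import pandas as pd

def add_exact_nans(df, n_nan, seed=None):
    """Return a copy of df with exactly n_nan randomly chosen cells set to NaN."""
    rng = np.random.default_rng(seed)
    # Start with an all-False flat mask, then mark n_nan distinct positions.
    flat = np.zeros(df.size, dtype=bool)
    flat[rng.choice(df.size, size=n_nan, replace=False)] = True
    return df.mask(flat.reshape(df.shape))

# Toy dataframe with 40 cells and no missing values.
toy = pd.DataFrame(np.ones((10, 4)))
toy_nan = add_exact_nans(toy, n_nan=20, seed=1)

print(toy_nan.isna().sum().sum())  # 20
```

This assumes the dataframe has no missing values to begin with; otherwise some chosen cells may already be NaN and the final count can be larger than n_nan.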

In summary, we have added NaNs randomly to a Pandas dataframe. We used NumPy’s random module to create a random boolean array with approximately the desired share of True values, and Pandas’ mask function to turn the corresponding entries of the dataframe into NaNs.
