In this post, we will see an example of how to introduce missing values, i.e. NaNs, randomly in a data frame using Pandas.
Sometimes, while testing a method, you might want to create a Pandas dataframe with NaNs randomly distributed. Here we show how to do it.
Let us load the packages we need
import numpy as np
import pandas as pd
import seaborn as sns
Let us use the gapminder data in wide form to introduce NaNs randomly.
data_url = "https://goo.gl/ioc2Td"
gapminder = pd.read_csv(data_url)
print(gapminder.iloc[0:5,0:4])
  continent       country  gdpPercap_1952  gdpPercap_1957
0    Africa       Algeria     2449.008185     3013.976023
1    Africa        Angola     3520.610273     3827.940465
2    Africa         Benin     1062.752200      959.601080
3    Africa      Botswana      851.241141      918.232535
4    Africa  Burkina Faso      543.255241      617.183465
Let us drop the two non-numeric columns, continent and country, from the dataframe using Pandas’ “drop” function. The resulting dataframe now contains only numerical data.
gapminder = gapminder.drop(['continent','country'], axis=1)
We can see that there are 5112 data points in total.
gapminder.count().sum()
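Note that count() tallies only non-missing cells, so the total above equals the number of cells only because the data has no NaNs yet. A minimal sketch on a small made-up dataframe (the column names and values are hypothetical, not gapminder data) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

n_non_missing = df.count().sum()  # counts only non-NaN cells
n_cells = df.size                 # rows * columns, NaN or not

print(n_non_missing)  # 5
print(n_cells)        # 6
```

Once NaNs are introduced, the same count().sum() call reports how many values survived.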
Create a boolean NumPy array with a given percent of True/False values
Let us create a boolean NumPy array of the same shape as our Pandas dataframe. We create the boolean 2d-array such that about 50% of its elements are True and the rest are False. We will use NumPy’s random module to generate random numbers between 0 and 1 and compare them against a threshold of 0.5 to create the boolean array.
nan_mat = np.random.random(gapminder.shape)<0.5
nan_mat
We can take a peek at the boolean array.
array([[False,  True,  True, ...,  True,  True,  True],
       [False,  True,  True, ...,  True, False,  True],
       [ True, False,  True, ..., False,  True, False],
       ...,
       [ True, False,  True, ..., False,  True,  True],
       [ True,  True, False, ..., False,  True, False],
       [ True,  True,  True, ..., False, False,  True]])
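Because the array is random, it will differ on every run. If you want the same NaN pattern each time, for example in a test, one option is a seeded generator. The sketch below assumes NumPy’s default_rng API; the seed 42 and the stand-in shape with 5112 cells are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical seed and shape, chosen only for illustration.
seed = 42
shape = (142, 36)

rng = np.random.default_rng(seed)

# rng.random draws uniform values in [0, 1); comparing with 0.5
# marks each cell True with probability 0.5.
nan_mat = rng.random(shape) < 0.5

# A fresh generator with the same seed reproduces the same array.
nan_mat_again = np.random.default_rng(seed).random(shape) < 0.5
print(np.array_equal(nan_mat, nan_mat_again))  # True
```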
We can get the total number of True elements, i.e. the total number of NaNs we will be adding to the dataframe, using NumPy’s sum function.
nan_mat.sum()
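The count is easier to judge as a fraction of all cells. Since True counts as 1 and False as 0, the mean of a boolean array is the fraction of True values. A small sketch, with a hypothetical seed and a stand-in shape rather than the gapminder dataframe:

```python
import numpy as np

rng = np.random.default_rng(0)          # hypothetical seed
nan_mat = rng.random((142, 36)) < 0.5   # stand-in shape with 5112 cells

n_true = nan_mat.sum()      # number of cells that would become NaN
frac_true = nan_mat.mean()  # the same count as a fraction of all cells

print(n_true, round(float(frac_true), 3))
```

The fraction should land close to 0.5, but not exactly on it, because each cell is drawn independently.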
Introduce random NAs using Pandas mask() function
Let us use Pandas’ mask() function to replace values with NAs. The Pandas documentation for mask() describes its condition argument as follows:
Where cond is False, keep the original value. Where True, replace with corresponding value from other.
Pandas’ mask function checks each element in the dataframe. It keeps the element if the condition is False and changes it to NaN if the condition is True. We provide the condition using the boolean array we created.
gapminder_NaN = gapminder.mask(nan_mat)
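To see the rule on something small enough to check by eye, here is a minimal sketch with a made-up 3x2 dataframe and a hand-written condition (both hypothetical, not part of the gapminder example):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, small enough to inspect by hand.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# True marks the cells to replace: row 0 of "a" and row 2 of "b".
cond = np.array([[True, False],
                 [False, False],
                 [False, True]])

# With no `other` argument, masked cells become NaN.
masked = df.mask(cond)
print(masked)
```

mask() returns a new dataframe, so the original df is left unchanged.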
We could also have created the boolean matrix directly inside the mask function.
gapminder_NaN = gapminder.mask(np.random.random(gapminder.shape)<0.5)
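The threshold approach gives roughly, not exactly, the requested fraction of NaNs. If an exact count matters, one possible variant, which is not the method used in this post, is to pick the cell positions without replacement. The sketch below uses a made-up 10x4 dataframe, a hypothetical seed, and NumPy’s Generator.choice:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # hypothetical seed
# Hypothetical toy data: 10 rows x 4 columns = 40 cells.
df = pd.DataFrame(rng.normal(size=(10, 4)), columns=list("abcd"))

frac = 0.5
n_nan = int(round(frac * df.size))  # exact number of cells to blank out

# Choose n_nan distinct flat positions, then reshape to the dataframe shape.
flat = np.zeros(df.size, dtype=bool)
flat[rng.choice(df.size, size=n_nan, replace=False)] = True
exact_mask = flat.reshape(df.shape)

df_nan = df.mask(exact_mask)
print(df_nan.isna().sum().sum())  # 20
```

The trade-off is that cells are no longer masked independently of each other, which may or may not matter for the method you are testing.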
We can verify that the dataframe has NaNs introduced randomly as we intended.
gapminder_NaN.iloc[0:3,0:5]
   gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
0     2449.008185             NaN             NaN     3246.991771     4182.663766
1     3520.610273             NaN             NaN             NaN             NaN
2             NaN       959.60108             NaN     1035.831411             NaN
We can count the total number of nulls or NaNs and see that it is about 50% of the data points.
gapminder_NaN.isnull().sum(axis = 0).sum()
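To read the result directly as a percentage, overall or per column, the mean of the null indicator can be used instead of the sum. A minimal sketch on made-up data (the seed, shape, and column names are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # hypothetical seed
# Hypothetical toy data standing in for the gapminder dataframe.
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x", "y", "z"])
df_nan = df.mask(rng.random(df.shape) < 0.5)

# Fraction of missing cells over the whole dataframe.
overall = df_nan.isna().to_numpy().mean()

# Fraction of missing cells in each column.
per_column = df_nan.isna().mean()

print(round(float(overall), 3))
print(per_column.round(3))
```

Each column should sit near 0.5, with some scatter because the cells are masked independently.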
In summary, we have added NaNs randomly to a Pandas dataframe. We used NumPy’s random module to create a random boolean array with approximately the desired fraction of True values, and Pandas’ mask function to turn the corresponding values in the dataframe into NaNs.