Creating unbiased training and testing data sets is key for all Machine Learning tasks. Pandas’ sample function lets you randomly sample rows from a Pandas data frame, which helps in creating unbiased sampled datasets. It is also a convenient way to get a downsampled data frame to work with.
In this post, we will learn three ways of using Pandas’ sample to randomly select/sample/resample rows. Let us first load the data.
import pandas as pd

data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
print(gapminder.head())
How to get a random subset of data
To randomly select rows from a pandas dataframe, we can use the sample function from Pandas. For example, to randomly select 3 rows, we call sample with the argument n=3.
>random_subset = gapminder.sample(n=3)
>print(random_subset.head())
        country  year         pop continent  lifeExp     gdpPercap
578       Ghana  1962   7355248.0    Africa   46.452   1190.041118
410     Denmark  1962   4646899.0    Europe   72.350  13583.313510
100  Bangladesh  1972  70759295.0      Asia   45.252    630.233627
Every time we run sample, we get a different random selection of 3 rows from the Pandas dataframe.
How to sample rows with replacement in Pandas?
By default, pandas’ sample randomly selects rows without replacement. Sampling with replacement is very useful for statistical techniques like bootstrapping. If we want to randomly sample rows with replacement, we can set the argument “replace” to True.
For example, to randomly select n=3 rows with replacement from the gapminder data:
>sample_with_replacement = gapminder.sample(n=3,replace=True)
>print(sample_with_replacement)
           country  year         pop continent  lifeExp    gdpPercap
1416         Spain  1952  28549870.0    Europe   64.940  3834.034742
201   Burkina Faso  1997  10352843.0    Africa   50.324   946.294962
1187        Panama  2007   3242173.0  Americas   75.537  9809.185636
With only 3 draws from a data frame of this size, picking the same row twice is unlikely, so no row is repeated in this sample.
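To actually see a repeated row, we can draw more rows than the data frame holds. The sketch below is only an illustration: it uses a small made-up data frame called toy (not the gapminder data) so that it runs without a download, and the values in it are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for the gapminder data: a tiny frame with 5 rows.
toy = pd.DataFrame({
    "country": ["Ghana", "Denmark", "Spain", "Panama", "Libya"],
    "lifeExp": [46.452, 72.350, 64.940, 75.537, 50.227],
})

# Draw 8 rows from a 5-row frame; this is only possible with replace=True.
resampled = toy.sample(n=8, replace=True)

# With 8 draws from 5 rows, at least one index label has to appear twice.
has_repeats = resampled.index.duplicated().any()
print(resampled)
print("repeated rows present:", has_repeats)
```

The repeated index labels in the printed output are the rows that were drawn more than once.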
How to randomly select a percentage of rows in Pandas dataframe?
Often, you may want to sample a percentage of the data rather than a fixed number of rows. Pandas’ sample has an argument “frac” that takes the fraction of rows to select, written as a decimal. For example, frac=0.003 randomly selects 0.3% of the rows.
>fraction_of_rows = gapminder.sample(frac=0.003)
>print(fraction_of_rows)
          country  year         pop continent  lifeExp     gdpPercap
903         Libya  1967   1759224.0    Africa   50.227  18772.751690
1221  Philippines  1997  75012988.0      Asia   68.564   2536.534925
1565      Tunisia  1977   6005061.0    Africa   59.837   3120.876811
1003     Mongolia  1987   2015133.0      Asia   60.222   2338.008304
1157         Oman  1977   1004533.0      Asia   57.367  11848.343920
We can combine the replace=True option with the frac option to get a percentage of rows with replacement. Note that we cannot combine the frac option with the n option; sample accepts one or the other, not both.
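Both points can be checked with a short sketch. It assumes a small made-up data frame called toy in place of the gapminder data, so the numbers here are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the gapminder data: 10 rows, one column.
toy = pd.DataFrame({"value": range(10)})

# 50% of the rows, drawn with replacement.
half_with_replacement = toy.sample(frac=0.5, replace=True)
print(len(half_with_replacement))  # 5

# Passing both n and frac is rejected with a ValueError.
try:
    toy.sample(n=3, frac=0.5)
    combined_ok = True
except ValueError as err:
    combined_ok = False
    print("ValueError:", err)
```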
Another useful argument to sample is random_state. Setting it fixes the seed of the random number generator, so the same rows are selected on every run. For example, passing ‘random_state=99’ to sample returns the same random sample every time, which makes the results reproducible.
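A minimal sketch of this, again using a made-up data frame called toy rather than the gapminder data:

```python
import pandas as pd

# Hypothetical stand-in for the gapminder data: 100 rows, one column.
toy = pd.DataFrame({"value": range(100)})

# Two independent calls with the same seed.
first = toy.sample(n=3, random_state=99)
second = toy.sample(n=3, random_state=99)

# The two samples contain the same rows in the same order.
same_rows = first.equals(second)
print(same_rows)  # True
```

Leaving random_state out, or changing its value, would generally give a different selection of rows.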