How To Randomly Select Rows in Pandas?

Creating unbiased training and testing data sets is key for Machine Learning tasks. Pandas’ sample function lets you randomly sample rows from a Pandas data frame, which helps with creating unbiased sampled datasets. It is also a convenient way to get a downsampled data frame to work with.

In this post, we will learn three ways of using Pandas’ sample to randomly select/sample/resample rows. Let us first load the data.

import pandas as pd

data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
# show the first five rows
print(gapminder.head())

How to get a random subset of data

To randomly select rows from a pandas dataframe, we can use the sample function from Pandas. For example, to randomly select n=3 rows, we use sample with the argument n.

>random_subset = gapminder.sample(n=3)
>print(random_subset.head())
        country  year         pop continent  lifeExp     gdpPercap
578       Ghana  1962   7355248.0    Africa   46.452   1190.041118
410     Denmark  1962   4646899.0    Europe   72.350  13583.313510
100  Bangladesh  1972  70759295.0      Asia   45.252    630.233627

Every time we run “sample”, we get 3 randomly selected rows from the Pandas dataframe, so the result generally changes from run to run.

How to sample rows with replacement in Pandas?

By default, pandas’ sample randomly selects rows without replacement. Sampling with replacement is very useful for statistical techniques like bootstrapping. If we want to randomly sample rows with replacement, we can set the argument “replace” to True.

For example, to randomly select n=3 rows with replacement from the gapminder data:

>sample_with_replacement = gapminder.sample(n=3,replace=True)
>print(sample_with_replacement)
           country  year         pop continent  lifeExp    gdpPercap
1416         Spain  1952  28549870.0    Europe   64.940  3834.034742
201   Burkina Faso  1997  10352843.0    Africa   50.324   946.294962
1187        Panama  2007   3242173.0  Americas   75.537  9809.185636

Here we sampled only 3 rows from a much larger data frame, so the same row is unlikely to show up twice, and it did not in this sample.
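To see repeated rows actually appear, one option is to draw more rows than the data frame contains, which is only possible with replace=True. The sketch below uses a small made-up data frame (toy_df is a hypothetical stand-in, not the gapminder data) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration (not the gapminder data)
toy_df = pd.DataFrame({"country": ["Ghana", "Denmark", "Spain"],
                       "year": [1962, 1962, 1952]})

# Draw 6 rows from a 3-row data frame; this only works with replace=True
bootstrap_sample = toy_df.sample(n=6, replace=True)
print(bootstrap_sample)

# With 6 draws from 3 rows, at least one index label has to repeat
print(bootstrap_sample.index.duplicated().any())
```

The repeated index labels in the printed output show which original rows were drawn more than once.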

How to randomly select a percentage of rows in Pandas dataframe?

Often, you may want to sample a percentage of data rather than a fixed number of rows. Pandas’ sample has the argument “frac” that lets you specify the fraction of rows that you want to randomly select from the dataframe. The value is a fraction rather than a percentage, so frac=0.003 selects 0.3% of the rows.

>fraction_of_rows = gapminder.sample(frac=0.003)
>print(fraction_of_rows)
          country  year         pop continent  lifeExp     gdpPercap
903         Libya  1967   1759224.0    Africa   50.227  18772.751690
1221  Philippines  1997  75012988.0      Asia   68.564   2536.534925
1565      Tunisia  1977   6005061.0    Africa   59.837   3120.876811
1003     Mongolia  1987   2015133.0      Asia   60.222   2338.008304
1157         Oman  1977   1004533.0      Asia   57.367  11848.343920

We can use the replace=True option together with the frac option to get a percentage of rows with replacement. Note that we cannot combine the frac option and the n option.
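Both points can be checked with a short sketch. It assumes a hypothetical 10-row data frame called toy_df instead of the gapminder data, so the expected row count is easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy data with 10 rows, used only for illustration
toy_df = pd.DataFrame({"value": range(10)})

# 50% of the 10 rows, drawn with replacement, gives 5 rows
half_with_replacement = toy_df.sample(frac=0.5, replace=True)
print(len(half_with_replacement))

# Passing both n and frac is rejected by pandas with a ValueError
try:
    toy_df.sample(n=3, frac=0.5)
except ValueError as err:
    print("ValueError:", err)
```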

Another useful argument to sample is random_state. We can reproduce the same random sample by setting the random number seed. For example, by specifying ‘random_state=99’ as an argument to sample, we get the same random sample every time, which helps us reproduce results.
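Here is a minimal sketch of random_state in action, again on a hypothetical toy_df rather than the gapminder data. The second half shows one possible way to build a reproducible training/testing split with sample; the 80/20 ratio and the variable names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data with 100 rows, used only for illustration
toy_df = pd.DataFrame({"value": range(100)})

# The same seed gives the same rows in the same order
first_draw = toy_df.sample(n=3, random_state=99)
second_draw = toy_df.sample(n=3, random_state=99)
print(first_draw.equals(second_draw))

# One way to get a reproducible split: sample 80% of the rows for training,
# then keep the remaining rows for testing (assumes unique index labels)
train = toy_df.sample(frac=0.8, random_state=99)
test = toy_df.drop(train.index)
print(len(train), len(test))
```

Dropping the sampled index labels only works as intended when the index is unique, which is the case for the default index that pandas creates.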