Python Built-in Datasets

Scikit-learn Datasets
Scikit-learn, a machine learning toolkit in Python, offers a number of datasets ready to use for learning ML and developing new methodologies. If you are new to sklearn, it may be a little harder to wrap your head around the available datasets, what information is available as part of each dataset, and how to access the datasets. Scikit-learn’s user guide has a great section on the datasets. Here is a quick summary of the available datasets and how to get started using them quickly.

First, let us import scikit-learn and verify its version. Here we have sklearn version 1.0.

# import scikit-learn
import sklearn

# Check the sklearn version
sklearn.__version__

'1.0'

Scikit-learn’s “datasets” package offers us ways to get datasets from sklearn. Broadly, scikit-learn has three categories of datasets: small “Toy Datasets” that are built-in, slightly larger “Real World Datasets” that can be downloaded through the scikit-learn API, and simulated or “generated” datasets created with random variables for understanding multiple machine learning algorithms.

Let us import ‘datasets’ from sklearn.

# load datasets package from scikit-learn
from sklearn import datasets

Then we can use the dir() function to check all the attributes associated with datasets. We are mainly interested in the names of the datasets that are part of the datasets package.

dir(datasets)

This gives us a long list of attributes in datasets, including all the dataset accessor names.

Load Toy Datasets in sklearn

To see the list of the “Toy Datasets” in the datasets package, we use a list comprehension to filter the dataset names that start with “load”. This gives us the list of built-in datasets available in scikit-learn.

[data for data in dir(datasets) if data.startswith("load")]

['load_boston',
 'load_breast_cancer',
 'load_diabetes',
 'load_digits',
 'load_files',
 'load_iris',
 'load_linnerud',
 'load_sample_image',
 'load_sample_images',
 'load_svmlight_file',
 'load_svmlight_files',
 'load_wine']

Each of the above is a built-in dataset.

How to Load a “Toy Dataset” in scikit-learn

Now that we know the list of all toy datasets readily available in sklearn, let us see how to load or access one of the datasets.

Let us see how to load the classic iris dataset using the load_iris() function from the “datasets” package.

iris = datasets.load_iris()

Scikit-learn stores each dataset in a dictionary-like structure. We can look at the attributes of the iris dataset using the dir() function as before.

dir(iris)

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

Since it is a dictionary-like object, we can access each of its attributes, like DESCR, data and target, using the “dot” operator or using square bracket notation.
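
Both notations return the same underlying object. Here is a quick sanity check (just an illustration):

# attribute access and square bracket access return the same object
iris.data is iris['data']

True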

For example, we can get the description of the data using iris.DESCR (or iris[‘DESCR’]).

print(iris.DESCR)


.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

To access the data, we use iris[‘data’], which gives the data as a 2D NumPy array.

iris['data'][0:5,]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
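
We can also check the dimensions of the data; as the description above notes, there are 150 instances with 4 attributes each.

# check the dimensions of the data array
iris['data'].shape

(150, 4)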

By using iris[‘feature_names’], we can get the feature names or the column names of the data.

iris['feature_names']

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

Similarly, we get the target group by using iris[‘target’].

iris['target']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

And we can get the names of the target groups using iris[‘target_names’] as shown below.

iris['target_names']

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
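
Often it is convenient to have the data as a dataframe instead. Here is a small sketch, assuming Pandas is installed, that combines data, feature_names and target into a single Pandas dataframe:

import pandas as pd

# combine the data array and the column names into a dataframe
iris_df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
# add the target group as an additional column
iris_df['species'] = iris['target']
iris_df.head()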

List of all Real World Datasets available in sklearn

Similarly, we can see the list of all the larger “Real World” datasets available in the datasets package by filtering the names starting with “fetch”. These are slightly bigger datasets that we can download by name using scikit-learn’s datasets API.

[data for data in dir(datasets) if data.startswith("fetch")]

['fetch_20newsgroups',
 'fetch_20newsgroups_vectorized',
 'fetch_california_housing',
 'fetch_covtype',
 'fetch_kddcup99',
 'fetch_lfw_pairs',
 'fetch_lfw_people',
 'fetch_olivetti_faces',
 'fetch_openml',
 'fetch_rcv1',
 'fetch_species_distributions']

How to Load a “Real World Dataset” in scikit-learn

For example, to download the California housing dataset, we use fetch_california_housing(), which gives the data in a similar dictionary-like structure.

ca_housing = datasets.fetch_california_housing()

We can see the list of all the attributes using dir() function as before.

dir(ca_housing)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

And we can access the data using either “dot” notation or square bracket notation. The data is stored as a NumPy array.

ca_housing['data'][0:3,]

array([[ 8.32520000e+00,  4.10000000e+01,  6.98412698e+00,
         1.02380952e+00,  3.22000000e+02,  2.55555556e+00,
         3.78800000e+01, -1.22230000e+02],
       [ 8.30140000e+00,  2.10000000e+01,  6.23813708e+00,
         9.71880492e-01,  2.40100000e+03,  2.10984183e+00,
         3.78600000e+01, -1.22220000e+02],
       [ 7.25740000e+00,  5.20000000e+01,  8.28813559e+00,
         1.07344633e+00,  4.96000000e+02,  2.80225989e+00,
         3.78500000e+01, -1.22240000e+02]])

The attribute “feature_names” gives us the column names of the dataset.

ca_housing['feature_names']

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

ca_housing['target']

array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

ca_housing['target_names']

['MedHouseVal']
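
If we prefer a dataframe here as well, fetch_california_housing() also accepts an as_frame=True argument (available in recent scikit-learn versions), which adds the combined data and target as a Pandas dataframe under the “frame” attribute:

# fetch the same dataset, but with data stored as a Pandas dataframe
ca_housing_df = datasets.fetch_california_housing(as_frame=True)
ca_housing_df['frame'].head()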

List of all Simulated Datasets available in sklearn

In addition to the toy datasets and real-world datasets, sklearn also has numerous simulated datasets that are useful for learning and testing a variety of machine learning algorithms. All of these “generated” dataset names start with “make”.
Here is the list of all simulated datasets available in Scikit-learn.

[data for data in dir(datasets) if data.startswith("make")]

['make_biclusters',
 'make_blobs',
 'make_checkerboard',
 'make_circles',
 'make_classification',
 'make_friedman1',
 'make_friedman2',
 'make_friedman3',
 'make_gaussian_quantiles',
 'make_hastie_10_2',
 'make_low_rank_matrix',
 'make_moons',
 'make_multilabel_classification',
 'make_regression',
 'make_s_curve',
 'make_sparse_coded_signal',
 'make_sparse_spd_matrix',
 'make_sparse_uncorrelated',
 'make_spd_matrix',
 'make_swiss_roll']

How to Get Simulated Data in scikit-learn

Let us see a quick example of generating simulated data with make_regression(). Here we generate 20 data points with noise and store them as X, Y and coef.

# simulate a regression dataset with a single feature and added noise
X, Y, coef = datasets.make_regression(n_samples=20,      # number of data points
                                      n_features=1,      # number of features
                                      n_informative=1,   # number of informative features
                                      noise=10,          # standard deviation of the gaussian noise
                                      coef=True,         # also return the underlying coefficient
                                      random_state=0)    # for reproducibility

Our data looks like this.

X

array([[-0.15135721],
       [ 0.40015721],
       [ 0.97873798],
       [-0.85409574],
       [-0.97727788],
       [ 0.3130677 ],
       [-0.10321885],
       [-0.20515826],
       [ 0.33367433],
       [ 1.49407907],
       [ 0.95008842],
       [ 0.12167502],
       [ 1.45427351],
       [ 1.86755799],
       [ 0.14404357],
       [ 0.4105985 ],
       [ 0.76103773],
       [ 2.2408932 ],
       [ 0.44386323],
       [ 1.76405235]])

Y

array([-1.69610717, 12.54205757, -1.60443615, -5.84638325,  1.13431316,
       -6.37007753, 13.1477283 , -7.56606655, -0.91184146, 23.17198001,
       10.28925578, 15.69897406, 22.34013972, 24.35056259,  7.72931233,
       21.2363558 ,  0.12694595, 26.45696448, 24.23776581, 25.62265958])

coef

array(14.33532874)
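
In the same way we can simulate classification-style data. For example, make_blobs() generates clustered points that are handy for trying out clustering and classification algorithms (the parameter values below are just for illustration):

# generate 20 two-dimensional data points grouped into three clusters
X, y = datasets.make_blobs(n_samples=20,
                           n_features=2,
                           centers=3,
                           random_state=0)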

Scikit-learn datasets using fetch_openml()

Another way to get data is to use fetch_openml(). Here is an example of downloading housing data using fetch_openml().

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
dir(housing)

['DESCR',
 'categories',
 'data',
 'details',
 'feature_names',
 'frame',
 'target',
 'target_names',
 'url']

One of the advantages of getting data with fetch_openml() is that we get the data as a Pandas dataframe.

housing['data'].head()

Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
0	1.0	60.0	RL	65.0	8450.0	Pave	None	Reg	Lvl	AllPub	...	0.0	0.0	None	None	None	0.0	2.0	2008.0	WD	Normal
1	2.0	20.0	RL	80.0	9600.0	Pave	None	Reg	Lvl	AllPub	...	0.0	0.0	None	None	None	0.0	5.0	2007.0	WD	Normal
2	3.0	60.0	RL	68.0	11250.0	Pave	None	IR1	Lvl	AllPub	...	0.0	0.0	None	None	None	0.0	9.0	2008.0	WD	Normal
3	4.0	70.0	RL	60.0	9550.0	Pave	None	IR1	Lvl	AllPub	...	0.0	0.0	None	None	None	0.0	2.0	2006.0	WD	Abnorml
4	5.0	60.0	RL	84.0	14260.0	Pave	None	IR1	Lvl	AllPub	...	0.0	0.0	None	None	None	0.0	12.0	2008.0	WD	Normal
5 rows × 80 columns
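
Similarly, the target comes back as a Pandas series; for this dataset the target is the sale price of each house.

# the target is a Pandas series when as_frame=True
housing['target'].head()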