
Python Built-in Datasets

November 20, 2021 by cmdlinetips

Scikit-learn Datasets
Scikit-learn, a machine learning toolkit in Python, offers a number of datasets ready to use for learning ML and developing new methodologies. If you are new to sklearn, it may be a little harder to wrap your head around which datasets are available, what information each dataset contains, and how to access them. scikit-learn's user guide has a great section on the datasets. Here is a quick summary of the available datasets and how to get started using them quickly.

First, let us import scikit-learn and verify its version. Here we have sklearn v1.0.

# import scikit-learn
import sklearn

# Check the version of sklearn
sklearn.__version__

'1.0'

Scikit-learn's "datasets" package offers us ways to get datasets from sklearn. Broadly, there are three categories of datasets: small "Toy Datasets" that are built in, slightly larger "Real World Datasets" that can be downloaded through the scikit-learn API, and simulated datasets "generated" from random variables, useful for understanding a variety of machine learning algorithms.

Let us import ‘datasets’ from sklearn.

# load datasets package from scikit-learn
from sklearn import datasets

Then we can use the dir() function to check all the attributes associated with datasets. We are mainly interested in the names of the datasets that are part of the package.

dir(datasets)

This gives us a long list of attributes in datasets, including all the dataset accessor names.

Load Toy Datasets in sklearn

To see the list of "Toy Datasets" in the datasets package, we use a list comprehension to filter the attribute names that start with "load". This gives us the list of built-in datasets available in scikit-learn.

[data for data in dir(datasets) if data.startswith("load")]

['load_boston',
 'load_breast_cancer',
 'load_diabetes',
 'load_digits',
 'load_files',
 'load_iris',
 'load_linnerud',
 'load_sample_image',
 'load_sample_images',
 'load_svmlight_file',
 'load_svmlight_files',
 'load_wine']

Each of the above is a built-in dataset.
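As a quick sanity check, here is a minimal sketch that loads a few of these built-in datasets (using the datasets module we imported above) and prints the dimensions of their data matrices; note that the exact list of loaders can vary slightly across sklearn versions.

# load a few of the toy datasets and print the shape of their data matrices
for loader in [datasets.load_iris, datasets.load_wine, datasets.load_digits]:
    dataset = loader()
    print(loader.__name__, dataset.data.shape)

This should print shapes like (150, 4) for iris, one row per sample and one column per feature.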

How to Load a "Toy Dataset" in scikit-learn

Now that we know the list of all toy datasets readily available in sklearn, let us see how to load or access one of the datasets.

Let us see how to load the classic iris dataset using the load_iris() function from the datasets package.

iris = datasets.load_iris()

Scikit-learn stores each dataset in a dictionary-like structure (a Bunch object). We can look at the attributes of the iris dataset using the dir() function as before.

dir(iris)

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

Since it is a dictionary-like object, we can access each of the attributes, like DESCR, data, and target, using either the "dot" operator or square bracket notation.

For example, we can get the description of the data using iris.DESCR (or iris['DESCR']).

print(iris.DESCR)


.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

To access the data, we use iris['data'], which gives the data as a 2D NumPy array.

iris['data'][0:5,]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

By using iris['feature_names'], we can get the feature names, or the column names of the data.

iris['feature_names']

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

Similarly, we get the target variable using iris['target'].

iris['target']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

And we can get the names of the target groups using iris['target_names'] as shown below.

iris['target_names']

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
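Because the returned object is dictionary-like, a common next step is to assemble the data and target into a single Pandas dataframe. Here is a minimal sketch; the column name "species" is our own choice, and recent sklearn versions can also return dataframes directly via load_iris(as_frame=True).

# combine the iris features and target labels into one Pandas dataframe
import pandas as pd

iris_df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
iris_df['species'] = iris['target_names'][iris['target']]
iris_df.head()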

List of all Real World Datasets available in sklearn

Similarly, we can see the list of all the larger "Real World" datasets available in the datasets package by filtering for names starting with "fetch". These are slightly bigger datasets, and we can download them by name through scikit-learn's datasets API.

[data for data in dir(datasets) if data.startswith("fetch")]

['fetch_20newsgroups',
 'fetch_20newsgroups_vectorized',
 'fetch_california_housing',
 'fetch_covtype',
 'fetch_kddcup99',
 'fetch_lfw_pairs',
 'fetch_lfw_people',
 'fetch_olivetti_faces',
 'fetch_openml',
 'fetch_rcv1',
 'fetch_species_distributions']

How to Load a “Real World Dataset” in scikit-learn

For example, to download the California housing dataset, we use fetch_california_housing(), which gives the data in a similar dictionary-like structure.

ca_housing = datasets.fetch_california_housing()

We can see the list of all the attributes using dir() function as before.

dir(ca_housing)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

We can access the data using either "dot" notation or square bracket notation. The data is stored as a NumPy array.

ca_housing['data'][0:3,]

array([[ 8.32520000e+00,  4.10000000e+01,  6.98412698e+00,
         1.02380952e+00,  3.22000000e+02,  2.55555556e+00,
         3.78800000e+01, -1.22230000e+02],
       [ 8.30140000e+00,  2.10000000e+01,  6.23813708e+00,
         9.71880492e-01,  2.40100000e+03,  2.10984183e+00,
         3.78600000e+01, -1.22220000e+02],
       [ 7.25740000e+00,  5.20000000e+01,  8.28813559e+00,
         1.07344633e+00,  4.96000000e+02,  2.80225989e+00,
         3.78500000e+01, -1.22240000e+02]])

The attribute “feature_names” gives us the column names of the dataset.

ca_housing['feature_names']

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

ca_housing['target']

array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

ca_housing['target_names']

['MedHouseVal']
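If you prefer working with Pandas dataframes instead of NumPy arrays, fetch_california_housing() also accepts an as_frame=True argument (available in recent sklearn versions, including v1.0). A quick sketch:

# fetch the same dataset as Pandas objects instead of NumPy arrays
ca_housing_df = datasets.fetch_california_housing(as_frame=True)

# with as_frame=True, the "frame" attribute holds features and target together
ca_housing_df['frame'].head()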

List of all simulated Datasets available in sklearn

In addition to the toy datasets and real-world datasets, sklearn also has numerous simulated datasets that are useful for learning and testing a variety of machine learning algorithms. All of these "generated" dataset names start with "make".
Here is the list of all simulated datasets available in scikit-learn.

[data for data in dir(datasets) if data.startswith("make")]

['make_biclusters',
 'make_blobs',
 'make_checkerboard',
 'make_circles',
 'make_classification',
 'make_friedman1',
 'make_friedman2',
 'make_friedman3',
 'make_gaussian_quantiles',
 'make_hastie_10_2',
 'make_low_rank_matrix',
 'make_moons',
 'make_multilabel_classification',
 'make_regression',
 'make_s_curve',
 'make_sparse_coded_signal',
 'make_sparse_spd_matrix',
 'make_sparse_uncorrelated',
 'make_spd_matrix',
 'make_swiss_roll']

How to Get Simulated Data in scikit-learn

Let us see a quick example of generating one of the simulated datasets with make_regression(). Here we generate 20 data points with noise and store the result as X, Y, and coef.

X, Y, coef = sklearn.datasets.make_regression(n_samples=20,
    n_features=1,
    n_informative=1,
    noise=10,
    coef=True,
    random_state=0)

Our data looks like this.


X

array([[-0.15135721],
       [ 0.40015721],
       [ 0.97873798],
       [-0.85409574],
       [-0.97727788],
       [ 0.3130677 ],
       [-0.10321885],
       [-0.20515826],
       [ 0.33367433],
       [ 1.49407907],
       [ 0.95008842],
       [ 0.12167502],
       [ 1.45427351],
       [ 1.86755799],
       [ 0.14404357],
       [ 0.4105985 ],
       [ 0.76103773],
       [ 2.2408932 ],
       [ 0.44386323],
       [ 1.76405235]])

Y

array([-1.69610717, 12.54205757, -1.60443615, -5.84638325,  1.13431316,
       -6.37007753, 13.1477283 , -7.56606655, -0.91184146, 23.17198001,
       10.28925578, 15.69897406, 22.34013972, 24.35056259,  7.72931233,
       21.2363558 ,  0.12694595, 26.45696448, 24.23776581, 25.62265958])

coef

array(14.33532874)
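Since make_regression() returns the true coefficient, a quick way to see the simulated data in action is to fit a simple linear regression and check that the estimated coefficient lands near the true one (they will not match exactly because of the added noise). A minimal sketch:

from sklearn.linear_model import LinearRegression

# fit a linear model to the simulated data and compare the estimated
# coefficient against the true coefficient returned by make_regression()
lr = LinearRegression()
lr.fit(X, Y)
print(lr.coef_, coef)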

Scikit-learn datasets using fetch_openml()

Another way to get data is to use fetch_openml(). Here is an example downloading house price data using fetch_openml().

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)

dir(housing)

['DESCR',
 'categories',
 'data',
 'details',
 'feature_names',
 'frame',
 'target',
 'target_names',
 'url']

One of the advantages of getting data with fetch_openml() is that we get the data as a Pandas dataframe.

housing['data'].head()

Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
0	1.0	60.0	RL	65.0	8450.0	Pave	None	Reg	Lvl	AllPub	...	0.0	0.0	None	None	None	0.0	2.0	2008.0	WD	Normal
1	2.0	20.0	RL	80.0	9600.0	Pave	None	Reg	Lvl	AllPub	...	0.0	0.0	None	None	None	0.0	5.0	2007.0	WD	Normal
2	3.0	60.0	RL	68.0	11250.0	Pave	None	IR1	Lvl	AllPub	...	0.0	0.0	None	None	None	0.0	9.0	2008.0	WD	Normal
3	4.0	70.0	RL	60.0	9550.0	Pave	None	IR1	Lvl	AllPub	...	0.0	0.0	None	None	None	0.0	2.0	2006.0	WD	Abnorml
4	5.0	60.0	RL	84.0	14260.0	Pave	None	IR1	Lvl	AllPub	...	0.0	0.0	None	None	None	0.0	12.0	2008.0	WD	Normal
5 rows × 80 columns
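With as_frame=True, the target comes back as a Pandas Series, and the "frame" attribute combines the features and the target into a single dataframe. For example:

# the target (sale price) is a Pandas Series
housing['target'].head()

# and "frame" holds features and target together in one dataframe
housing['frame'].shape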

