When you are learning the basics of data science or trying out a new machine learning algorithm, one of the first things you need is a suitable real-world dataset. Getting a dataset into the right format can be tricky, and it is easy to spend too much time searching for, downloading, and cleaning it.
Jake Vanderplas, the author of Python Data Science Handbook: Essential Tools for Working with Data, has addressed this problem with a relatively new package called vega_datasets. It is a Python package that you can install with pip, and it provides access to over 60 datasets of varying sizes. When you load a dataset from the package, you get a pandas DataFrame.
How To Install Vega Datasets?
vega_datasets can be easily installed using pip.
pip install vega_datasets
After installing, you can import it to have access to all the datasets.
from vega_datasets import data
import pandas as pd
Some datasets ship with the package and can be loaded without an internet connection. You can see which datasets are locally available with local_data.
from vega_datasets import local_data
local_data.list_datasets()
['airports', 'anscombe', 'barley', 'burtin', 'cars', 'crimea', 'driving', 'iris', 'seattle-temps', 'seattle-weather', 'sf-temps', 'stocks']
You can also easily find all available datasets with list_datasets(). There are over 60 datasets available in total.
len(data.list_datasets())
62
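The listing shows some names with hyphens, such as sf-temps, while the loaders used later in this post are called with underscores, as in data.sf_temps(). Here is a minimal sketch of two small helpers for working with the listed names. Both loader_name and needs_download are hypothetical names made up for this illustration and are not part of vega_datasets. The local list is copied from the output shown above and may differ in other versions of the package.

```python
# Names copied from the local_data.list_datasets() output shown in this post.
# Assumption: this list may differ between versions of vega_datasets.
LOCAL_NAMES = [
    "airports", "anscombe", "barley", "burtin", "cars", "crimea",
    "driving", "iris", "seattle-temps", "seattle-weather", "sf-temps", "stocks",
]


def loader_name(dataset_name):
    """Hypothetical helper: map a listed name to the attribute-style name.

    For example, 'sf-temps' in the listing corresponds to data.sf_temps().
    """
    return dataset_name.replace("-", "_")


def needs_download(dataset_name, local_names=LOCAL_NAMES):
    """Hypothetical helper: True if the name is not in the local listing."""
    return dataset_name not in local_names


for name in ["sf-temps", "seattle-weather", "gapminder"]:
    print(name, "->", loader_name(name), "| needs download:", needs_download(name))
```

With real data you would pass data.list_datasets() and local_data.list_datasets() to these helpers instead of the hard-coded list.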
Let us have a quick look at some of the available vega datasets.
Iris dataset
iris = data.iris()
iris.head()
Gapminder dataset
gapminder = data.gapminder()
gapminder.head()
Airports dataset
airports = data.airports()
print(airports.head())
US State Capital dataset
capitals = data.us_state_capitals()
San Francisco Temperature
sf_temps = data.sf_temps()
print(sf_temps.head())
   temp                date
0  47.8 2010-01-01 00:00:00
1  47.4 2010-01-01 01:00:00
2  46.9 2010-01-01 02:00:00
3  46.5 2010-01-01 03:00:00
4  46.0 2010-01-01 04:00:00
Seattle Temperature
seattle_temps = data.seattle_temps()
print(seattle_temps.head())
                 date  temp
0 2010-01-01 00:00:00  39.4
1 2010-01-01 01:00:00  39.2
2 2010-01-01 02:00:00  39.0
3 2010-01-01 03:00:00  38.9
4 2010-01-01 04:00:00  38.8
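Since the San Francisco and Seattle temperature tables share hourly timestamps, one natural next step is to line them up on the date column. The sketch below does this with pandas merge. To keep it runnable without the package or a network connection, it builds two tiny frames from the first three rows printed above. With the real data you would use data.sf_temps() and data.seattle_temps() instead, assuming their date columns are already parsed as datetimes, as the printed output suggests.

```python
import pandas as pd

# First three rows copied from the printed output above (stand-ins for
# data.sf_temps() and data.seattle_temps()).
hours = pd.to_datetime([
    "2010-01-01 00:00:00",
    "2010-01-01 01:00:00",
    "2010-01-01 02:00:00",
])
sf_temps = pd.DataFrame({"temp": [47.8, 47.4, 46.9], "date": hours})
seattle_temps = pd.DataFrame({"date": hours, "temp": [39.4, 39.2, 39.0]})

# Join on the shared timestamp; suffixes keep the two temp columns apart.
both = sf_temps.merge(seattle_temps, on="date", suffixes=("_sf", "_seattle"))

# Hourly difference between the two cities.
both["diff"] = both["temp_sf"] - both["temp_seattle"]
print(both)
```

The default inner join keeps only timestamps present in both tables, which is usually what you want when comparing two series hour by hour.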
Seattle Weather
seattle_weather = data.seattle_weather()
print(seattle_weather.head(n=3))
        date  precipitation  temp_max  temp_min  wind  weather
0 2012-01-01            0.0      12.8       5.0   4.7  drizzle
1 2012-01-02           10.9      10.6       2.8   4.5     rain
2 2012-01-03            0.8      11.7       7.2   2.3     rain
Flights data
flights_3m = data.flights_3m()
print(flights_3m.head())
print(flights_3m.shape)
      date  delay  distance origin destination
0  1010001     14       405    MCI         MDW
1  1010530    -11       370    LAX         PHX
2  1010540      5       389    ONT         SMF
3  1010600     -5       337    OAK         LAX
4  1010600      3       303    MSY         HOU
(231083, 5)
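Unlike the other tables, the date column here is printed as a plain integer such as 1010530. A plausible reading, and it is only an assumption, is that the digits encode month, day, hour, and minute (MMDDhhmm) with the leading zero dropped. The sketch below decodes values under that assumption. The function decode_flight_date is a hypothetical helper, and the year is not part of the value, so it is passed in as a placeholder parameter.

```python
from datetime import datetime


def decode_flight_date(code, year=2001):
    """Hypothetical helper: decode an integer like 1010530.

    Assumption: the value is MMDDhhmm with the leading zero dropped.
    The year is not stored in the value; the default is a placeholder.
    """
    text = f"{int(code):08d}"  # restore the leading zero, e.g. '01010530'
    month, day, hour, minute = (int(text[i:i + 2]) for i in range(0, 8, 2))
    return datetime(year, month, day, hour, minute)


# Values taken from the date column printed above.
for code in [1010001, 1010530, 1010540, 1010600]:
    print(code, "->", decode_flight_date(code))
```

If the assumption holds, applying this function to the date column would give proper timestamps that can be used for grouping by hour or day.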