Python and R Tips

Learn Data Science with Python and R

vega_datasets: A Python Package for Datasets

April 13, 2018 by cmdlinetips

When you are learning the basics of data science or trying out a new machine learning algorithm, you need a suitable real-world dataset. Getting a dataset into the right format can be tricky, and it is easy to spend too much time searching for, downloading, and cleaning it.

Jake Vanderplas, the author of Python Data Science Handbook: Essential Tools for Working with Data, has solved the dataset problem with a relatively new package called vega_datasets. It is a Python package that can be installed with pip and provides access to over 60 datasets of varying sizes. When you load a dataset from the package, you get a pandas dataframe.

How To Install Vega Datasets?

vega_datasets can be easily installed using pip.

pip install vega_datasets

After installing, you can import it to have access to all the datasets.

from vega_datasets import data
import pandas as pd

Some datasets ship with the package and can be loaded without an internet connection. You can see which ones are available locally with local_data:

from vega_datasets import local_data
local_data.list_datasets()
['airports',
 'anscombe',
 'barley',
 'burtin',
 'cars',
 'crimea',
 'driving',
 'iris',
 'seattle-temps',
 'seattle-weather',
 'sf-temps',
 'stocks']

You can also list every available dataset, local or not, with data.list_datasets(). There are over 60 datasets in total.

len(data.list_datasets())
62
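
As a small sketch of how the two listings relate, the names from data.list_datasets() can be split into those bundled with the package and those that need a download. The helper names split_local_remote and summarize_vega_datasets are hypothetical and introduced only for this illustration. The sketch assumes both list_datasets() calls return plain lists of strings, as the output above suggests.

```python
def split_local_remote(all_names, local_names):
    """Split dataset names into (bundled with the package, needs download)."""
    local = set(local_names)
    bundled = sorted(name for name in all_names if name in local)
    remote = sorted(name for name in all_names if name not in local)
    return bundled, remote


def summarize_vega_datasets():
    """Print how many datasets are bundled and how many need a connection.

    Requires vega_datasets to be installed; the import is kept inside the
    function so the helper above can be used on its own.
    """
    from vega_datasets import data, local_data

    bundled, remote = split_local_remote(
        data.list_datasets(), local_data.list_datasets()
    )
    print(len(bundled), "bundled;", len(remote), "need an internet connection")
```

Calling summarize_vega_datasets() in a session where vega_datasets is installed prints the two counts for your installed version.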

Let us have a quick look at some of the available vega datasets.

Iris dataset

iris = data.iris()
iris.head()

Gapminder dataset

gapminder = data.gapminder()
gapminder.head()

Airports dataset

airports = data.airports()
print(airports.head())

US State Capitals dataset

capitals = data.us_state_capitals()
print(capitals.head())

San Francisco Temperature

sf_temps = data.sf_temps()
print(sf_temps.head())
   temp                date
0  47.8 2010-01-01 00:00:00
1  47.4 2010-01-01 01:00:00
2  46.9 2010-01-01 02:00:00
3  46.5 2010-01-01 03:00:00
4  46.0 2010-01-01 04:00:00
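
Because the result is an ordinary pandas dataframe with a datetime column, the usual pandas time-series tools apply. Here is a minimal sketch that averages the hourly readings per day. The function name daily_mean_temp is hypothetical, and the stand-in frame is built from the five rows printed above so that the example runs without the package. It assumes the date column is already parsed as datetimes, as the printed output suggests.

```python
import pandas as pd


def daily_mean_temp(df):
    """Average hourly readings per calendar day.

    Assumes `df` has a datetime `date` column and a numeric `temp` column,
    matching the sf_temps output shown above.
    """
    return df.set_index("date")["temp"].resample("D").mean()


# Stand-in frame copied from the five rows printed above; replace it with
# data.sf_temps() to run on the full dataset.
sample = pd.DataFrame(
    {
        "temp": [47.8, 47.4, 46.9, 46.5, 46.0],
        "date": pd.to_datetime(
            [
                "2010-01-01 00:00:00",
                "2010-01-01 01:00:00",
                "2010-01-01 02:00:00",
                "2010-01-01 03:00:00",
                "2010-01-01 04:00:00",
            ]
        ),
    }
)
print(daily_mean_temp(sample))
```

The same function should work unchanged on seattle_temps, since it has the same two columns.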

Seattle Temperature

seattle_temps = data.seattle_temps()
print(seattle_temps.head())
                 date  temp
0 2010-01-01 00:00:00  39.4
1 2010-01-01 01:00:00  39.2
2 2010-01-01 02:00:00  39.0
3 2010-01-01 03:00:00  38.9
4 2010-01-01 04:00:00  38.8

Seattle Weather

seattle_weather = data.seattle_weather()
print(seattle_weather.head(n=3))
        date  precipitation  temp_max  temp_min  wind  weather
0 2012-01-01            0.0      12.8       5.0   4.7  drizzle
1 2012-01-02           10.9      10.6       2.8   4.5     rain
2 2012-01-03            0.8      11.7       7.2   2.3     rain

Flights data

flights_3m = data.flights_3m()
print(flights_3m.head())
print(flights_3m.shape)
      date  delay  distance origin destination
0  1010001     14       405    MCI         MDW
1  1010530    -11       370    LAX         PHX
2  1010540      5       389    ONT         SMF
3  1010600     -5       337    OAK         LAX
4  1010600      3       303    MSY         HOU
(231083, 5)
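
Note that the date column here comes back as an integer, not a timestamp. Judging only from the rows printed above, the values look like month, day, hour, and minute packed together (MMDDHHMM) with the leading zero dropped. That reading is an assumption and is not confirmed anywhere in this post. Under that assumption, a hypothetical helper such as decode_flight_date could unpack them:

```python
def decode_flight_date(value):
    """Split a packed integer such as 1010530 into (month, day, hour, minute).

    Assumption: the value is MMDDHHMM with the leading zero dropped, which is
    inferred from the printed rows and not from the package documentation.
    """
    text = f"{int(value):08d}"  # restore the dropped leading zero
    return int(text[0:2]), int(text[2:4]), int(text[4:6]), int(text[6:8])


print(decode_flight_date(1010530))  # (1, 1, 5, 30)
```

On the dataframe itself, flights_3m["date"].map(decode_flight_date) would apply the same decoding to every row.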
