Sometimes you may want to loop over a Pandas dataframe and perform an operation on each row. Pandas has at least two options for iterating over the rows of a dataframe.
Let us see examples of how to loop through a Pandas dataframe. First we will use the Pandas iterrows() function to iterate over the rows of a dataframe. In addition to iterrows(), Pandas also has a useful function called itertuples(), and we will see examples of using it to iterate over rows as well. There are subtle differences between the two, which we will also look at.
Let us use an interesting dataset available in vega_datasets in Python.
# import vega_datasets
from vega_datasets import data
# import pandas
import pandas as pd
Let us see the available datasets in vega_datasets and use flights_2k data set.
# check to see the list of data sets
data.list_datasets()
flights = data.flights_2k()
It contains flight departure, arrival, and distance information for 2000 flights.
flights.head()
                 date  delay destination  distance origin
0 2001-01-14 21:55:00      0         SMF       480    SAN
1 2001-03-26 20:15:00    -11         SLC       507    PHX
2 2001-03-05 14:55:00     -3         LAX       714    ELP
How to Iterate Through Rows with Pandas iterrows()
Pandas has an iterrows() function that helps you loop through each row of a dataframe. iterrows() returns an iterator that yields the index of each row together with the row's data as a Series.
Since iterrows() returns an iterator, we can use the next() function to see its first item. We can see that iterrows() returns a tuple with the row index and the row data as a Series object.
>next(flights.iterrows())
(0, date           2001-01-14 21:55:00
 delay                            0
 destination                    SMF
 distance                       480
 origin                         SAN
 Name: 0, dtype: object)
We can get the row's content by taking the second element of the tuple.
row = next(flights.iterrows())[1]
row
date           2001-01-14 21:55:00
delay                            0
destination                    SMF
distance                       480
origin                         SAN
Name: 0, dtype: object
We can loop through a Pandas dataframe and access the index and the content of each row easily. Here we print each item yielded by iterrows() and see that we get an index and a Series for each row.
for index, row in flights.head(n=2).iterrows():
    print(index, row)

0 date           2001-01-14 21:55:00
delay                            0
destination                    SMF
distance                       480
origin                         SAN
Name: 0, dtype: object
1 date           2001-03-26 20:15:00
delay                          -11
destination                    SLC
distance                       507
origin                         PHX
Name: 1, dtype: object
Since the row data is returned as a Series, we can use the column names to access each column’s value in the row. Here we loop through each row and we assign the row index and row data to variables named index and row. Then we access row data using the column names of the dataframe.
# iterate over rows with iterrows()
for index, row in flights.head().iterrows():
    # access data using column names
    print(index, row['delay'], row['distance'], row['origin'])

0 0 480 SAN
1 -11 507 PHX
2 -3 714 ELP
3 12 342 SJC
4 2 373 SMF
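Printing is only the simplest thing we can do inside the loop. As a minimal sketch of doing an operation on each row, the snippet below collects the flights that left late. To keep it runnable without vega_datasets, it rebuilds the five rows printed above by hand; the names flights_sample and late_flights are made up for this illustration.

```python
import pandas as pd

# Hand-built stand-in for flights.head(), copied from the output above
flights_sample = pd.DataFrame({
    "delay": [0, -11, -3, 12, 2],
    "distance": [480, 507, 714, 342, 373],
    "origin": ["SAN", "PHX", "ELP", "SJC", "SMF"],
})

# Collect (index, origin) for every row with a positive delay
late_flights = []
for index, row in flights_sample.iterrows():
    if row["delay"] > 0:
        late_flights.append((index, row["origin"]))

print(late_flights)
```

For a simple filter like this, a boolean mask such as flights_sample[flights_sample["delay"] > 0] would usually be the more idiomatic choice; the loop is shown only to illustrate row-wise access.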
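Because iterrows() returns a Series for each row, it does not preserve data types across the rows: a Series has a single dtype, so values coming from columns of different types are converted to a common type. Data types are preserved across columns of a DataFrame, however. Let us see a simple example illustrating this.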
Let us create a simple data frame with one row with two columns, where one column is an int and the other is a float.
>df = pd.DataFrame([[3, 5.5]], columns=['int_column', 'float_column'])
>print(df)
   int_column  float_column
0           3           5.5
Let us use iterrows() to get the content of the row and print the data type of int_column. In the original dataframe, int_column is an integer. However, when we look at the data type through iterrows(), the value from int_column is a float.
>row = next(df.iterrows())[1]
>print(row['int_column'].dtype)
float64
How to Iterate Over Rows of Pandas Dataframe with itertuples()
A better way to loop through the rows of a Pandas dataframe is to use the itertuples() function available in Pandas. As the name suggests, itertuples() loops through the rows of a dataframe and returns each row as a named tuple.
The first element of the tuple is the row's index and the remaining values of the tuple are the data in the row. Unlike iterrows(), the row data is not stored in a Series.
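One consequence of not going through a Series is that each value keeps the type of its own column. The sketch below reuses the one-row dataframe from the iterrows() section and compares the two approaches; the variable names series_row and tuple_row are just for this illustration.

```python
import pandas as pd

# Same one-row dataframe as in the iterrows() section above
df = pd.DataFrame([[3, 5.5]], columns=['int_column', 'float_column'])

# iterrows(): the row is a Series, so both values share one dtype
series_row = next(df.iterrows())[1]
print(type(series_row['int_column']))

# itertuples(): each value is taken from its own column
tuple_row = next(df.itertuples())
print(tuple_row)
print(type(tuple_row.int_column))
```

With iterrows() the 3 comes back as a float, while with itertuples() it should still be an integer.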
Let us loop through the content of the dataframe and print each row with itertuples().
for row in flights.head().itertuples():
    print(row)

Pandas(Index=0, date=Timestamp('2001-01-14 21:55:00'), delay=0, destination='SMF', distance=480, origin='SAN')
Pandas(Index=1, date=Timestamp('2001-03-26 20:15:00'), delay=-11, destination='SLC', distance=507, origin='PHX')
Pandas(Index=2, date=Timestamp('2001-03-05 14:55:00'), delay=-3, destination='LAX', distance=714, origin='ELP')
Pandas(Index=3, date=Timestamp('2001-01-07 12:30:00'), delay=12, destination='SNA', distance=342, origin='SJC')
Pandas(Index=4, date=Timestamp('2001-01-18 12:00:00'), delay=2, destination='LAX', distance=373, origin='SMF')
We can see that itertuples() simply returns the content of each row as a named tuple with the associated column names. Therefore we can access the data by column name, and the row index as Index, like this:
for row in flights.head().itertuples():
    print(row.Index, row.date, row.delay)
We will get each row as
0 2001-01-14 21:55:00 0
1 2001-03-26 20:15:00 -11
2 2001-03-05 14:55:00 -3
3 2001-01-07 12:30:00 12
4 2001-01-18 12:00:00 2
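itertuples() also takes two optional arguments, index and name, that change what each row looks like. As a small sketch, the example below uses a hand-built two-row dataframe (copied from the flights output shown earlier, with the made-up name sample) to show rows without the Index field and rows as plain tuples.

```python
import pandas as pd

# Two rows copied from the flights output shown earlier in this post
sample = pd.DataFrame({
    'delay': [0, -11],
    'destination': ['SMF', 'SLC'],
    'origin': ['SAN', 'PHX'],
})

# index=False drops the Index field from each named tuple
for row in sample.itertuples(index=False):
    print(row.origin, row.destination, row.delay)

# name=None yields plain tuples, which can be unpacked directly
for delay, destination, origin in sample.itertuples(index=False, name=None):
    print(origin, destination, delay)

# Materialize the plain tuples as a list
plain_rows = list(sample.itertuples(index=False, name=None))
```

Plain tuples are handy when we only need the values in column order and do not care about accessing them by name.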
Another benefit of itertuples is that it is generally faster than iterrows().
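If you want to check the speed difference on your own machine, here is a rough sketch using Python's timeit module. The dataframe is synthetic (2000 rows, to match the size of flights_2k, with made-up values), and the two helper functions are hypothetical names for this illustration. The exact timings will depend on your machine and Pandas version.

```python
import timeit

import pandas as pd

# Synthetic data sized like flights_2k (2000 rows); the values are made up
df = pd.DataFrame({
    "delay": range(2000),
    "distance": [d * 1.5 for d in range(2000)],
})


def total_delay_iterrows(frame):
    """Sum the delay column by looping with iterrows()."""
    total = 0
    for _, row in frame.iterrows():
        total += row["delay"]
    return total


def total_delay_itertuples(frame):
    """Sum the delay column by looping with itertuples()."""
    total = 0
    for row in frame.itertuples():
        total += row.delay
    return total


# Time three full passes over the dataframe with each approach
t_rows = timeit.timeit(lambda: total_delay_iterrows(df), number=3)
t_tuples = timeit.timeit(lambda: total_delay_itertuples(df), number=3)
print(f"iterrows:   {t_rows:.4f} s")
print(f"itertuples: {t_tuples:.4f} s")
```

Both functions compute the same total, so the comparison only measures the cost of the iteration itself.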