Missing data is a common problem in data analysis. Sometimes you might want to remove the missing data, and one approach is to remove rows containing missing values. In this post we will see examples of removing rows containing missing values using the tidyverse in R.
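We will use the drop_na() function to remove rows that contain missing data. Note that drop_na() comes from the tidyr package, which is loaded together with dplyr as part of tidyverse. Let us load tidyverse first.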
library("tidyverse")
As in other tidyverse 101 examples, we will use the fantastic Penguins dataset to illustrate how to remove rows with missing values from a dataframe. Let us load the data from the cmdlinetips.com GitHub page.
path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
penguins <- readr::read_csv(path2data)
Let us move the sex column, which has a number of missing values, to the front using dplyr’s relocate() function.
# move sex column to first
penguins <- penguins %>%
  relocate(sex)
We can see that our data frame has 344 rows in total and a number of rows have missing values. Note that the fourth row has missing values for most of the columns, and they are represented as “NA”.
penguins
## # A tibble: 344 x 7
##   sex   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##   <chr> <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
## 1 male  Adelie  Torge…           39.1          18.7              181        3750
## 2 fema… Adelie  Torge…           39.5          17.4              186        3800
## 3 fema… Adelie  Torge…           40.3          18                195        3250
## 4 <NA>  Adelie  Torge…           NA            NA                 NA          NA
## 5 fema… Adelie  Torge…           36.7          19.3              193        3450
## 6 male  Adelie  Torge…           39.3          20.6              190        3650
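Before dropping anything, it can help to see how many values are missing in each column. The following is a minimal sketch, assuming a small made-up tibble named toy as a stand-in for the penguins data (its values are illustrative, not taken from the real dataset). It counts missing values per column with summarise() and across(), and counts incomplete rows with base R's complete.cases().

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for penguins (values are made up)
toy <- tibble(
  sex            = c("male", "female", NA, NA),
  species        = c("Adelie", "Adelie", "Adelie", "Adelie"),
  bill_length_mm = c(39.1, 39.5, NA, 34.1),
  body_mass_g    = c(3750, 3800, NA, 3475)
)

# Number of missing values in each column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Number of rows that have at least one missing value
n_incomplete <- sum(!complete.cases(toy))
n_incomplete
```

The same two expressions should work unchanged on the penguins data frame by replacing toy with penguins.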
Let us use the drop_na() function (from tidyr) to remove rows that contain at least one missing value.
penguins %>% drop_na()
Now our resulting data frame contains 333 rows after removing rows with missing values. Note that the fourth row in our original dataframe had missing values and now it is removed.
## # A tibble: 333 x 7
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
## 1 Adelie  Torge…           39.1          18.7              181        3750
## 2 Adelie  Torge…           39.5          17.4              186        3800
## 3 Adelie  Torge…           40.3          18                195        3250
## 4 Adelie  Torge…           36.7          19.3              193        3450
## 5 Adelie  Torge…           39.3          20.6              190        3650
## 6 Adelie  Torge…           38.9          17.8              181        3625
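To make clear what drop_na() does when called without arguments, here is a small sketch comparing it with an explicit filter on complete rows. The tibble toy is a hypothetical stand-in with made-up values, and the comparison with complete.cases() is our own illustration rather than part of the original example.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data standing in for penguins (values are made up)
toy <- tibble(
  sex            = c("male", "female", NA, NA),
  bill_length_mm = c(39.1, 39.5, NA, 34.1),
  body_mass_g    = c(3750, 3800, NA, 3475)
)

# Drop every row that has a missing value in any column
dropped <- toy %>%
  drop_na()

# Keep only complete rows using base R's complete.cases()
filtered <- toy %>%
  filter(complete.cases(.))

# Number of rows removed by drop_na()
n_removed <- nrow(toy) - nrow(dropped)
n_removed
```

Both approaches keep the same two complete rows here; drop_na() is simply the shorter way to write it.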
How to Remove Rows Based on Missing Values in a Column?
Sometimes you might want to remove rows based on missing values in one or more specific columns of the dataframe. To remove rows based on missing values in a single column, pass the column name to drop_na().
penguins %>% drop_na(bill_length_mm)
We have removed the rows based on missing values in the bill_length_mm column. In contrast to the above example, the resulting dataframe still contains missing values in other columns. In this example, we can see missing values in the sex column in rows 8 to 10.
## # A tibble: 342 x 7
##    sex   species island bill_length_mm bill_depth_mm flipper_length_…
##    <chr> <chr>   <chr>           <dbl>         <dbl>            <dbl>
##  1 male  Adelie  Torge…           39.1          18.7              181
##  2 fema… Adelie  Torge…           39.5          17.4              186
##  3 fema… Adelie  Torge…           40.3          18                195
##  4 fema… Adelie  Torge…           36.7          19.3              193
##  5 male  Adelie  Torge…           39.3          20.6              190
##  6 fema… Adelie  Torge…           38.9          17.8              181
##  7 male  Adelie  Torge…           39.2          19.6              195
##  8 <NA>  Adelie  Torge…           34.1          18.1              193
##  9 <NA>  Adelie  Torge…           42            20.2              190
## 10 <NA>  Adelie  Torge…           37.8          17.1              186
## # … with 332 more rows, and 1 more variable: body_mass_g <dbl>
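The same idea extends to more than one column. As a minimal sketch, assuming a made-up tibble named toy in place of the penguins data, drop_na() can take several column names, or a tidyselect helper such as starts_with(), and removes a row if any of the selected columns is missing.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data standing in for penguins (values are made up)
toy <- tibble(
  sex            = c("male", NA, "female", NA),
  bill_length_mm = c(39.1, 34.1, NA, NA),
  body_mass_g    = c(3750, 3475, 3250, NA)
)

# Drop rows where sex or bill_length_mm is missing
both <- toy %>%
  drop_na(sex, bill_length_mm)

# Drop rows with a missing value in any column whose name starts with "bill"
by_prefix <- toy %>%
  drop_na(starts_with("bill"))

nrow(both)
nrow(by_prefix)
```

Here only the first row survives in both, while by_prefix keeps the first two rows because the missing sex value in the second row is ignored.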