Find and replace in R data frames

how-to
wrangling
Author

Nathan Craig

Published

October 21, 2023

Abstract

One of the most common data wrangling tasks is string manipulation: finding a value within a column of a data frame and changing it to something else. This post looks at the base R functions sub() and gsub() as well as functions in the tidyverse libraries stringr and dplyr.

Often one wants to work with a data set in which the values in a column were not entered consistently. For example, say there is a column for material that contains the values wood, Wood, and WooD. Effectively we have three separate factor levels here, but really they are just inconsistent spellings of the same value. We want them all to say wood because that allows more accurate grouping for purposes like summarizing or graphing. So how does one enforce consistency in strings (and by extension factors) in R? It turns out there are a few ways. Here we look at the base R functions sub() (replaces the first match in each element of a vector) and gsub() (replaces every match in each element) as well as the stringr functions str_replace() (first match in each element) and str_replace_all() (every match in each element).
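As a minimal sketch of the wood example, assuming a made-up character vector named material (it is not part of the data used later in this post), base R alone can collapse the different capitalizations. Two options are shown: tolower() lower-cases every value, while gsub() with ignore.case = TRUE only touches values that spell wood.

```r
# Hypothetical material column with inconsistent capitalization
material <- c("wood", "Wood", "WooD", "stone")

# Each spelling of wood is counted as its own group
table(material)

# Option 1: lower-case every value in the vector
material_lower <- tolower(material)

# Option 2: rewrite any capitalization of "wood", leave other values alone
material_fixed <- gsub("^wood$", "wood", material, ignore.case = TRUE)

# After cleaning there are only two groups: stone and wood
table(material_fixed)
```

Option 1 is simpler, but it also changes the case of every other value in the column, which may or may not be wanted.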

library(dplyr)
library(stringr)

Create some data

df <- tribble(
  ~col1, ~col2,
  "a",   1,
  "b",   2,
  "c",   3,
  "ab", 4,
  "ac", 5,
  "ag", 6)

gsub() from base R

There are two key functions: sub(), which replaces the first match in each element of a vector, and gsub(), which replaces all matches. The basic syntax is gsub(pattern, replacement, x). There are other handy arguments like ignore.case and fixed. Some examples…

Change every instance of a in col1 to X.

gsub("a", "X", df$col1)
[1] "X"  "b"  "c"  "Xb" "Xc" "Xg"

To apply this change to the column, assign the result back: df$col1 <- gsub("a", "X", df$col1)
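In the data above no value contains more than one a, so sub() and gsub() would give the same result. To make the difference visible, here is a small sketch on a made-up vector x (an assumption for illustration, not part of df) where one element contains several matches.

```r
# Hypothetical vector; "banana" contains three instances of "a"
x <- c("banana", "apple", "cherry")

# sub() replaces only the first match within each element
first_only <- sub("a", "X", x)
first_only
# [1] "bXnana" "Xpple"  "cherry"

# gsub() replaces every match within each element
every_match <- gsub("a", "X", x)
every_match
# [1] "bXnXnX" "Xpple"  "cherry"
```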

Note that the first argument is treated as a regular expression, so patterns such as alternation with | work.

# ab OR ac
gsub("ab|ac", "D", df$col1)
[1] "a"  "b"  "c"  "D"  "D"  "ag"

Starts with a

gsub("^a", "X", df$col1)
[1] "X"  "b"  "c"  "Xb" "Xc" "Xg"

Ends with c or with g

gsub("c$|g$", "X", df$col1)
[1] "a"  "b"  "X"  "ab" "aX" "aX"
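Because the pattern is a regular expression, characters such as . have a special meaning. The sketch below, using a made-up vector named versions, shows how the fixed argument of gsub() treats the pattern as a literal string instead, and how escaping the character gives the same result.

```r
# Hypothetical values that contain a literal dot
versions <- c("v1.0", "v2.5", "v10")

# As a regular expression, "." matches any character, so everything is replaced
as_regex <- gsub(".", "_", versions)
as_regex
# [1] "____" "____" "___"

# fixed = TRUE matches the dot literally
as_literal <- gsub(".", "_", versions, fixed = TRUE)
as_literal
# [1] "v1_0" "v2_5" "v10"

# Escaping the dot in the regular expression is equivalent
as_escaped <- gsub("\\.", "_", versions)
```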

stringr::str_replace_all()

Use stringr::str_replace_all() to perform find and replace. As with gsub(), the pattern is a regular expression, so | acts as the or operator.

str_replace_all(df$col1, pattern = "ab|ac", replacement = "D")
[1] "a"  "b"  "c"  "D"  "D"  "ag"

This can be assigned back to the column: df$col1 <- str_replace_all(df$col1, pattern = "ab|ac", replacement = "D")
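stringr also provides str_replace(), which stops after the first match in each element, much like sub() in base R. A short sketch on a made-up vector x (not part of df) shows the two side by side.

```r
library(stringr)

# Hypothetical vector; "banana" contains three instances of "a"
x <- c("banana", "apple")

# str_replace() changes only the first match in each element
first_only <- str_replace(x, pattern = "a", replacement = "X")
first_only
# [1] "bXnana" "Xpple"

# str_replace_all() changes every match in each element
every_match <- str_replace_all(x, pattern = "a", replacement = "X")
every_match
# [1] "bXnXnX" "Xpple"
```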

Use stringr::str_replace_all() inside a dplyr::mutate() call to perform find and replace on a column. Several replacements can be chained in the same call.

df %>% 
  mutate(col1 = str_replace_all(col1, pattern = "ab|ac", replacement = "D"),
         col1 = str_replace_all(col1, pattern = "ag", replacement = "E"))
# A tibble: 6 × 2
  col1   col2
  <chr> <dbl>
1 a         1
2 b         2
3 c         3
4 D         4
5 D         5
6 E         6
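When several replacements are needed, str_replace_all() also accepts a named character vector in which each name is a pattern and each value is its replacement. The sketch below rebuilds the same data and produces the same result as the two chained calls; the name replacements is just an illustrative choice.

```r
library(dplyr)
library(stringr)

# Same example data as above
df <- tribble(
  ~col1, ~col2,
  "a",   1,
  "b",   2,
  "c",   3,
  "ab",  4,
  "ac",  5,
  "ag",  6)

# Names are patterns, values are replacements; they are applied in order
replacements <- c("ab|ac" = "D", "ag" = "E")

result <- df %>%
  mutate(col1 = str_replace_all(col1, replacements))

result$col1
# [1] "a" "b" "c" "D" "D" "E"
```

Because the pairs are applied one after another, an earlier replacement can create text that a later pattern then matches, so the order of the vector can matter.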

Regular expression ^ for starts with.

df %>% 
  mutate(col1 = str_replace_all(col1, pattern = "^a", replacement = "E"))
# A tibble: 6 × 2
  col1   col2
  <chr> <dbl>
1 E         1
2 b         2
3 c         3
4 Eb        4
5 Ec        5
6 Eg        6

Regular expression $ for ends with.

df %>% 
  mutate(col1 = str_replace_all(col1, pattern = "c$", replacement = "E"))
# A tibble: 6 × 2
  col1   col2
  <chr> <dbl>
1 a         1
2 b         2
3 E         3
4 ab        4
5 aE        5
6 ag        6

Combine ^ (starts with) and $ (ends with) to match only values that are exactly ac from start to end.

df %>% 
  mutate(col1 = str_replace_all(col1, pattern = "^ac$", replacement = "E"))
# A tibble: 6 × 2
  col1   col2
  <chr> <dbl>
1 a         1
2 b         2
3 c         3
4 ab        4
5 E         5
6 ag        6

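If there is a long list of strings that all need to conform to a single value, the base R function replace() combined with %in% is an option to consider. Unlike the pattern-based functions above, this matches whole values exactly rather than regular expressions.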

change_vector <- c("ab", "ac", "ag")

df |> mutate(col1 = replace(col1, col1 %in% change_vector, "D"))
# A tibble: 6 × 2
  col1   col2
  <chr> <dbl>
1 a         1
2 b         2
3 c         3
4 D         4
5 D         5
6 D         6
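The same whole-value replacement can be written without dplyr by using logical indexing. This is a sketch under the assumption that the data is held in a plain data.frame built here for illustration; it changes the column in place.

```r
# Plain data frame version of the example data
df <- data.frame(
  col1 = c("a", "b", "c", "ab", "ac", "ag"),
  col2 = 1:6
)

change_vector <- c("ab", "ac", "ag")

# Overwrite only the rows whose value appears in change_vector
df$col1[df$col1 %in% change_vector] <- "D"

df$col1
# [1] "a" "b" "c" "D" "D" "D"
```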

Citation

BibTeX citation:
@online{craig2023,
  author = {Craig, Nathan},
  title = {Find and Replace in {R} Data Frames},
  date = {2023-10-21},
  url = {https://nmc.quarto.pub/nmc/posts/2023-10-21-find-and-replace.html},
  langid = {en},
  abstract = {One of the most common data wrangling tasks is string
    manipulation involving the need to 1) within the column of a data
    frame 2) find a value and change it to something else. This post
    looks at base `R` functions `sub()` and `gsub` as well as functions
    in the `tidyverse` libraries `stringr` and `dplyr`.}
}
For attribution, please cite this work as:
Craig, Nathan. 2023. “Find and Replace in R Data Frames.” October 21, 2023. https://nmc.quarto.pub/nmc/posts/2023-10-21-find-and-replace.html.