Searching for strings in vectors with grep()

how-to
wrangling
Author

Nathan Craig

Published

October 20, 2023

Abstract

Searching for strings in a list or a column that meet a given criterion is a common task. These are some notes on how to do it.

I found a recent blog post by Maëlle Salmon very useful and worked my way through part of it. That part deals with grep, whose name comes from the ed command g/re/p ("globally search for a regular expression and print"). It was originally developed for the Unix operating system, ports of it can be run from a Windows shell, and grep() is also a function in base R. We start with a vector of strings.

animals <- c("cat", "bird", "dog", "fish")

grepl() is a grep()-related R function that returns a logical vector indicating which elements match the pattern.

grepl("i", animals)
[1] FALSE  TRUE FALSE  TRUE
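A logical vector like this is mostly useful as a mask. The short sketch below, which reuses the animals vector and introduces a helper name has_i purely for illustration, shows three common things to do with it: subset, count, and negate.

```r
animals <- c("cat", "bird", "dog", "fish")

# Logical mask: TRUE where the element contains "i"
has_i <- grepl("i", animals)

# Subset the original vector with the mask
animals[has_i]

# Count the matches (each TRUE counts as 1)
sum(has_i)

# Negate the mask to keep the elements that do not match
animals[!has_i]
```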

It is possible to get the indices from grepl() by wrapping it in which(). However, this isn’t good practice, because it adds a step that grep() already performs. It is better to use grep() directly.

which(grepl("i", animals))
[1] 2 4
grep("i", animals)
[1] 2 4
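As a quick check that the two approaches agree on this vector, the sketch below compares them with identical(). It also shows the invert argument of grep(), which returns the indices of the elements that do not match. The names idx_which and idx_grep are made up for the example.

```r
animals <- c("cat", "bird", "dog", "fish")

# Two routes to the same indices
idx_which <- which(grepl("i", animals))
idx_grep  <- grep("i", animals)

# TRUE: same values and same type
identical(idx_which, idx_grep)

# invert = TRUE gives the indices of the non-matching elements
grep("i", animals, invert = TRUE)
```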

To return the items rather than their indices, set value = TRUE.

grep("i", animals, value = TRUE)
[1] "bird" "fish"
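Because the pattern is a regular expression, a few extra options are worth knowing. The sketch below is only illustrative: it adds a made-up element "Iguana" to the vector and a made-up files vector to show anchors, case-insensitive matching, and literal matching with fixed = TRUE.

```r
# "Iguana" is added here only to demonstrate case handling
animals <- c("cat", "bird", "dog", "fish", "Iguana")

# "^" anchors the pattern to the start of the string
starts_f <- grep("^f", animals, value = TRUE)   # "fish"

# "$" anchors the pattern to the end of the string
ends_d <- grep("d$", animals, value = TRUE)     # "bird"

# ignore.case = TRUE matches both "i" and "I"
any_i <- grep("i", animals, value = TRUE, ignore.case = TRUE)

# In a regular expression "." matches any character, so both names match
files <- c("a.txt", "abtxt")
regex_hits <- grep(".txt", files, value = TRUE)

# fixed = TRUE treats the pattern as a literal string, so only "a.txt" matches
literal_hits <- grep(".txt", files, value = TRUE, fixed = TRUE)
```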

Now to do this with a column of a data frame.

df <- dplyr::starwars

First, the indices.

grep("i", df$name)
 [1]  5  7  9 10 11 12 16 17 18 20 24 29 30 31 33 34 35 38 42 44 49 50 51 52 54
[26] 55 61 63 64 67 76 77 78 80 82 83 87

Then the values.

grep("i", df$name, value = TRUE)
 [1] "Leia Organa"           "Beru Whitesun Lars"    "Biggs Darklighter"    
 [4] "Obi-Wan Kenobi"        "Anakin Skywalker"      "Wilhuff Tarkin"       
 [7] "Jabba Desilijic Tiure" "Wedge Antilles"        "Jek Tono Porkins"     
[10] "Palpatine"             "Lando Calrissian"      "Wicket Systri Warrick"
[13] "Nien Nunb"             "Qui-Gon Jinn"          "Finis Valorum"        
[16] "Padmé Amidala"         "Jar Jar Binks"         "Ric Olié"             
[19] "Shmi Skywalker"        "Bib Fortuna"           "Ben Quadinaros"       
[22] "Mace Windu"            "Ki-Adi-Mundi"          "Kit Fisto"            
[25] "Adi Gallia"            "Saesee Tiin"           "Cliegg Lars"          
[28] "Luminara Unduli"       "Barriss Offee"         "Bail Prestor Organa"  
[31] "San Hill"              "Shaak Ti"              "Grievous"             
[34] "Raymus Antilles"       "Tion Medon"            "Finn"                 
[37] "Captain Phasma"       
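Often the goal is to keep whole rows, not just the matching names. A minimal sketch of that, assuming a small hand-typed data frame called people (an illustrative stand-in so the example runs without dplyr; the heights are placeholder values), uses grepl() as a row filter.

```r
# Small illustrative data frame; the values are placeholders, not a real data set
people <- data.frame(
  name   = c("Leia Organa", "Luke Skywalker", "Obi-Wan Kenobi", "Han Solo"),
  height = c(150, 172, 182, 180)
)

# grepl() returns one TRUE/FALSE per row, so it can index the rows directly
matches <- people[grepl("i", people$name), ]
matches

# The same filter written with base R's subset()
matches_subset <- subset(people, grepl("i", name))
```

The same pattern should work on df from above, for example df[grepl("i", df$name), ].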

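Two other functions Maëlle mentioned are startsWith() and endsWith(). Unlike grep(), they match a literal prefix or suffix rather than a regular expression, and like grepl() they return a logical vector.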

startsWith(df$name, "B")
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[49]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE  TRUE FALSE

Use which() to get the indices.

which(startsWith(df$name, "B"))
[1]  7  9 21 23 44 49 64 67 86

Use brackets [] to get the values.

df$name[startsWith(df$name, "B")]
[1] "Beru Whitesun Lars"  "Biggs Darklighter"   "Boba Fett"          
[4] "Bossk"               "Bib Fortuna"         "Ben Quadinaros"     
[7] "Barriss Offee"       "Bail Prestor Organa" "BB8"                
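endsWith() works the same way at the other end of the string. The sketch below uses a short made-up vector, names_vec, to show endsWith(), to compare startsWith() with the equivalent anchored regular expression, and to show that the prefix is taken literally.

```r
# Short illustrative vector so the example is self-contained
names_vec <- c("Boba Fett", "Bossk", "Jar Jar Binks", "BB8")

# Logical vector: which names end in "k"?
ends_k <- endsWith(names_vec, "k")

# Subset with the mask to get the values
names_vec[endsWith(names_vec, "s")]

# For a plain prefix, startsWith() agrees with an anchored regular expression
same <- identical(startsWith(names_vec, "B"), grepl("^B", names_vec))

# The prefix is literal: "." means a dot here, not "any character"
dots <- c(".hidden", "visible")
literal_dot <- startsWith(dots, ".")
regex_dot   <- grepl("^.", dots)
```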

Extract operators:

Both help("$") and help("[") pull up the same manual page, which documents the extract operators. Note that the operators have to be quoted.
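The quoting is needed because the operators are themselves functions with unusual names. As a small illustration (the people data frame is made up for the example), the sketch below calls them by name using backticks and checks that the result matches the usual bracket and dollar syntax.

```r
animals <- c("cat", "bird", "dog", "fish")

# animals[2] is a call to the function named "["
second <- `[`(animals, 2)
identical(second, animals[2])

# Illustrative data frame with a single column
people <- data.frame(name = c("Leia Organa", "Han Solo"))

# people$name and people[["name"]] extract the same column
identical(people$name, people[["name"]])
```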

Citation

BibTeX citation:
@online{craig2023,
  author = {Craig, Nathan},
  title = {Searching for Strings in Vectors with `Grep()`},
  date = {2023-10-20},
  url = {https://nmc.quarto.pub/nmc/posts/2023-10-20-grep-thoughts.html},
  langid = {en},
  abstract = {Searching for strings in a list or a column that meet a
    given criteria is a common task. These are some notes on how to do
    it.}
}
For attribution, please cite this work as:
Craig, Nathan. 2023. “Searching for Strings in Vectors with `Grep()`.” October 20, 2023. https://nmc.quarto.pub/nmc/posts/2023-10-20-grep-thoughts.html.