Dealing with compound fields

how-to
wrangling
Author

Nathan Craig

Published

October 13, 2022

Abstract

We have a table containing a compound field that is composed of multiple values separated by a delimeter. Here we are simulating data exported from PastPerfect museum catalog software. There are several such cases one might encounter: material, use, culture, etc.

Generally, it is not a good idea to put multiple values into a single table cell. This is not tidy and it is cumbersome to deal with fields so structured. However, given the constraints of some data entry systems, one may need to encode compound values to describe the observation. Suppose an individual is recording an object made of three materials but the data entry form only offers a single field (Figure 1). We don’t want to log just one value; we would rather capture all three materials. When confronted with such situations, people often enter some kind of delimited value like apple/pear/orange where / is the separator for three values. What follows is a way of dealing with this situation in R.

Figure 1: Past Perfect input screen with compound fields highlighted in red.
# Load libraries
library(tidyverse)
library(stringi)
# Construct data frame
df <- data.frame(
  id = 1:5,
  material = c(
    "stone",
    "stone/bone/sinew",
    "ceramic/paint",
    "metal/stone/bone/sinew/paint",
    NA
  ),
  location = "NM"
)

df
Table 1: Sample dataframe.
id material location
1 stone NM
2 stone/bone/sinew NM
3 ceramic/paint NM
4 metal/stone/bone/sinew/paint NM
5 NA NM

Now we have a data frame where material contains entries representing more than one value. With stringi R can parse that complex string into a list.

# Split the string into a list
df$material <- str_split(df$material, "/")

It is possible to access a given list item with unlist(). This can be done for individual records, ranges of records, or the entire column.

# Get the list item from the second record
df[2,][,2]
[[1]]
[1] "stone" "bone"  "sinew"
# Unpack the list associated with that record
unlist(df[2,][,2])
[1] "stone" "bone"  "sinew"
# Unpack the entire list column as a single list
x <- unlist(df[,2])
x
 [1] "stone"   "stone"   "bone"    "sinew"   "ceramic" "paint"   "metal"  
 [8] "stone"   "bone"    "sinew"   "paint"   NA       

Let’s turn the list of all values into its own data frame.

df_x <- tibble(x)

To see how many of each type of object we have, we can use tidyverse verbs to group by and summarize (Table 2).

df_x %>% group_by(x) %>% 
  summarize(n = n())
Table 2: Counts by value extracted from a compound field.
x n
bone 2
ceramic 1
metal 1
paint 2
sinew 2
stone 3
NA 1

Let’s search for a specific material and filter the table returning only records that meet the search criteria (Table 3).

# Test for the presence of a string
# returns a TRUE/FALSE vector
a <- stri_detect_regex(df$material, "stone")

# Get the numerical value of the record when TRUE
b <- which(a == TRUE)

# Call the rows based on the index
df[b,]
Table 3: Primary table filtered based on presence of search string in compound field.
id material location
1 1 stone NM
2 2 stone, bone , sinew NM
4 4 metal, stone, bone , sinew, paint NM

Citation

BibTeX citation:
@online{craig2022,
  author = {Craig, Nathan},
  title = {Dealing with Compound Fields},
  date = {2022-10-13},
  url = {https://nmc.quarto.pub/nmc/posts/2022-10-13-compound-fields},
  langid = {en},
  abstract = {We have a table containing a compound field that is
    composed of multiple values separated by a delimeter. Here we are
    simulating data exported from PastPerfect museum catalog software.
    There are several such cases one might encounter: material, use,
    culture, etc.}
}
For attribution, please cite this work as:
Craig, Nathan. 2022. “Dealing with Compound Fields.” October 13, 2022. https://nmc.quarto.pub/nmc/posts/2022-10-13-compound-fields.