# load libraries
library(readxl)
library(dplyr)
library(janitor)
library(stringr)
library(forcats)
Duplicate Detection

This document outlines how to check for duplicate accession and object identification numbers in PastPerfect. Mixed use of two- and four-digit year encoding in accession numbers and object identification numbers resulted in duplicate records. This document outlines an approach to identifying and reporting these kinds of duplicate records. Its purpose is to support duplicate record merging in PastPerfect or similar situations, and the general approach to duplicate record identification could be applied in a variety of database deduping contexts.
GO TO TABLE: Section 3.1
Problem
I need to find all duplicate object records in a database so that information from the two records (including linked tables) can be merged into a single comprehensive record.
Information about museum objects is tracked by object numbers (Simmons and Kiser 2020; NPS Information Management Team (IMT) 2019; Chenhall 1975). Museum accession numbers are generally based on a binomial system following the formula `year.accession`, while object numbers are based on a trinomial system following the formula `year.acquisition.object` (Simmons and Kiser 2020, pt. 4). This system encodes key pieces of information about accessions and objects. Each object number should be completely unique, and the museum database should contain no duplicate records.
Unfortunately, over time different individuals registering objects in PastPerfect inconsistently entered year values in both accession and object id fields. In earlier cases, year was entered as a two-digit value (i.e., `##`); now year is entered as a four-digit value (i.e., `####`). Additionally, several individuals recently added new records to the database without first checking whether existing records had been logged with two-digit year encoding. As a result, there are now several duplicate records in the database: early records have two-digit year encoding and the subsequent duplicates have four digits. Each record contains independent linked information, like location and conservation history, so it is valuable to merge this information into a single comprehensive canonical record. To get there, one must first identify the duplicate records (Figure 1).
So that this matter does not arise again, identify all records with two-digit year encoding and modify the accessions and accession numbers in PastPerfect.
Methods
The following reads the data, collapses four-digit year encoding into a new column, establishes counts by this new collapsed number, identifies ids with multiple entries, isolates these in a new table, and joins them to the complete record. That product gets reported as a reactive `DT` table.
Read and Clean Data
This probably loads an unnecessary number of libraries, but it deploys verbs I'm familiar with. For example, I believe base `tapply` could substitute for the `dplyr` `group_by()` and `summarise()` functions.
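As a sketch of that substitution, on hypothetical toy data (the real data come from the PastPerfect export below), base `table()` can reproduce the `group_by()` + `summarise(n = n())` counting step:

```r
# Base R alternative to group_by() + summarise(n = n()):
# table() tallies each objectid2 value; as.data.frame() reshapes
# the tally into a two-column data frame of id and count.
df_toy <- data.frame(objectid2 = c("91.45.03", "91.45.03", "02.34.55"))

counts <- as.data.frame(table(objectid2 = df_toy$objectid2),
                        responseName = "n",
                        stringsAsFactors = FALSE)
counts  # one row per id, with its record count
```

The `dplyr` version used below keeps the rest of the pipeline in one grammar, which is why I stayed with it.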
Data were first exported from PastPerfect. This export file is read into R. From there, field types are assigned and names are cleaned consistent with R conventions. A separate script is called to do some standard wrangling.
```r
# Read data
df <- read_excel(here::here("posts/2023-09-28-duplicate-detection/data/inventory_2023_02_23.xlsx")) %>%
  janitor::clean_names()

# Clean columns
source(here::here("posts/2023-09-28-duplicate-detection/scripts/past_perfect_clean_fields.R"))
```
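As a small illustration of the name-cleaning step (toy column names here, not the actual export fields), `janitor::clean_names()` converts headers to snake_case consistent with R conventions:

```r
library(janitor)

# Toy export-style headers; clean_names() lowercases them and
# replaces spaces with underscores.
raw <- data.frame(`Object ID` = 1, `Flag Date` = "2023-03-01",
                  check.names = FALSE)
cleaned <- clean_names(raw)
names(cleaned)  # snake_case versions of the original headers
```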
Create New Object Number w. Truncated Year
To compare two-digit and four-digit year encoding, four-digit year encodings were stripped to two digits. Doing this relies on a regular expression, or regex. To build the proper expression I had to fiddle a bit, working my way to it by making a minimal reproducible example that isolated the problem (see Section 4).
```r
# Add column composed of two digit objectid
df$objectid2 <- str_remove_all(df$objectid, "^19|^20")
```
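One caveat worth noting (my own observation, not part of the original workflow): an id whose two-digit year legitimately begins with `19` or `20`, e.g. `19.45.03`, would also have its leading digits stripped by `"^19|^20"`. A lookahead variant, sketched here under the assumption that four-digit ids always begin `19##.` or `20##.`, strips a century only when two more digits and a period follow:

```r
library(stringr)

# Strip a leading century only when it is followed by a two-digit
# year and a period, leaving genuine two-digit ids untouched.
ids <- c("1991.45.03", "19.45.03", "2002.34.55")
str_remove(ids, "^(19|20)(?=\\d{2}\\.)")
```

Because the pattern is anchored with `^`, `str_remove()` suffices; `str_remove_all()` cannot match more than once per string anyway.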
Summarize New Object Number w. Truncated Year
With two-digit year encoding in place, group by the new object id field and summarise based on that grouping.
```r
# Get duplicates
df_objectid2 <- df |>
  group_by(objectid2) |>
  summarise(n = n())
```
Isolate Duplicates
Filter only those object numbers whose counts are greater than 1. This results in a list of duplicate records.
```r
# Isolate duplicates
df_dupes <- df_objectid2 |>
  filter(n > 1)
```
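From the duplicate summary, both the number of affected ids and the total number of records involved fall out directly (toy values here; the real counts come from the PastPerfect data):

```r
# Toy duplicate summary in the same shape as df_dupes
df_dupes_toy <- data.frame(objectid2 = c("91.45.03", "02.34.55"),
                           n = c(2L, 3L))

n_ids     <- nrow(df_dupes_toy)   # distinct ids carrying duplicates
n_records <- sum(df_dupes_toy$n)  # total records sharing those ids
```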
Join Duplicate Table with Complete Record
Join the duplicate record list to the full set of object attributes.
```r
# Join table of duplicates with full record info
df_joined <- left_join(df_dupes, df, by = "objectid2", multiple = "all")
```
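A minimal sketch of the join's behavior on hypothetical rows: with `multiple = "all"`, every record sharing a duplicated `objectid2` is kept, so each physical record's attributes survive for the later merge:

```r
library(dplyr)

# Two records share objectid2 "91.45.03"; the join keeps both.
full  <- data.frame(objectid2 = c("91.45.03", "91.45.03", "02.34.55"),
                    objname   = c("Basket", "Basket", "Bowl"))
dupes <- data.frame(objectid2 = "91.45.03", n = 2L)

joined <- left_join(dupes, full, by = "objectid2", multiple = "all")
nrow(joined)  # both matching rows are retained
```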
Results
There are 44 duplicate records in the database.
Plot Reactive Table Using JavaScript
2023-03-01: `flagdate` is not reporting properly. Need to track that down.
```r
# Other fields: catby, invnby, flagdate
df_joined |>
  select(objectid2, catdate, invndate, flagnotes, flagreason, objname) |>
  DT::datatable()
```
Reproducible Example
The following is a reproducible example illustrating the key pieces of the procedure above. Making a small example that isolates the problem is a really good way to find a solution.
```r
x <- c("1991.45.03",
       "91.45.03",
       "2002.34.55",
       "2002.34.75")
```
Return only numbers that begin with 19, using `stringr` verbs.
```r
str_detect(x, "^19")
#> [1]  TRUE FALSE FALSE FALSE
```
```r
str_starts(x, "19")
#> [1]  TRUE FALSE FALSE FALSE
```
Use `stringr::str_remove_all` with the regex to strip four-digit year encoding to two digits.
```r
# Removes leading digits when objectid starts with 19 or 20
str_remove_all(x, "^19|^20")
#> [1] "91.45.03" "91.45.03" "02.34.55" "02.34.75"
```
References
Citation

```bibtex
@online{craig2023,
  author = {Craig, Nathan},
  title = {Duplicate {Detection}},
  date = {2023-09-28},
  url = {https://nmc.quarto.pub/nmc/posts/2023-09-28-duplicate-detection},
  langid = {en},
  abstract = {This document outlines how to check for duplicate
    accession and object identification numbers in PastPerfect. Mixed
    use of two and four digit encoding of year in accession number and
    object identification number resulted in duplicate records. This
    document outlines an approach to identifying and reporting these
    kinds of duplicate records. The document's purpose is to support
    duplicate record merging in PastPerfect or other similar situations.
    The general approach to duplicate record identification could be
    applied in a variety of database deduping contexts.}
}
```