# load libraries
library(readxl)
library(dplyr)
library(janitor)
library(stringr)
library(forcats)
Duplicate Detection

This document outlines how to check for duplicate accession and object identification numbers in PastPerfect. Mixed use of two- and four-digit year encoding in accession numbers and object identification numbers resulted in duplicate records. This document outlines an approach to identifying and reporting these kinds of duplicate records. Its purpose is to support duplicate record merging in PastPerfect or similar situations, and the general approach to duplicate record identification could be applied in a variety of database deduping contexts.
GO TO TABLE: Section 3.1
Problem
I need to find all duplicate object records in a database so that information from the two records (including linked tables) can be merged into a single comprehensive record.
Information about museum objects is tracked by object numbers (Simmons and Kiser 2020; NPS Information Management Team (IMT) 2019; Chenhall 1975). Museum accession numbers are generally based on a binomial system following the formula `year.accession`, while object numbers are based on a trinomial system following the formula `year.acquisition.object` (Simmons and Kiser 2020, pt. 4). This system encodes key pieces of information about accessions and objects. Each object number should be completely unique, and the museum database should contain no duplicate records.
Unfortunately, over time different individuals registering objects in PastPerfect inconsistently entered year values in both accession and object id fields. In earlier cases, year was entered as a two-digit value (i.e., `##`); now year is entered as a four-digit value (i.e., `####`). Additionally, several individuals recently added new records to the database without first checking whether existing records had been logged with two-digit year encoding. As a result, there are now several duplicate records in the database: early records have two-digit year encoding and the subsequent duplicates have four digits. Each record contains independent linked information, like location and conservation history, so it is valuable to merge this information into a single comprehensive canonical record. To get there, one must first identify the duplicate records (Figure 1).
So that this matter does not arise again, identify all records with two-digit year encoding and modify the accessions and accession numbers in PastPerfect.
Methods
The following reads the data, collapses four-digit year encoding into a new column, establishes counts by this new collapsed number, identifies ids with multiple entries, isolates these in a new table, and joins them to the complete record. That product gets reported as a reactive `DT` table.
Read and Clean Data
This probably loads an unnecessary number of libraries, but it deploys verbs I'm familiar with. For example, I believe base `tapply` could substitute for the `dplyr` `group_by()` and `summarise()` functions.
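As a sketch of that substitution, on hypothetical toy data (the real data come from the PastPerfect export below), base `table()` can reproduce the `group_by()` + `summarise(n = n())` counting step:

```r
# Base R alternative to group_by() + summarise(n = n()):
# table() tallies each objectid2 value; as.data.frame() reshapes
# the tally into a two-column data frame of id and count.
df_toy <- data.frame(objectid2 = c("91.45.03", "91.45.03", "02.34.55"))

counts <- as.data.frame(table(objectid2 = df_toy$objectid2),
                        responseName = "n",
                        stringsAsFactors = FALSE)
counts  # one row per id, with its record count
```

The `dplyr` version used below keeps the rest of the pipeline in one grammar, which is why I stayed with it.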
Data were first exported from PastPerfect. This export file is read into R. From there, field types are assigned and names are cleaned consistent with R conventions. A separate script is called to do some standard wrangling.
```r
# Read data
df <- read_excel(here::here("posts/2023-09-28-duplicate-detection/data/inventory_2023_02_23.xlsx")) %>%
  janitor::clean_names()

# Clean columns
source(here::here("posts/2023-09-28-duplicate-detection/scripts/past_perfect_clean_fields.R"))
```
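As a small illustration of the name-cleaning step (toy column names here, not the actual export fields), `janitor::clean_names()` converts headers to snake_case consistent with R conventions:

```r
library(janitor)

# Toy export-style headers; clean_names() lowercases them and
# replaces spaces with underscores.
raw <- data.frame(`Object ID` = 1, `Flag Date` = "2023-03-01",
                  check.names = FALSE)
cleaned <- clean_names(raw)
names(cleaned)  # snake_case versions of the original headers
```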
Create New Object Number w. Truncated Year
To compare two-digit and four-digit year encoding, four-digit year encodings were stripped to two digits. Doing this relies on a regular expression, or regex. To build the proper expression I had to fiddle a bit, working my way to it by making a minimal reproducible example that isolated the problem (see Section 4).
```r
# Add column composed of two digit objectid
df$objectid2 <- str_remove_all(df$objectid, "^19|^20")
```
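One caveat worth noting (my own observation, not part of the original workflow): an id whose two-digit year legitimately begins with `19` or `20`, e.g. `19.45.03`, would also have its leading digits stripped by `"^19|^20"`. A lookahead variant, sketched here under the assumption that four-digit ids always begin `19##.` or `20##.`, strips a century only when two more digits and a period follow:

```r
library(stringr)

# Strip a leading century only when it is followed by a two-digit
# year and a period, leaving genuine two-digit ids untouched.
ids <- c("1991.45.03", "19.45.03", "2002.34.55")
str_remove(ids, "^(19|20)(?=\\d{2}\\.)")
```

Because the pattern is anchored with `^`, `str_remove()` suffices; `str_remove_all()` cannot match more than once per string anyway.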
Summarize New Object Number w. Truncated Year
With two-digit year encoding in place, group by the new object id field and summarise based on that grouping.
```r
# Get duplicates
df_objectid2 <- df |>
  group_by(objectid2) |>
  summarise(n = n())
```
Isolate Duplicates
Filter only those object numbers whose counts are greater than 1. This results in a list of duplicate records.
```r
# Isolate duplicates
df_dupes <- df_objectid2 |>
  filter(n > 1)
```
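From the duplicate summary, both the number of affected ids and the total number of records involved fall out directly (toy values here; the real counts come from the PastPerfect data):

```r
# Toy duplicate summary in the same shape as df_dupes
df_dupes_toy <- data.frame(objectid2 = c("91.45.03", "02.34.55"),
                           n = c(2L, 3L))

n_ids     <- nrow(df_dupes_toy)   # distinct ids carrying duplicates
n_records <- sum(df_dupes_toy$n)  # total records sharing those ids
```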
Join Duplicate Table with Complete Record
Join the duplicate record list to the full set of object attributes.
```r
# Join table of duplicates with full record info
df_joined <- left_join(df_dupes, df, by = "objectid2", multiple = "all")
```
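A minimal sketch of the join's behavior on hypothetical rows: with `multiple = "all"`, every record sharing a duplicated `objectid2` is kept, so each physical record's attributes survive for the later merge:

```r
library(dplyr)

# Two records share objectid2 "91.45.03"; the join keeps both.
full  <- data.frame(objectid2 = c("91.45.03", "91.45.03", "02.34.55"),
                    objname   = c("Basket", "Basket", "Bowl"))
dupes <- data.frame(objectid2 = "91.45.03", n = 2L)

joined <- left_join(dupes, full, by = "objectid2", multiple = "all")
nrow(joined)  # both matching rows are retained
```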
Results
There are 44 duplicate records in the database.
Plot Reactive Table Using JavaScript
2023-03-01: `flagdate` is not reporting properly. Need to track that down.
```r
# Other fields: catby, invnby, flagdate
df_joined |>
  select(objectid2, catdate, invndate, flagnotes, flagreason, objname) |>
  DT::datatable()
```
Reproducible Example
The following is a reproducible example illustrating the key pieces of the procedure above. Making a small example that isolates the problem is a really good way to find a solution.
```r
x <- c("1991.45.03",
       "91.45.03",
       "2002.34.55",
       "2002.34.75")
```
Return only numbers that begin with 19, using `stringr` verbs.
```r
str_detect(x, "^19")
#> [1]  TRUE FALSE FALSE FALSE
```
```r
str_starts(x, "19")
#> [1]  TRUE FALSE FALSE FALSE
```
Use `stringr::str_remove_all` with the regex to strip four-digit year encoding to two digits.
```r
# Removes leading digits when objectid starts with 19 or 20
str_remove_all(x, "^19|^20")
#> [1] "91.45.03" "91.45.03" "02.34.55" "02.34.75"
```
References
Citation

```bibtex
@online{craig2023,
  author = {Craig, Nathan},
  title = {Duplicate {Detection}},
  date = {2023-09-28},
  url = {https://nmc.quarto.pub/nmc/posts/2023-09-28-duplicate-detection},
  langid = {en},
  abstract = {This document outlines how to check for duplicate
    accession and object identification numbers in PastPerfect. Mixed
    use of two and four digit encoding of year in accession number and
    object identification number resulted in duplicate records. This
    document outlines an approach to identifying and reporting these
    kinds of duplicate records. The document's purpose is to support
    duplicate record merging in PastPerfect or other similar situations.
    The general approach to duplicate record identification could be
    applied in a variety of database deduping contexts.}
}
```