Check for duplicate sources

check_duplicate_sources(
  primarySources,
  secondarySources = NULL,
  useStringDistances = FALSE,
  stringDistance = 5,
  stringDistanceMethod = "osa",
  charsToZap = "[^A-Za-z0-9]",
  doiCol = "doi",
  matchFully = c("year", "title", "author"),
  matchStart = c(title = 40, author = 30),
  matchEnd = c(title = 40, author = 30),
  forDeduplicationSuffix = "_forDeduplication",
  returnRawStringDistances = FALSE,
  silent = metabefor::opts$get("silent")
)

Arguments

primarySources

The primary dataframe with sources

secondarySources

The secondary dataframe with sources

useStringDistances

Whether to use string distances. Note that this can be very slow if you have thousands of sources.

stringDistance

The string distance for titles

stringDistanceMethod

Method to use for string distance computation
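To give a feel for what a string distance between titles looks like, here is a minimal sketch using base R's utils::adist(). This is a stand-in chosen so the example runs without extra packages: adist() computes Levenshtein distances, whereas the default method here is "osa", which additionally counts a swap of two adjacent characters as a single edit. Treating the stringDistance value of 5 as a cut-off, as the sketch does, is an assumption for illustration and not a description of the function's internals.

```r
# Illustrative titles (made up for this example)
titles <- c(
  "Effects of sleep on memory",
  "Effect of sleep on memory",   # one character missing
  "Sleep and memory: a review"
)

# Pairwise Levenshtein distances (base R stand-in, NOT the "osa" method)
distances <- utils::adist(titles)

# Titles 1 and 2 differ by a single deleted character
distances[1, 2]

# Assumed cut-off of 5: list each pair at or below it once
candidatePairs <- which(
  distances <= 5 & upper.tri(distances),
  arr.ind = TRUE
)
candidatePairs

# Levenshtein counts an adjacent transposition as two edits;
# "osa" would count it as one
utils::adist("teh", "the")[1, 1]   # 2
```

Because every title is compared with every other title, the number of comparisons grows quadratically with the number of sources, which is why useStringDistances can be slow.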

charsToZap

The characters to delete from fields before looking for duplicates
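The default pattern is a regular expression that matches everything except letters and digits. A minimal sketch of its effect, using base R's gsub() on two made-up titles (whether the function also normalizes letter case is not stated here, so the tolower() step is shown only as an assumption):

```r
charsToZap <- "[^A-Za-z0-9]"

titleA <- "Meta-analysis: A Review (2nd ed.)"
titleB <- "Meta analysis - a review, 2nd ed"

# Delete every character matched by the pattern
zappedA <- gsub(charsToZap, "", titleA)
zappedB <- gsub(charsToZap, "", titleB)

zappedA   # "MetaanalysisAReview2nded"
zappedB   # "Metaanalysisareview2nded"

# Punctuation and spacing no longer differ, but letter case still does
identical(zappedA, zappedB)                     # FALSE
identical(tolower(zappedA), tolower(zappedB))   # TRUE (assumed extra step)
```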

doiCol

The name of the column with the DOIs

matchFully

A vector of columns to check for full matches (after 'zapping'). Pass NULL to not check any columns.

matchStart, matchEnd

Named vectors with columns and numbers of characters to check from the start and from the end. Because requiring full matches can be too conservative, you can also look at only the first or last X characters. Pass NULL to not check from the start or from the end, or pass named vectors where the names are the column names and the elements are the corresponding numbers of characters to look at for each column.
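The idea of comparing only the first or last X characters can be sketched in base R as follows. The helper functions firstChars() and lastChars() and the example data are hypothetical and only illustrate the principle; they are not taken from the package.

```r
matchStart <- c(title = 40, author = 30)
matchEnd   <- c(title = 40, author = 30)

# Hypothetical helpers: first and last n characters of each string
firstChars <- function(x, n) substr(x, 1, n)
lastChars  <- function(x, n) substr(x, nchar(x) - n + 1, nchar(x))

# Two made-up records whose (already 'zapped') titles share a long
# start but end differently
sources <- data.frame(
  title = c(
    "MotivationalinterviewingforsmokingcessationAsystematicreview",
    "MotivationalinterviewingforsmokingcessationAmetaanalysis"
  ),
  author = c("SmithJDoeA", "SmithJDoeA"),
  stringsAsFactors = FALSE
)

# For each named column, compare the two records on the first n characters
startMatches <- sapply(names(matchStart), function(col) {
  starts <- firstChars(sources[[col]], matchStart[[col]])
  starts[1] == starts[2]
})

# ... and on the last n characters
endMatches <- sapply(names(matchEnd), function(col) {
  ends <- lastChars(sources[[col]], matchEnd[[col]])
  ends[1] == ends[2]
})

startMatches   # title and author both match at the start
endMatches     # only author matches at the end
```

Strings shorter than the requested number of characters are compared in full, since substr() returns whatever characters are available.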

forDeduplicationSuffix

Suffix to add to optional deduplication columns

returnRawStringDistances

Whether to return the raw string distances (the resulting object can be very large).

silent

Whether to be silent or chatty.

Value

A vector indicating for each record whether it's a duplicate, with an attribute called duplicateInfo that holds more detailed information and that can be accessed using the attributes() function.
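A minimal sketch of working with a result of this shape. The vector below is a mock built by hand, and the contents of its duplicateInfo attribute are a hypothetical placeholder, since the structure of that attribute is not described here.

```r
# Mock result standing in for the return value (not produced by the package)
isDuplicate <- c(FALSE, TRUE, FALSE)
attr(isDuplicate, "duplicateInfo") <- list(
  note = "hypothetical placeholder"
)

# Access the detailed information via attributes() ...
info <- attributes(isDuplicate)$duplicateInfo

# ... or, equivalently, via attr()
identical(info, attr(isDuplicate, "duplicateInfo"))   # TRUE

# Keep only the records not flagged as duplicates
sources <- data.frame(
  title = c("Study A", "Study A", "Study B"),
  stringsAsFactors = FALSE
)
uniqueSources <- sources[!as.vector(isDuplicate), , drop = FALSE]
nrow(uniqueSources)   # 2
```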