check_duplicate_sources.Rd
Check for duplicate sources
check_duplicate_sources(
  primarySources,
  secondarySources = NULL,
  useStringDistances = FALSE,
  stringDistance = 5,
  stringDistanceMethod = "osa",
  charsToZap = "[^A-Za-z0-9]",
  doiCol = "doi",
  matchFully = c("year", "title", "author"),
  matchStart = c(title = 40, author = 30),
  matchEnd = c(title = 40, author = 30),
  forDeduplicationSuffix = "_forDeduplication",
  returnRawStringDistances = FALSE,
  silent = metabefor::opts$get("silent")
)
The primary dataframe with sources
The secondary dataframe with sources
Whether to use string distances. Note that this can be very slow if you have thousands of sources.
The string distance for titles
Method to use for string distance computation
The characters to delete from fields before looking for duplicates
The name of the column with the DOIs
A vector of columns to check for full matches (after 'zapping'). Pass NULL to not check any columns.
Named vectors with columns and numbers of characters to check from the start and from the end. Because requiring full matches can be too conservative, you can also look at the first or last X characters. Pass NULL to not check from the start and from the end, or pass named vectors where the names are the column names and the elements are the corresponding numbers of characters to look at for each column.
Suffix to add to optional deduplication columns
Whether to return the raw string distances or not (this can be very large).
Whether to be silent or chatty.
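The arguments above describe three kinds of comparison: full matches after 'zapping', matches on the first or last X characters, and string distances. The following base R sketch illustrates those ideas on a few made-up titles. It is not metabefor's implementation: the helpers zap(), first_n() and last_n() are hypothetical names, the "at most stringDistance" threshold is an assumption, and utils::adist() computes a generalized Levenshtein distance, which is used here only as a stand-in for the "osa" method named in the defaults.

```r
# Illustrative base R sketch; not metabefor's actual implementation.
# zap(), first_n() and last_n() are hypothetical helper names.
zap <- function(x, charsToZap = "[^A-Za-z0-9]") {
  # Delete every character matched by the regular expression
  gsub(charsToZap, "", x)
}
first_n <- function(x, n) substr(x, 1, n)
last_n  <- function(x, n) substr(x, nchar(x) - n + 1, nchar(x))

# Made-up example titles
titles <- c(
  "Effects of sleep on memory: a meta-analysis",
  "Effects of sleep on memory - a meta-analysis.",
  "Effects of sleep on attention"
)

zapped <- zap(titles)

# Full match: titles 1 and 2 are identical once punctuation and spaces are gone.
# duplicated() flags only the later occurrence of each repeated value.
fullMatch <- duplicated(zapped)

# Start and end matches: comparing fewer characters flags more records
startMatch20 <- duplicated(first_n(zapped, 20))
startMatch16 <- duplicated(first_n(zapped, 16))
endMatch12   <- duplicated(last_n(zapped, 12))

# String distance: a one-character typo gives a small edit distance.
# Assumption: records count as duplicates when the distance is at most 5.
d <- utils::adist("sleepandmemory", "sleepandmemmory")[1, 1]
withinDistance <- d <= 5
```

In this sketch, matching on the first 16 characters also flags the third title, because all three titles share the prefix "Effectsofsleepon", whereas matching on 20 characters does not. This is the trade-off controlled by the numbers passed in matchStart and matchEnd.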
A vector indicating for each record whether it's a duplicate, with an attribute called duplicateInfo that holds more detailed information and that can be accessed using the attributes() function.
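As a minimal sketch of how an attribute like this can be read in R, the example below builds a hypothetical stand-in for the returned vector. The contents of duplicateInfo shown here are a placeholder, since its actual structure is not described on this page.

```r
# Hypothetical stand-in for the value returned by check_duplicate_sources();
# the real contents of duplicateInfo are not documented here.
res <- c(FALSE, TRUE, FALSE)
attr(res, "duplicateInfo") <- list(note = "placeholder for detailed information")

# Two equivalent ways to retrieve the attribute
info  <- attributes(res)$duplicateInfo
info2 <- attr(res, "duplicateInfo", exact = TRUE)

# The vector itself can still be used directly, for example to count duplicates
nDuplicates <- sum(res)
```

attr() with exact = TRUE avoids partial matching of attribute names, which can be safer than the $ shortcut when several attributes share a prefix.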