Deduplicate reference libraries using the sweep method.
This library is intended to be used with Reflib compatible references.
// Simple example with an array of references
var Dedupe = require('@iebh/dedupe-sweep');
(new Dedupe())
.set('strategy', 'doiOnly')
.run([
{doi: 'https://doi.org/10.1000/182'},
{doi: '10.1000/182'},
])
.then(deduped => { /* ... */ })
// More complex example reading in a reference library with RefLib, deduping it and saving as another file
var Dedupe = require('@iebh/dedupe-sweep');
var reflib = require('reflib');
// Read in the library
var refs = await reflib.promises.parseFile('my-large-reference-library.xml');
// Dedupe
var deduper = new Dedupe()
deduper.set('strategy', 'clark')
var dedupedRefs = await deduper.run(refs);
// Save the deduped library
await reflib.promises.outputFile('my-large-reference-library-deduped.xml', dedupedRefs);
The various strategies within this project are tested using the Systematic Reviews Data Sets for Testing Automation Tools by Beller et. al and are available in the test/data directory.
Tests can be run via npm test
or mocha
. See the test directory for more information on specifics.
Testing statistics are based on the methodology from Evaluating automated deduplication tools: protocol by Hair et. al.
Returns a Dedupe class which extends a basic EventEmitter.
Object storing all local settings for the class.
Setting | Type | Default | Description |
---|---|---|---|
strategy | string | 'clark' |
The strategy to use on the next run() |
validateStrategy | boolean | true |
Validate the strategy before beginning, only disable this if you are sure the strategy is valid |
action | string | '0 ' |
The action to take when detecting a duplicate. ENUM: ACTIONS |
actionField | string | 'dedupe' |
The field to use with actions |
threshold | number | 0.1 |
Floating value (between 0 and 1) when marking or deleting refs automatically |
markOk | string / function | 'OK' |
String value to set the action field to when actionField=='mark' and the ref is a non-dupe, if a function it is called as (ref) |
markDupe | string / function | 'DUPE' |
String value to set the action field to when actionField=='mark' and the ref is a dupe, if a function it is called as (ref) |
dupeRef | string | 0 |
How to refer to other refs when actionfield=='stats' . ENUM: DUPEREF |
fieldWeight | number | 0 |
How to calculate duplication score. ENUM: FIELDWEIGHT |
markOriginal | boolean | false |
Whether to mark the original as a duplicate or not |
Actions to take when detecting duplicates
Value | Setting | Description |
---|---|---|
0 |
'STATS' |
Add the field field in Dedupe.settings.actionField with the deduplicate chance to the input |
1 |
'MARK' |
Set the field in Dedupe.settings.actionField to Dedupe.settings.mark{Ok,Dupe} depending on duplicate status but leave input unchanged |
2 |
'DELETE' |
Remove duplicates from input and return sliced output |
How to refer to other references.
Value | Setting | Description |
---|---|---|
0 |
'INDEX' |
Refer to other references by their offset in the input array |
1 |
'RECNUMBER' |
Refer to other references by their recnumber field |
How to refer to other references.
Value | Setting | Description |
---|---|---|
0 |
'MINIMUM' |
Calculate duplication score based on minimum field score |
1 |
'AVERAGE' |
Calculate duplication score based on average field score |
A lookup object of comparison functions used within strategies.
Each comparison is made up of:
Setting | Type | Description |
---|---|---|
key | string | Internal short name of the comparison in camelCase |
title |
string | Human friendly title of the comparison |
description |
string | Longer description of what the comparison does |
handler |
function | Function, called as (a, b) for fields which is expected to return a floating value of duplicate-ness |
A lookup object of field mutators used within strategies.
Each mutator is made up of:
Setting | Type | Description |
---|---|---|
key | string | Internal short name of the mutator in camelCase |
title |
string | Human friendly title of the mutator |
description |
string | Longer description of what the mutator does |
handler |
function | Function, called as (value) which is expected to return the mutated input |
A lookup object of strategies.
Each strategy is made up of:
Setting | Type | Description |
---|---|---|
key | string | Internal short name of the strategy in camelCase |
title |
string | Human friendly title of the strategy |
description |
string | Longer description of the strategy |
mutators |
object | List of fields which will be mutated and how, prior to the strategy being run |
steps |
array | Array of steps to take when running the strategy |
Convenience function to quickly set a single option, or merge an object of options. Returns the original Dedupe instance.
Takes an array of input references applying the action specified in Dedupe.settings.action
.
Returns a promise.
This module includes a selection of deduplication strategies which are basic JavaScript objects which detail steps to take to detect reference duplication.
Each strategy should include a title
, description
, optional mutations
and a collection of steps
to perform.
A simple example of the DOI only strategy:
module.exports = {
title: 'DOI only',
description: 'Compare references against DOI fields only',
mutations: {
doi: 'doiRewrite',
},
steps: [
{
fields: ['doi'],
comparison: 'exact',
},
],
};
Strategy format:
Path | Type | Default | Description |
---|---|---|---|
title |
string |
The short human-readable title of the strategy | |
description |
string |
A longer, HTML compatible description of the strategy | |
mutators |
object |
An object of the reference properties to mutate prior to processing, each value should be a known mutator | |
steps |
array |
A collection of steps for the deduplication process | |
steps.skipOmitted |
boolean |
true |
Skip field comparison where either side is not specified |
steps[].fields |
array |
An array of strings, each value should correspond to a known reference field | |
steps[].comparison |
string |
The comparison method to use in this step, should correspond to a known comparison method |