avoid creating duplications #122
Comments
The zoological journal has an annoying habit of re-releasing the same article under a different name and with a different hash. That leaves the DOI as the only flag we can use to check for duplicates, but that check can only be done after decoding and running the template.
At least going forward, we could try to track the DOIs of the articles we send to GBIF. However, this would imply that we require a DOI for the GBIF export, so it would get in the way of exporting data that comes without a DOI and cannot get one from Zenodo either. This mainly concerns data coming in as HTML and the like, such as the online Flora Helvetica and InfoFlora, or Species Plantarum Volume 1, which we imported from a Project Gutenberg transcript ...
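One way to track DOIs without making them mandatory is to fall back to another key when a document has none. The following is only a rough sketch of that idea: `ExportRegistry`, `should_export`, and `normalize_doi` are hypothetical names, the registry lives in memory, and using the content hash as the fallback key is an assumption, not how the GBIF export currently works.

```python
from typing import Optional

# Common ways a DOI is written; all hypothetical inputs are reduced to the bare DOI.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Lower-case a DOI (DOIs are case-insensitive) and strip a resolver prefix."""
    doi = doi.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


class ExportRegistry:
    """Hypothetical in-memory record of what has already been exported."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_export(self, doi: Optional[str], content_hash: str) -> bool:
        # Assumption: documents without a DOI are keyed by their content hash,
        # so DOI-less sources (HTML imports, transcripts) are not blocked.
        key = f"doi:{normalize_doi(doi)}" if doi else f"hash:{content_hash}"
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With this shape, a re-released article with the same DOI is skipped even if its hash changed, while a DOI-less document is only skipped when the exact same content is seen again.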
Any idea why they are doing this? Or rather, how this happens? It could just as well be a shoddy system, after all ...
They publish the ahead-of-press version and later replace it with the definitive version. Same DOI, but the hash differs slightly because of the page numbers and so on.
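If that is what happens, the combination of DOI and hash would be enough to tell a plain re-upload from a replaced version. A minimal sketch, assuming a hypothetical mapping `known` from DOI to the hash of the version already imported (the function name and return labels are illustrative only):

```python
def classify_upload(known: dict[str, str], doi: str, content_hash: str) -> str:
    """Classify an incoming file against what was imported before.

    known maps a DOI to the hash of the previously imported file (hypothetical store).
    Returns 'new', 'duplicate' (same DOI, same hash), or 'revised' (same DOI, new hash).
    """
    previous_hash = known.get(doi)
    if previous_hash is None:
        return "new"
    if previous_hash == content_hash:
        return "duplicate"
    return "revised"
```

A `revised` result would be the case described above, where the definitive version could replace the ahead-of-press one instead of being added next to it.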
This raises the question of whether we could somehow limit our imports to the definitive version. With a manual process, that should be fairly easy, but even if we use a scraper, there might be some indication in the file names that we could exploit?
There doesn't seem to be an indication in the file name, but we can check.
To me it looks as if I imported the same file twice. So there is no pre-release version involved, just my mistake of importing the same file. This is indicated by the same file number, i.e. zlae138, zlae130: https://tb.plazi.org/GgServer/srsStats/stats?outputFields=doc.name+doc.articleUuid+bib.source&groupingFields=doc.name+doc.articleUuid+bib.source&orderingFields=-doc.name&FP-doc.name=zlae%25&FP-bib.source=%22Zoolo%25%22&format=HTML
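A check like the one behind that statistics query could be automated by grouping rows on the document name and flagging names that map to more than one article UUID. This is only a sketch: it assumes the rows have already been fetched as `(doc.name, doc.articleUuid)` pairs, the function name is hypothetical, and the UUIDs in the usage example are placeholders, not real records.

```python
from collections import defaultdict
from typing import Iterable


def find_duplicate_names(rows: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Return document names that occur with more than one article UUID.

    rows: (document name, article UUID) pairs, e.g. parsed from a stats export.
    """
    uuids_by_name: defaultdict[str, set[str]] = defaultdict(set)
    for name, article_uuid in rows:
        # Compare names case-insensitively and ignore stray whitespace.
        uuids_by_name[name.strip().lower()].add(article_uuid)
    return {
        name: sorted(uuids)
        for name, uuids in uuids_by_name.items()
        if len(uuids) > 1
    }


# Placeholder data for illustration only.
rows = [
    ("zlae138.pdf", "UUID-A"),
    ("zlae138.pdf", "UUID-B"),
    ("zlae130.pdf", "UUID-C"),
]
duplicates = find_duplicate_names(rows)
```

Here `duplicates` contains only `zlae138.pdf`, because it is the only name attached to two different UUIDs.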
@gsautter we need to figure out how uploading the same article ends up creating duplicates. This one was uploaded twice from exactly the same place, but it nevertheless created duplicates.
Document name: zlae009.pdf, zlae127.pdf, or zlae121.pdf