
avoid creating duplications #122

Open
myrmoteras opened this issue Dec 13, 2024 · 8 comments

Comments

@myrmoteras
Contributor

@gsautter we need to figure out how to avoid creating duplicates when the same article is uploaded. This one has been uploaded twice from exactly the same place, but it nevertheless created duplicates.

Document Name: zlae009.pdf, zlae127.pdf, or zlae121.pdf

@flsimoes

The zoological journal has this annoying trend of re-releasing the same article under a different name and a different hash, which leaves the DOI as the only flag we can use to check for duplicates, but that check only happens after decoding and running the template.

@gsautter

At least going forward, we could try and track the DOIs of the articles we send to GBIF ... however, this would imply we require a DOI for the GBIF export, so it would get in the way of exporting data that comes without a DOI and cannot get one from Zenodo, either. This mainly concerns data coming in as HTML and the like, like the online Flora Helvetica and InfoFlora, or Species Plantarum Volume 1, which we imported from a Project Gutenberg transcript ...
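
To illustrate the DOI-tracking idea, here is a minimal sketch of a registry-based duplicate check. This is not part of the actual GgServer export code; the registry file name and helper names are hypothetical, and the policy for DOI-less data is left open as discussed above.

```python
# Minimal sketch of DOI-based duplicate detection for the GBIF export.
# Hypothetical helper, not the actual exporter code; the registry file
# name "exported_dois.txt" is an assumption for illustration.

from pathlib import Path

REGISTRY = Path("exported_dois.txt")

def load_exported_dois() -> set[str]:
    """Read previously exported DOIs, one per line."""
    if not REGISTRY.exists():
        return set()
    return {line.strip().lower()
            for line in REGISTRY.read_text().splitlines()
            if line.strip()}

def should_export(doi: str | None, exported: set[str]) -> bool:
    """Skip articles whose DOI has already been exported.

    Articles without a DOI cannot be checked this way and would need a
    different rule (the concern raised above for HTML-only sources).
    """
    if not doi:
        return True  # or False, depending on policy for DOI-less data
    return doi.strip().lower() not in exported

def record_export(doi: str) -> None:
    """Append a newly exported DOI to the registry."""
    with REGISTRY.open("a") as fh:
        fh.write(doi.strip().lower() + "\n")
```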

@gsautter

> The zoological journal has this annoying trend of re-releasing the same article under a different name and a different hash, which leaves the DOI as the only flag we can use to check for duplicates, but that check only happens after decoding and running the template.

Any idea as to why they are doing this? Or maybe more like how this happens? Could just as well be a shoddy system, after all ...

@flsimoes

> The zoological journal has this annoying trend of re-releasing the same article under a different name and a different hash, which leaves the DOI as the only flag we can use to check for duplicates, but that check only happens after decoding and running the template.
>
> Any idea as to why they are doing this? Or maybe more like how this happens? Could just as well be a shoddy system, after all ...

They publish the ahead-of-press version and replace it later with the definitive version. Same DOI, but the hash is slightly different with the page numbers and so on.

@gsautter

> The zoological journal has this annoying trend of re-releasing the same article under a different name and a different hash, which leaves the DOI as the only flag we can use to check for duplicates, but that check only happens after decoding and running the template.
>
> Any idea as to why they are doing this? Or maybe more like how this happens? Could just as well be a shoddy system, after all ...
>
> They publish the ahead-of-press version and replace it later with the definitive version. Same DOI, but the hash is slightly different with the page numbers and so on.

This kind of raises the question of whether we could somehow limit our imports to the definitive version, I guess? In case of a manual process, that should be a fairly easy thing to do, but even if we use a scraper there, there might be some indications in the file names we could exploit?

@flsimoes

> They publish the ahead-of-press version and replace it later with the definitive version. Same DOI, but the hash is slightly different with the page numbers and so on.
>
> This kind of raises the question of whether we could somehow limit our imports to the definitive version, I guess? In case of a manual process, that should be a fairly easy thing to do, but even if we use a scraper there, there might be some indications in the file names we could exploit?

There doesn't seem to be an indication in the file name, but we can check.

@myrmoteras
Contributor Author

For me it looks as if I imported the same file twice. So there is nothing pre-release about it, just my mistake of importing the same file. This is indicated by the same file number, i.e. zlae138, zlae130: https://tb.plazi.org/GgServer/srsStats/stats?outputFields=doc.name+doc.articleUuid+bib.source&groupingFields=doc.name+doc.articleUuid+bib.source&orderingFields=-doc.name&FP-doc.name=zlae%25&FP-bib.source=%22Zoolo%25%22&format=HTML
which are the same, but somehow just erroneously imported by me.
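
A minimal sketch of that check, assuming the srsStats results above have been saved to a CSV file; the file name and the column names ("DocName", "DocArticleUuid") are assumptions for illustration, not the actual export format:

```python
# Sketch: flag doc.name values that occur with more than one articleUuid,
# i.e. the same source file imported twice. Assumes the srsStats results
# were saved to a CSV file with columns "DocName" and "DocArticleUuid";
# both the file name and column names are assumptions for illustration.

import csv
from collections import defaultdict

def find_duplicate_doc_names(path: str) -> dict[str, set[str]]:
    by_name: dict[str, set[str]] = defaultdict(set)
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            by_name[row["DocName"]].add(row["DocArticleUuid"])
    # Keep only names that were imported under more than one UUID.
    return {name: uuids for name, uuids in by_name.items() if len(uuids) > 1}

if __name__ == "__main__":
    for name, uuids in sorted(find_duplicate_doc_names("srs_stats.csv").items()):
        print(f"{name}: {len(uuids)} copies -> {', '.join(sorted(uuids))}")
```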
