
avoid creating duplications #122

Open
myrmoteras opened this issue Dec 13, 2024 · 8 comments

Comments

@myrmoteras
Contributor

@gsautter we need to figure out how to avoid creating duplicates when the same article is uploaded. This one has been uploaded twice from exactly the same place, but it nevertheless created duplicates.

Document Name: zlae009.pdf, zlae127.pdf, or zlae121.pdf

@flsimoes

The zoological journal has this annoying trend of re-releasing the same article under a different name and a different hash, which leaves the DOI as the only flag we can use to check for duplicates, but that check only happens after decoding and running the template.

@gsautter

At least going forward, we could try and track the DOIs of the articles we send to GBIF ... however, this would imply we require a DOI for the GBIF export, so it would get in the way of exporting data that comes without a DOI and cannot get one from Zenodo, either. This mainly concerns data coming in as HTML and the like, like the online Flora Helvetica and InfoFlora, or Species Plantarum Volume 1, which we imported from a Project Gutenberg transcript ...
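
To illustrate the DOI-tracking idea, here is a minimal sketch of a registry-based duplicate check. This is not part of the actual GgServer export code; the registry file name and helper names are hypothetical, and the policy for DOI-less data is left open as discussed above.

```python
# Minimal sketch of DOI-based duplicate detection for the GBIF export.
# Hypothetical helper, not the actual exporter code; the registry file
# name "exported_dois.txt" is an assumption for illustration.

from pathlib import Path

REGISTRY = Path("exported_dois.txt")

def load_exported_dois() -> set[str]:
    """Read previously exported DOIs, one per line."""
    if not REGISTRY.exists():
        return set()
    return {line.strip().lower()
            for line in REGISTRY.read_text().splitlines()
            if line.strip()}

def should_export(doi: str | None, exported: set[str]) -> bool:
    """Skip articles whose DOI has already been exported.

    Articles without a DOI cannot be checked this way and would need a
    different rule (the concern raised above for HTML-only sources).
    """
    if not doi:
        return True  # or False, depending on policy for DOI-less data
    return doi.strip().lower() not in exported

def record_export(doi: str) -> None:
    """Append a newly exported DOI to the registry."""
    with REGISTRY.open("a") as fh:
        fh.write(doi.strip().lower() + "\n")
```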

@gsautter

> The zoological journal has this annoying trend of re-releasing the same article under a different name and a different hash, which leaves the DOI as the only flag we can use to check for duplicates, but that check only happens after decoding and running the template.

Any idea as to why they are doing this? Or maybe more like how this happens? Could just as well be a shoddy system, after all ...

@flsimoes

> The zoological journal has this annoying trend of re-releasing the same article under a different name and a different hash, which leaves the DOI as the only flag we can use to check for duplicates, but that check only happens after decoding and running the template.
>
> Any idea as to why they are doing this? Or maybe more like how this happens? Could just as well be a shoddy system, after all ...

They publish the ahead-of-press version and replace it later with the definitive version. Same DOI, but the hash is slightly different with the page numbers and so on.

@gsautter

> The zoological journal has this annoying trend of re-releasing the same article under a different name and a different hash, which leaves the DOI as the only flag we can use to check for duplicates, but that check only happens after decoding and running the template.
>
> Any idea as to why they are doing this? Or maybe more like how this happens? Could just as well be a shoddy system, after all ...
>
> They publish the ahead-of-press version and replace it later with the definitive version. Same DOI, but the hash is slightly different with the page numbers and so on.

This kind of raises the question of whether we could somehow limit our imports to the definitive version, I guess? In case of a manual process, that should be a fairly easy thing to do, but even if we use a scraper there, there might be some indications in the file names we could exploit?

@flsimoes

> They publish the ahead-of-press version and replace it later with the definitive version. Same DOI, but the hash is slightly different with the page numbers and so on.
>
> This kind of raises the question of whether we could somehow limit our imports to the definitive version, I guess? In case of a manual process, that should be a fairly easy thing to do, but even if we use a scraper there, there might be some indications in the file names we could exploit?

There doesn't seem to be an indication in the file name, but we can check.

@myrmoteras
Contributor Author

For me it looks as if I imported the same file twice. So there is nothing pre-release about it, just my mistake of importing the same file. This is indicated by the same file number, i.e. zlae138, zlae130: https://tb.plazi.org/GgServer/srsStats/stats?outputFields=doc.name+doc.articleUuid+bib.source&groupingFields=doc.name+doc.articleUuid+bib.source&orderingFields=-doc.name&FP-doc.name=zlae%25&FP-bib.source=%22Zoolo%25%22&format=HTML
which are the same, but somehow just erroneously imported by me.
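
A minimal sketch of that check, assuming the srsStats results above have been saved to a CSV file; the file name and the column names ("DocName", "DocArticleUuid") are assumptions for illustration, not the actual export format:

```python
# Sketch: flag doc.name values that occur with more than one articleUuid,
# i.e. the same source file imported twice. Assumes the srsStats results
# were saved to a CSV file with columns "DocName" and "DocArticleUuid";
# both the file name and column names are assumptions for illustration.

import csv
from collections import defaultdict

def find_duplicate_doc_names(path: str) -> dict[str, set[str]]:
    by_name: dict[str, set[str]] = defaultdict(set)
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            by_name[row["DocName"]].add(row["DocArticleUuid"])
    # Keep only names that were imported under more than one UUID.
    return {name: uuids for name, uuids in by_name.items() if len(uuids) > 1}

if __name__ == "__main__":
    for name, uuids in sorted(find_duplicate_doc_names("srs_stats.csv").items()):
        print(f"{name}: {len(uuids)} copies -> {', '.join(sorted(uuids))}")
```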
