Catching URL nonsense in a DOI #7
I've met this one too! doi_isbn_pubmed_and_arxiv.enwiki-20161201.tsv.bz2 also contains some:
Or:
And even:
That "/download" could be a perfectly valid part of a DOI. At this point, we'll be implementing heuristics to know when to stop processing a DOI. It seems we could have a list of known suffix patterns that we should strip -- like extensions, "/download", "/getPDF", etc. That would mean DOIs that actually had that as part of the DOI would be broken, but in the end, I expect this will be more useful. |
Are you sure? This is not how I read https://www.doi.org/doi_handbook/2_Numbering.html#2.5:
AFAICT, everything starting with the second / in a matched string should be dropped.
According to https://www.doi.org/doi_handbook/2_Numbering.html#2.5, they are the only reserved characters. Cf. mediawiki-utilities#7.
Oh good. I hadn't caught that in the spec.
Ah, I indeed misread the sentence, which is preceded by:
Of course a DOI like 10./1234/abcdef or 10..1234/abcdef would be invalid. In the suffix, the dot is frequently used, but I've yet to find any slash. Sadly, there's also stuff like:
Which I think the current regex doesn't match.
I think I found one now: 10.1093/jac/dkh029 (a slash in the suffix).
Noise identification
So these are the suffixes in the dataset:
There's also a need to URL-decode and HTML-unescape some DOIs.
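A minimal sketch, with made-up encoded DOIs standing in for the real examples:

```python
import html
from urllib.parse import unquote

def normalize_doi(raw):
    """Undo URL percent-encoding, then HTML entity escaping."""
    return html.unescape(unquote(raw))

# Hypothetical examples of the kinds of encoding meant above:
print(normalize_doi("10.1000%2F182"))    # -> 10.1000/182
print(normalize_doi("10.1000/a&amp;b"))  # -> 10.1000/a&b
```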
The only legit DOIs containing a "&" are:
As found by a search.
We can live with a few odd cases like:
There are then various extraneous unopened
All in all
I'm running an extraction on the latest dumps, and to get a clean list of DOIs I will run the output through these sed commands, with regexes which IMHO can easily be incorporated in mwcites:
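As a hedged sketch of that kind of cleanup -- these patterns are illustrative assumptions, not the exact sed regexes referred to here:

```python
import re

# Illustrative cleanup passes; the actual sed regexes may differ.
CLEANUP_PASSES = [
    re.compile(r'(/download|/getPDF)$', re.IGNORECASE),  # junk suffixes seen above
    re.compile(r'\.(pdf|html?)$', re.IGNORECASE),        # stray file extensions
    re.compile(r'[.,;:]+$'),                             # trailing wikitext punctuation
]

def clean(doi):
    """Apply all passes until the string stops changing."""
    prev = None
    while doi != prev:
        prev = doi
        for pattern in CLEANUP_PASSES:
            doi = pattern.sub('', doi)
    return doi

print(clean("10.1234/abcd/download."))  # -> 10.1234/abcd
```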
Now:
The cleanup reduces the latest dump extraction from 777452 to 765499 DOIs, which is a whopping 1.53745% error correction. ;-)
Thanks for this thorough analysis. Just finished reading through it.
Not sure if it's caught in the above issue or is a separate thing, but we just ran into an issue with Google Maps URLs being caught as DOIs, because they look similar.
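One possible guard -- an assumption for illustration, not something from this thread's code: check the text surrounding a match before accepting it as a DOI.

```python
# Hosts whose URLs produce DOI-lookalike strings (e.g. coordinates
# beginning with "10."). This list is a guess for illustration.
NON_DOI_CONTEXTS = ("google.com/maps", "goo.gl/maps")

def in_non_doi_url(surrounding_text):
    """Reject DOI-like matches found inside known non-DOI URLs."""
    return any(marker in surrounding_text for marker in NON_DOI_CONTEXTS)

print(in_non_doi_url("https://www.google.com/maps/@10.3039,123.89,17z"))  # -> True
```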
Were those regexes incorporated in the latest release https://figshare.com/articles/Wikipedia_Scholarly_Article_Citations/1299540 or should I run them myself after downloading it? |
Self-answer: there are still all sorts of spurious DOIs (the whole list is attached). After applying my regexes above, the list goes from 1100422 to 1067405 lines.
We should stop processing a DOI when we see an important URL character -- e.g. "?", "&" or "#".
Still, there are some DOIs that legitimately contain these characters, e.g.
10.1002/(SICI)1097-0142(19960401)77:7<1356::AID-CNCR20>3.0.CO;2-#
But most of the time it's because we're processing a messy URL.
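A minimal sketch of that truncation rule, with the tradeoff noted above:

```python
import re

URL_CHARS = re.compile(r'[?&#]')

def truncate_at_url_chars(candidate):
    """Cut a candidate DOI at the first '?', '&' or '#'.

    Tradeoff from the discussion above: rare legitimate DOIs that end
    in '#' (like the SICI example) lose their tail, but most of these
    characters come from messy URLs.
    """
    return URL_CHARS.split(candidate, maxsplit=1)[0]

print(truncate_at_url_chars("10.1234/abcd?seq=1#page_scan"))  # -> 10.1234/abcd
```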