Catching URL nonsense in a DOI #7

Open
halfak opened this issue Oct 29, 2015 · 12 comments

Comments

halfak commented Oct 29, 2015

We should stop processing a DOI when we see an important URL character -- e.g. "?", "&" or "#".

Still, there are some DOIs that have these characters in them, e.g. 10.1002/(SICI)1097-0142(19960401)77:7<1356::AID-CNCR20>3.0.CO;2-#

But most of the time it's because we're processing a messy URL.
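
For illustration, a minimal sketch of that truncation (the helper name is made up, not part of mwcites); as noted above, it would break the rare legitimate DOI ending in "#":

import re

# Hypothetical helper, not part of mwcites: cut a candidate DOI at the first
# "?", "&" or "#", assuming those usually start URL query/fragment noise.
def truncate_at_url_chars(candidate):
    return re.split(r'[?&#]', candidate, maxsplit=1)[0]

print(truncate_at_url_chars("10.1371/journal.pone.0012345?foo=bar#sec1"))
# -> 10.1371/journal.pone.0012345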

nemobis commented Jul 6, 2017

10.1002/(SICI)1097-0142(19960401)77:7<1356::AID-CNCR20>3.0.CO;2-#

I've met this one too!

doi_isbn_pubmed_and_arxiv.enwiki-20161201.tsv.bz2 also contains some /download suffixes, such as:

10.18352/bmgn-lchr.7417/galley/7464/download
10.4038/tapro.v5i1.5654/galley/4523/download

Or /getPDF:

10.00024/bdef:TuftsPDF/getPDF
10.00025/bdef:TuftsPDF/getPDF

And even:

10.1002/14356007.a12_495/abstract;jsessionid=EFC500556A6060AC9BEC57789816DC84.f01t01

halfak commented Jul 6, 2017

That "/download" could be a perfectly valid part of a DOI. At this point, we'll be implementing heuristics to know when to stop processing a DOI. It seems we could have a list of known suffix patterns that we should strip -- like extensions, "/download", "/getPDF", etc. That would mean DOIs that actually had that as part of the DOI would be broken, but in the end, I expect this will be more useful.

nemobis commented Jul 7, 2017

That "/download" could be a perfectly valid part of a DOI.

Are you sure? This is not how I read https://www.doi.org/doi_handbook/2_Numbering.html#2.5 :

Handle syntax imposes two constraints on the prefix — both slash and dot are "reserved characters", with the slash separating the prefix from the suffix and the dot used to extend sub prefixes.

AFAICT, everything starting with the second / in a matched string should be dropped.

nemobis added a commit to nemobis/python-mwcites that referenced this issue Jul 7, 2017

halfak commented Jul 7, 2017

Oh good. I hadn't caught that in the spec.

nemobis commented Jul 7, 2017

Ah, I indeed misread the sentence, which is preceded by

Neither the Handle System nor DOI system policies, nor any web use currently imaginable, impose any constraints on the suffix, outside of encoding (see below).

Of course a DOI like 10./1234/abcdef or 10..1234/abcdef would be invalid. In the suffix, the dot is frequently used, but I've yet to find any slash. Sadly there's also stuff like:

10.1671/0272-4634(2002)022[0564:EAEFTC]2.0.CO;2

Which I think the current regex doesn't match.
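
For what it's worth, a simplified pattern (an assumption for illustration, not the actual mwcites regex) whose suffix character class admits square brackets would pick it up:

import re

# Assumed, simplified pattern -- not the real mwcites regex.  The point is
# only that the suffix class has to allow "[", "]" and ";" for such DOIs.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

m = DOI_RE.search("doi=10.1671/0272-4634(2002)022[0564:EAEFTC]2.0.CO;2 ...")
print(m.group(0))
# -> 10.1671/0272-4634(2002)022[0564:EAEFTC]2.0.CO;2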

nemobis commented Jul 8, 2017

I think I found some now (legitimate DOIs with a slash in the suffix):

10.1093/jac/dkh029
10.1088/0953-2048/20/8/L03
10.1093/jac/39.3.393

nemobis commented Aug 12, 2017

Noise identification

So these are the suffixes in the dataset:

grep -Eo "/[a-z]+$" doi.enwiki-20161201.txt | sort | uniq -c | sort -nr
   6523 /abstract
   1674 /full
   1243 /pdf
    505 /issues
    416 /currentissue
    216 /epdf
    114 /issuetoc
     90 /summary
     76 /meta
     32 /pdb
     17 /a
      9 /references
      7 /suppinfo
      5 /otherversions
      5 /citedby
      5 /b
      4 /c
      3 /standard
      3 /j
      2 /e
      2 /download
      2 /deaths
      2 /d
      2 /core
      1 /wu
      1 /wright
      1 /towne
      1 /topics
      1 /sys
      1 /stadaf
      1 /soeknr
      1 /science
      1 /sce
      1 /s
      1 /rstl
      1 /rspb
      1 /rra
      1 /rob
      1 /ref
      1 /rcm
      1 /ppi
      1 /polb
      1 /pletnik
      1 /panetti
      1 /p
      1 /nsm
      1 /metrics
      1 /masai
      1 /marks
      1 /lt
      1 /lrshef
      1 /lo
      1 /komatsu
      1 /kim
      1 /kier
      1 /journal
      1 /job
      1 /jid
      1 /jacsm
      1 /itj
      1 /isom
      1 /ijhit
      1 /ic
      1 /hrdq
      1 /home
      1 /gt
      1 /goldbook
      1 /gm
      1 /g
      1 /fsu
      1 /fneng
      1 /figures
      1 /erg
      1 /enu
      1 /enhanced
      1 /earlyview
      1 /djlit
      1 /dev
      1 /dcsupplemental
      1 /dawson
      1 /cst
      1 /comments
      1 /cne
      1 /cleaver
      1 /chemse
      1 /bjmcs
      1 /beej
      1 /bay
      1 /azl
      1 /articledoi
      1 /armulik
      1 /albers
      1 /ai
      1 /acref
      1 /abstrac
      1 /abstact

There's also a need to URL-decode and HTML-unescape some DOIs like

10.1002/(SICI)1096-8644(199602)99:2&lt;345::AID-AJPA9&gt;3.0.CO;2-X
10.1644/1545-1542(2000)081&lt;1025:PROPGG&gt;2.0.CO;2
10.1666/0094-8373(2000)026&lt;0450:FPINDI&gt;2.0.CO;2
10.1093/acref/9780199666317.001.0001/acref-9780199666317-e-4513&gt;
10.1666/0094-8373%282004%29030%3C0203:PODITE%3E2.0.CO&rfr_id=info:sid/libx&rft.genre=article
10.1111/j.1558&ndash;5646.2007.00179.x
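
A minimal sketch of that decoding step (standard library only; the helper name is made up, not from mwcites):

import html
from urllib.parse import unquote

# Undo HTML entities (&lt; &gt; &ndash; ...) and percent-encoding (%28 %3C ...)
# before any further cleanup.
def decode_doi(raw):
    return unquote(html.unescape(raw))

print(decode_doi("10.1644/1545-1542(2000)081&lt;1025:PROPGG&gt;2.0.CO;2"))
# -> 10.1644/1545-1542(2000)081<1025:PROPGG>2.0.CO;2
print(decode_doi("10.1666/0094-8373%282004%29030%3C0203:PODITE%3E2.0.CO"))
# -> 10.1666/0094-8373(2004)030<0203:PODITE>2.0.CO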

The only legit DOIs containing a `&' are:

10.1075/p&c.16.2.07kel
10.11588/ai.2006.1&2.11114
10.1207/s15324834basp2602&3_7
10.1207/s1532690xci1002&3_2
10.1207/s15326985ep2603&4_6
10.1207/s15327043hup1102&3_3
10.1207/s15327051hci0603&4_6
10.1207/s15327663jcp1401&2_19
10.1207/s15327698jfc0403&4_5
10.1207/S15327728JMME1602&3_4
10.1207/S15327965PLI1403&4_17
10.1207/s15327965pli1403&4_21
10.1207/S15327965PLI1403&4_9
10.1207/s15427439tc1202&3_6
10.1207/s15473341lld0103&4_2
10.2495/D&NE-V4-N2-154-169
10.2495/D&NE-V4-N2-97-104
10.2495/D&N-V2-N4-319-327

As found by the search

grep '&' doi.enwiki-20161201.txt | grep -vE '&(pgs|magic|cookie|prog|title|volume|spage|issn|date|issue|search|ct|term|representation|uid|image|ttl|rft|return|item|bypass|vmode|utm|typ|tab|hl|er|code).+'

which excludes the most common "suffixes" according to

grep -Eo '&[a-z]+' doi.enwiki-20161201.txt | sort | uniq -c | sort -nr

We can live with a few odd cases like:

10.1023/A:1012776919384&token2=exp=1445542903~acl=/static/pdf/3/art%253A10.1023%252FA%253A10127769</nowiki>
10.1023/A:1017572119543&token2=exp=1444092499~acl=/static/pdf/341/art%253A10.1023%252FA%253A1017
10.1666/0094-8373%282004%29030%3C0203:PODITE%3E2.0.CO&rfr_id=info:sid/libx&rft.genre=article
10.7326/0003-4819-158-3-201302050-00003&an_fo_ed

There are also a few hundred DOIs ending in an extraneous unmatched ), while extraneous brackets are only found as part of external links.

All in all

I'm running an extraction on the latest dumps, and to get a clean list of DOIs I will run the output through these sed commands, whose regexes IMHO can easily be incorporated in mwcites:

# Strip known URL path suffixes and anything after them
s,/(getPDF|currentissue|issue|abstract|summary|pdf|asset|full|homepage|otherversions|epdf|issuetoc|meta|pdb|references|suppinfo|citedby|standard|download|editorial-board|earlyview|aims|page|file).*$,,g
# Strip trailing punctuation, HTML entities and leftover wiki/HTML markup
s,(/|"|;|;jsessionid=.+|"/?>.*|&lt;|\.|&amp|]]|:--&gt|</nowiki>|&gt;|&nbsp;|'*\.?\)?\[http.+|</small>)$,,g
# Strip trailing URL query parameters such as &rfr_id=...
s,&([a-z]+=.*)$,,g
# Drop an unmatched trailing ")" when the string contains no "("
s,^([^(]+)\.?\)$,\1,g

Now the output of

cut -f5,6 citations.tsv | grep ^doi | cut -f2 | sed --regexp-extended -f doiclean.sed | sort -u

(which takes less than 3 seconds) is looking remarkably clean, though there are still odd mistakes like

10.1002/dac.1162,2010
10.1007/BF00558453.pdf
10.1007/BF01414807.org
10.1007/BF01761146http://www.churchomania.com/church/551912538158670/Gestalt+Pastoral+Care
10.1007/BF<sub>00660068</sub>
10.1007/s00228-008-0554-y.pdf
10.1007/s00381-013-2168-7</small>
10.1007/s10397-007<E2><80><93>0338-x
10.1007/s10530-010-9859-8.''
10.1007/s10531-004-5020-2>
10.1007/s10686-011-9275-9.(open
10.1016/S0140-6736(17)31492-7showArticle


nemobis commented Aug 12, 2017

The cleanup reduces the latest dump extraction from 777452 to 765499 DOIs, which is a whopping 1.53745 % error correction. ;-)

halfak commented Aug 18, 2017

Thanks for this thorough analysis. Just finished reading through it.

@Samwalton9

Not sure if it's covered by the above or is a separate thing, but we just ran into an issue with Google Maps URLs being caught as DOIs, because they look similar.
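
One possible guard, purely a sketch with assumed names (not something mwcites does today): look at the text just before a candidate match and reject it if it sits inside a maps URL.

import re

# Hypothetical guard, not part of mwcites.  Google Maps URLs carry numeric
# path segments (e.g. "/@10.762622,106.660172,17z/") that a loose DOI pattern
# can mistake for a "10.xxxx" prefix, so reject matches preceded by a maps URL.
MAPS_URL = re.compile(r'google\.[a-z.]+/maps', re.IGNORECASE)

def looks_like_maps_noise(text, match_start, window=80):
    context = text[max(0, match_start - window):match_start]
    return bool(MAPS_URL.search(context))

text = "https://www.google.com/maps/@10.762622,106.660172,17z/data=..."
match = re.search(r'10\.\d{4,9}', text)
print(looks_like_maps_noise(text, match.start()))
# -> True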

nemobis commented Jun 8, 2018

Were those regexes incorporated in the latest release https://figshare.com/articles/Wikipedia_Scholarly_Article_Citations/1299540 or should I run them myself after downloading it?

nemobis commented Jun 9, 2018

Self-answer: there are still all sorts of spurious DOIs (whole list attached).
2018-03-23_dois.txt.gz

After applying my regexes above, the list goes from 1100422 to 1067405 lines.
2018-03-23_dois_cleaned.txt.gz
