Catching URL nonsense in a DOI #7

Open
halfak opened this issue Oct 29, 2015 · 12 comments

Comments

halfak commented Oct 29, 2015

We should stop processing a DOI when we see an important URL character -- e.g. "?", "&" or "#".

Still, there are some DOIs that have these characters in them, e.g. 10.1002/(SICI)1097-0142(19960401)77:7<1356::AID-CNCR20>3.0.CO;2-#

But most of the time it's because we're processing a messy URL.
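
For illustration, a minimal sketch of that truncation (the helper name is made up, not part of mwcites); as noted above, it would break the rare legitimate DOI ending in "#":

import re

# Hypothetical helper, not part of mwcites: cut a candidate DOI at the first
# "?", "&" or "#", assuming those usually start URL query/fragment noise.
def truncate_at_url_chars(candidate):
    return re.split(r'[?&#]', candidate, maxsplit=1)[0]

print(truncate_at_url_chars("10.1371/journal.pone.0012345?foo=bar#sec1"))
# -> 10.1371/journal.pone.0012345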

nemobis commented Jul 6, 2017

10.1002/(SICI)1097-0142(19960401)77:7<1356::AID-CNCR20>3.0.CO;2-#

I've met this one too!

doi_isbn_pubmed_and_arxiv.enwiki-20161201.tsv.bz2 also contains some /download suffixes, such as:

10.18352/bmgn-lchr.7417/galley/7464/download
10.4038/tapro.v5i1.5654/galley/4523/download

Or /getPDF:

10.00024/bdef:TuftsPDF/getPDF
10.00025/bdef:TuftsPDF/getPDF

And even:

10.1002/14356007.a12_495/abstract;jsessionid=EFC500556A6060AC9BEC57789816DC84.f01t01

halfak commented Jul 6, 2017

That "/download" could be a perfectly valid part of a DOI. At this point, we'll be implementing heuristics to know when to stop processing a DOI. It seems we could have a list of known suffix patterns that we should strip -- like extensions, "/download", "/getPDF", etc. That would mean DOIs that actually had that as part of the DOI would be broken, but in the end, I expect this will be more useful.

nemobis commented Jul 7, 2017

That "/download" could be a perfectly valid part of a DOI.

Are you sure? This is not how I read https://www.doi.org/doi_handbook/2_Numbering.html#2.5 :

Handle syntax imposes two constraints on the prefix — both slash and dot are "reserved characters", with the slash separating the prefix from the suffix and the dot used to extend sub prefixes.

AFAICT, everything starting with the second / in a matched string should be dropped.

nemobis added a commit to nemobis/python-mwcites that referenced this issue Jul 7, 2017

halfak commented Jul 7, 2017

Oh good. I hadn't caught that in the spec.

nemobis commented Jul 7, 2017

Ah, I indeed misread the sentence, which is preceded by

Neither the Handle System nor DOI system policies, nor any web use currently imaginable, impose any constraints on the suffix, outside of encoding (see below).

Of course a DOI like 10./1234/abcdef or 10..1234/abcdef would be invalid. In the suffix, the dot is frequently used, but I've yet to find any slash. Sadly there's also stuff like:

10.1671/0272-4634(2002)022[0564:EAEFTC]2.0.CO;2

Which I think the current regex doesn't match.
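
For what it's worth, a simplified pattern (an assumption for illustration, not the actual mwcites regex) whose suffix character class admits square brackets would pick it up:

import re

# Assumed, simplified pattern -- not the real mwcites regex.  The point is
# only that the suffix class has to allow "[", "]" and ";" for such DOIs.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

m = DOI_RE.search("doi=10.1671/0272-4634(2002)022[0564:EAEFTC]2.0.CO;2 ...")
print(m.group(0))
# -> 10.1671/0272-4634(2002)022[0564:EAEFTC]2.0.CO;2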

nemobis commented Jul 8, 2017

I think I found some now (legitimate DOIs with a slash in the suffix):

10.1093/jac/dkh029
10.1088/0953-2048/20/8/L03
10.1093/jac/39.3.393

nemobis commented Aug 12, 2017

Noise identification

So these are the suffixes in the dataset:

grep -Eo "/[a-z]+$" doi.enwiki-20161201.txt | sort | uniq -c | sort -nr
   6523 /abstract
   1674 /full
   1243 /pdf
    505 /issues
    416 /currentissue
    216 /epdf
    114 /issuetoc
     90 /summary
     76 /meta
     32 /pdb
     17 /a
      9 /references
      7 /suppinfo
      5 /otherversions
      5 /citedby
      5 /b
      4 /c
      3 /standard
      3 /j
      2 /e
      2 /download
      2 /deaths
      2 /d
      2 /core
      1 /wu
      1 /wright
      1 /towne
      1 /topics
      1 /sys
      1 /stadaf
      1 /soeknr
      1 /science
      1 /sce
      1 /s
      1 /rstl
      1 /rspb
      1 /rra
      1 /rob
      1 /ref
      1 /rcm
      1 /ppi
      1 /polb
      1 /pletnik
      1 /panetti
      1 /p
      1 /nsm
      1 /metrics
      1 /masai
      1 /marks
      1 /lt
      1 /lrshef
      1 /lo
      1 /komatsu
      1 /kim
      1 /kier
      1 /journal
      1 /job
      1 /jid
      1 /jacsm
      1 /itj
      1 /isom
      1 /ijhit
      1 /ic
      1 /hrdq
      1 /home
      1 /gt
      1 /goldbook
      1 /gm
      1 /g
      1 /fsu
      1 /fneng
      1 /figures
      1 /erg
      1 /enu
      1 /enhanced
      1 /earlyview
      1 /djlit
      1 /dev
      1 /dcsupplemental
      1 /dawson
      1 /cst
      1 /comments
      1 /cne
      1 /cleaver
      1 /chemse
      1 /bjmcs
      1 /beej
      1 /bay
      1 /azl
      1 /articledoi
      1 /armulik
      1 /albers
      1 /ai
      1 /acref
      1 /abstrac
      1 /abstact

There's also a need to URL-decode and HTML-unescape some DOIs like

10.1002/(SICI)1096-8644(199602)99:2&lt;345::AID-AJPA9&gt;3.0.CO;2-X
10.1644/1545-1542(2000)081&lt;1025:PROPGG&gt;2.0.CO;2
10.1666/0094-8373(2000)026&lt;0450:FPINDI&gt;2.0.CO;2
10.1093/acref/9780199666317.001.0001/acref-9780199666317-e-4513&gt;
10.1666/0094-8373%282004%29030%3C0203:PODITE%3E2.0.CO&rfr_id=info:sid/libx&rft.genre=article
10.1111/j.1558&ndash;5646.2007.00179.x
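
A minimal sketch of that decoding step (standard library only; the helper name is made up, not from mwcites):

import html
from urllib.parse import unquote

# Undo HTML entities (&lt; &gt; &ndash; ...) and percent-encoding (%28 %3C ...)
# before any further cleanup.
def decode_doi(raw):
    return unquote(html.unescape(raw))

print(decode_doi("10.1644/1545-1542(2000)081&lt;1025:PROPGG&gt;2.0.CO;2"))
# -> 10.1644/1545-1542(2000)081<1025:PROPGG>2.0.CO;2
print(decode_doi("10.1666/0094-8373%282004%29030%3C0203:PODITE%3E2.0.CO"))
# -> 10.1666/0094-8373(2004)030<0203:PODITE>2.0.CO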

The only legit DOIs containing a `&' are:

10.1075/p&c.16.2.07kel
10.11588/ai.2006.1&2.11114
10.1207/s15324834basp2602&3_7
10.1207/s1532690xci1002&3_2
10.1207/s15326985ep2603&4_6
10.1207/s15327043hup1102&3_3
10.1207/s15327051hci0603&4_6
10.1207/s15327663jcp1401&2_19
10.1207/s15327698jfc0403&4_5
10.1207/S15327728JMME1602&3_4
10.1207/S15327965PLI1403&4_17
10.1207/s15327965pli1403&4_21
10.1207/S15327965PLI1403&4_9
10.1207/s15427439tc1202&3_6
10.1207/s15473341lld0103&4_2
10.2495/D&NE-V4-N2-154-169
10.2495/D&NE-V4-N2-97-104
10.2495/D&N-V2-N4-319-327

As found by the search

grep '&' doi.enwiki-20161201.txt | grep -vE '&(pgs|magic|cookie|prog|title|volume|spage|issn|date|issue|search|ct|term|representation|uid|image|ttl|rft|return|item|bypass|vmode|utm|typ|tab|hl|er|code).+'

which excludes the most common "suffixes" according to

grep -Eo '&[a-z]+' doi.enwiki-20161201.txt | sort | uniq -c | sort -nr

We can live with a few odd cases like:

10.1023/A:1012776919384&token2=exp=1445542903~acl=/static/pdf/3/art%253A10.1023%252FA%253A10127769</nowiki>
10.1023/A:1017572119543&token2=exp=1444092499~acl=/static/pdf/341/art%253A10.1023%252FA%253A1017
10.1666/0094-8373%282004%29030%3C0203:PODITE%3E2.0.CO&rfr_id=info:sid/libx&rft.genre=article
10.7326/0003-4819-158-3-201302050-00003&an_fo_ed

There are also a few hundred DOIs ending in an extraneous unmatched ), while extraneous brackets are only found as part of external links.

All in all

I'm running an extraction on the latest dumps, and to get a clean list of DOIs I will run the output through these sed commands, whose regexes IMHO can easily be incorporated in mwcites:

# Strip known URL path suffixes and anything after them
s,/(getPDF|currentissue|issue|abstract|summary|pdf|asset|full|homepage|otherversions|epdf|issuetoc|meta|pdb|references|suppinfo|citedby|standard|download|editorial-board|earlyview|aims|page|file).*$,,g
# Strip trailing punctuation, HTML entities and leftover wiki/HTML markup
s,(/|"|;|;jsessionid=.+|"/?>.*|&lt;|\.|&amp|]]|:--&gt|</nowiki>|&gt;|&nbsp;|'*\.?\)?\[http.+|</small>)$,,g
# Strip trailing URL query parameters such as &rfr_id=...
s,&([a-z]+=.*)$,,g
# Drop an unmatched trailing ")" when the string contains no "("
s,^([^(]+)\.?\)$,\1,g

Now the output of

cut -f5,6 citations.tsv | grep ^doi | cut -f2 | sed --regexp-extended -f doiclean.sed | sort -u

(which takes less than 3 seconds) is looking remarkably clean, though there are still odd mistakes like

10.1002/dac.1162,2010
10.1007/BF00558453.pdf
10.1007/BF01414807.org
10.1007/BF01761146http://www.churchomania.com/church/551912538158670/Gestalt+Pastoral+Care
10.1007/BF<sub>00660068</sub>
10.1007/s00228-008-0554-y.pdf
10.1007/s00381-013-2168-7</small>
10.1007/s10397-007<E2><80><93>0338-x
10.1007/s10530-010-9859-8.''
10.1007/s10531-004-5020-2>
10.1007/s10686-011-9275-9.(open
10.1016/S0140-6736(17)31492-7showArticle


nemobis commented Aug 12, 2017

The cleanup reduces the latest dump extraction from 777452 to 765499 DOIs, which is a whopping 1.53745 % error correction. ;-)

halfak commented Aug 18, 2017

Thanks for this thorough analysis. Just finished reading through it.

@Samwalton9

Not sure if it's covered by the above or is a separate thing, but we just ran into an issue with Google Maps URLs being caught as DOIs, because they look similar.
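
One possible guard, purely a sketch with assumed names (not something mwcites does today): look at the text just before a candidate match and reject it if it sits inside a maps URL.

import re

# Hypothetical guard, not part of mwcites.  Google Maps URLs carry numeric
# path segments (e.g. "/@10.762622,106.660172,17z/") that a loose DOI pattern
# can mistake for a "10.xxxx" prefix, so reject matches preceded by a maps URL.
MAPS_URL = re.compile(r'google\.[a-z.]+/maps', re.IGNORECASE)

def looks_like_maps_noise(text, match_start, window=80):
    context = text[max(0, match_start - window):match_start]
    return bool(MAPS_URL.search(context))

text = "https://www.google.com/maps/@10.762622,106.660172,17z/data=..."
match = re.search(r'10\.\d{4,9}', text)
print(looks_like_maps_noise(text, match.start()))
# -> True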

nemobis commented Jun 8, 2018

Were those regexes incorporated in the latest release https://figshare.com/articles/Wikipedia_Scholarly_Article_Citations/1299540 or should I run them myself after downloading it?

nemobis commented Jun 9, 2018

Self-answer: there are still all sorts of spurious DOIs (whole list attached).
2018-03-23_dois.txt.gz

After applying my regexes above, the list goes from 1100422 to 1067405 lines.
2018-03-23_dois_cleaned.txt.gz
