Do not match second slash and dot in DOI #11

Closed
wants to merge 1 commit into from

nemobis commented Jul 7, 2017

They are the only reserved characters, according to
https://www.doi.org/doi_handbook/2_Numbering.html#2.5

Cf. #7

halfak (Member) commented Jul 7, 2017

Can you add some test cases to demonstrate that this does the right thing (and not some likely wrong things)?

nemobis (Author) commented Jul 7, 2017

Do you mean in the python-mwcites/datasets/mw_dump_stub.xml file? Maybe, but for now I'll focus on testing a regex that gets good output for me (on it.wiki).

nemobis (Author) commented Jul 7, 2017

A simple grep, à la pbzip2 -dc itwiki-20170620-pages-articles-multistream.xml.bz2 | grep "10\." | grep -Eo '10\.[[:digit:]]+/[^./[:space:]}?,|]+', shows quite a few DOIs with dots in the suffix from a couple of publishers (like 10.1016/j.bcp.2007.07.045), so maybe we should ignore that part.
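
To make the effect concrete, here is a minimal Python sketch of what the character class in the grep above does to a dotted suffix. STRICT mirrors that grep expression (with `\s` standing in for `[[:space:]]`); RELAXED is a hypothetical variant for comparison only, not the pattern python-mwcites actually uses, and the sample wikitext is made up around the DOI quoted above.

```python
import re

# Mirrors the grep expression above: the suffix stops at the first
# dot or slash that follows the "10.<digits>/" prefix.
STRICT = re.compile(r"10\.[0-9]+/[^./\s}?,|]+")

# Hypothetical relaxed variant (an assumption for illustration):
# dots and slashes are allowed inside the suffix.
RELAXED = re.compile(r"10\.[0-9]+/[^\s}?,|]+")

# Made-up wikitext snippet containing the DOI mentioned in this thread.
text = "{{cite journal |doi=10.1016/j.bcp.2007.07.045 |title=Example}}"

strict_match = STRICT.search(text).group(0)
relaxed_match = RELAXED.search(text).group(0)

print(strict_match)   # 10.1016/j
print(relaxed_match)  # 10.1016/j.bcp.2007.07.045
```

The strict pattern truncates the DOI at the first dot, which is the failure seen with these publishers. The relaxed one keeps the whole suffix, but it would also keep a sentence-final period directly after a DOI, so a real pattern would likely need some trailing-punctuation handling.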

nemobis (Author) commented Jul 8, 2017

Let's continue on the issue

nemobis closed this Jul 8, 2017