Home

Welcome to the CitationCorpus wiki!

The goal of this first project is to modularize and unify the processing pipelines of the arXiv and PubMed Central (PMC) import scripts written by Alex Dutton and me (Heinrich), in order to make them more reliable and easier to adapt for the continuous import that will be added later on.

We will use a message queue system in the background, based on redis (@Ben is that correct? redis seems to be a key-value store).
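On the question above: redis is a key-value store, but its list type supports push and pop operations, so a list can serve as a simple work queue. The sketch below shows that pattern with an in-memory stand-in so that it runs without a server. The queue name `citationcorpus:import`, the job fields and the helper names are hypothetical; with the redis-py client the corresponding calls would be `lpush` and `rpop` (or the blocking `brpop`) on a `redis.Redis` connection.

```python
import json
from collections import deque


class InMemoryQueue:
    """Stand-in for a redis list used as a FIFO queue (not a real redis client)."""

    def __init__(self):
        self._lists = {}

    def lpush(self, name, value):
        # Push onto the head of the list, like the redis LPUSH command.
        self._lists.setdefault(name, deque()).appendleft(value)
        return len(self._lists[name])

    def rpop(self, name):
        # Pop from the tail, like the redis RPOP command; None when empty.
        items = self._lists.get(name)
        return items.pop() if items else None


QUEUE = "citationcorpus:import"  # hypothetical queue name


def enqueue_article(queue, source, path):
    # One job per article file; the job layout is an assumption.
    job = {"source": source, "path": path}
    queue.lpush(QUEUE, json.dumps(job))


def next_job(queue):
    # Jobs come out in the order they were enqueued.
    raw = queue.rpop(QUEUE)
    return json.loads(raw) if raw is not None else None
```

A preprocessor worker would call `next_job` in a loop and dispatch on the `source` field.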

Outline of the processing pipeline

Import

Input Sources                                      Intermediate format

{arXiv *.tar.gz file} ---[Arxiv Preprocessor]--+
                                               +----> {BibJSON}
{PMC *.NXML file} -------[PMC Preproc.]--------+

The starting point of the pipeline is a single article, packed as a tar.gz file in the arXiv case and as an NXML file for PMC articles. These files will be supplied by an extraction script, which will be written separately.
The endpoint is a BibJSON file containing the metadata and reference list of each individual article.
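To make the intermediate format concrete, here is a minimal sketch of what one such record could look like. The `title`, `author` and `identifier` keys follow common BibJSON usage; keeping the reference list under a `citation` key with a `raw` string per entry is an assumption, the helper `make_record` is hypothetical, and all values are placeholders.

```python
import json


def make_record(title, authors, identifiers, citations):
    """Build a BibJSON-style dict for one article."""
    return {
        "title": title,
        "author": [{"name": name} for name in authors],
        "identifier": [{"type": t, "id": i} for t, i in identifiers],
        # Assumption: the reference list is kept under a "citation" key.
        "citation": list(citations),
    }


record = make_record(
    title="Example article title",  # placeholder values throughout
    authors=["A. Author", "B. Author"],
    identifiers=[("arxiv", "0000.0000")],
    citations=[{"raw": "C. Author, Some cited work, 2001."}],
)
print(json.dumps(record, indent=2, ensure_ascii=False))
```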

Question: What to do with the citation targets?
a. Create a BibJSON file for each target
b. Keep citation strings in the BibJSON of the citing article.

Con a: If an arXiv/PMC article gets cited, we get duplicate BibJSON files!
Con b: Unsystematic. If we have a lot of metadata for the citation, we need to create a BibJSON file in any case.

Processing

{BibJSON} --[Augmentation]--[Cleanup]--[Clustering]--> {BibJSON}
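The diagram above can be read as a chain of stages that each take BibJSON records and return BibJSON records. A minimal sketch of that idea follows; the stage bodies are placeholders only and do not reflect the existing scripts, and the function names are hypothetical.

```python
from functools import reduce


def augment(records):
    # Placeholder: would look up extra metadata for citation targets.
    return [dict(r, augmented=True) for r in records]


def cleanup(records):
    # Placeholder: only trims whitespace in titles here.
    return [dict(r, title=r.get("title", "").strip()) for r in records]


def cluster(records):
    # Placeholder: groups nothing, passes the records through unchanged.
    return list(records)


def run_pipeline(records, stages=(augment, cleanup, cluster)):
    """Apply each stage in order: BibJSON in, BibJSON out."""
    return reduce(lambda recs, stage: stage(recs), stages, records)
```

Because every stage has the same input and output type, stages can be reordered, skipped or tested in isolation.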

Augmentation:
- Query web APIs for additional information about the citation target (currently PMC only)
- Match citation strings against the metadata database (currently arXiv only)
Cleanup:
- As done by Alex: restore URLs, ...
Clustering: Which citation targets are actually the same?
- Create a new, authoritative record, with a note saying which records it replaces.
- Apply the merging strategies used by Alex, e.g.
  - many authors are better than few
  - keep all accents ...
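The two merging strategies named above could be sketched as follows. This is an illustration under stated assumptions, not Alex's implementation: "keep accents" is approximated by preferring a title that contains non-ASCII characters, and the `id` and `replaces` fields are hypothetical.

```python
def has_accents(text):
    # Approximation: any non-ASCII character counts as an accent.
    return any(ord(ch) > 127 for ch in text)


def merge_cluster(records):
    """Merge a non-empty list of records believed to describe the same work."""
    # Many authors are better than few: keep the longest author list.
    authors = max((r.get("author", []) for r in records), key=len)
    # Keep accents: prefer a title variant with non-ASCII characters.
    titles = [r["title"] for r in records if r.get("title")]
    title = next((t for t in titles if has_accents(t)), titles[0] if titles else "")
    return {
        "title": title,
        "author": authors,
        # Hypothetical field noting which records this one replaces.
        "replaces": [r["id"] for r in records if "id" in r],
    }
```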

Question: What do we do with authors? Do we cluster them, too, in some way? We might have some email addresses.

Export

 {BibJSON} --[RDF converter]--> {RDF/XML/nq/ttl}

Use David's mapping to get nice semantic data.
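As a rough illustration of the export step, the sketch below writes the title and author names of one record as N-Triples. It does not use David's mapping: the Dublin Core element properties are a stand-in vocabulary, and the subject IRI and function names are hypothetical.

```python
DC = "http://purl.org/dc/elements/1.1/"  # assumption: stand-in vocabulary


def nt_literal(text):
    # Escape backslashes, quotes and line breaks for an N-Triples literal.
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def record_to_ntriples(subject_iri, record):
    """Return N-Triples lines for the title and author names of one record."""
    lines = []
    if record.get("title"):
        lines.append(f"<{subject_iri}> <{DC}title> {nt_literal(record['title'])} .")
    for author in record.get("author", []):
        lines.append(f"<{subject_iri}> <{DC}creator> {nt_literal(author['name'])} .")
    return "\n".join(lines)
```

For the other target formats (RDF/XML, Turtle) an RDF library would be a better fit than hand-written serialization.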

Others

By Rene, regarding email addresses and FOAF profiles: we can probably have a linked open data crawler or a linked open data search engine look for RDF FOAF resources of the authors that we matched correctly, and get other information from them!
