
Frequently Asked Questions for pygetpapers

What can pygetpapers do?

  • Searches repository or publisher sites for scholarly articles (see the basic example after this list).
  • Iteratively improves queries using dictionaries and the results of previous searches.
  • Provides a unified system that covers many different sites.
  • Integrates with downstream content-mining and analysis tools.
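A minimal example in the style of the pygetpapers README (the query string and output directory here are illustrative):

```
pygetpapers -q "essential oil" -k 50 -o essential_oil_50 -x
```

This searches EuropePMC (the default repository) for "essential oil", downloads up to 50 hits, and saves the fulltext XML of each into the `essential_oil_50` directory.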

Can pygetpapers search repository "FOO"?

  • pygetpapers is modular and designed for RESTful APIs. It has modules for EuropePMC (EPMC) (fulltext), the preprint servers arXiv, bioRxiv, medRxiv and Rxivist, and the metadata server Crossref; the repository is selected at run time (see the example after this list).
  • If you are familiar with the repository's content and its manual search interface, it is relatively easy to add code for a new RESTful repository. Note that the socio-legal aspects are often critical (copyright, server load, etc.).
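For example, to query Crossref instead of the default EuropePMC (this sketch assumes the `--api` repository-selection flag described in the pygetpapers documentation; the query and directory names are illustrative):

```
pygetpapers --api crossref -q "essential oil" -k 20 -o crossref_20
```

Crossref is a metadata server, so this run returns bibliographic metadata rather than full texts.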

Where is my data stored?

  • pygetpapers stores all data (fulltexts, metadata, analyses, etc.) on your own machine, in whatever directory you choose; a typical layout is sketched below.
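As an illustration (an EPMC fulltext download is assumed; exact file names vary by repository and version, and the PMC identifiers here are made up), the output directory holds one subdirectory per article plus an aggregated metadata file:

```
essential_oil_50/
├── eupmc_results.json        # aggregated metadata for the whole query
├── PMC0000001/
│   ├── eupmc_result.json     # per-article metadata
│   └── fulltext.xml          # downloaded full text
└── PMC0000002/
    ├── eupmc_result.json
    └── fulltext.xml
```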

Do I have to know Python?

No. Currently you have to install Python, but there are simple, tested commands for this (see below). Later we may package everything as Docker images or Jupyter Notebooks.
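With Python and pip in place, installation is a single command:

```
pip install pygetpapers
```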

Does pygetpapers store a record of my searches?

Not by default. There is an optional log file which stores the query and records the downloads (see the example below). We are working on integrating pygetpapers into Jupyter Notebooks so that complex workflows can be re-run.
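A sketch of logging a run to a file (this assumes the `--logfile` option listed in the pygetpapers help; the other names are illustrative):

```
pygetpapers -q "essential oil" -k 20 -o essential_oil_20 -x --logfile download.log
```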

Can pygetpapers be run as a server?

It is not currently packaged as a server, but we are prototyping Cloud solutions and adding it to ?[ForgetName]?.

What resource problems does pygetpapers have?

  • pygetpapers is generally embarrassingly parallel. The main resources are bandwidth and remote-server capacity. Several jobs can be run simultaneously, e.g. by slicing the query into publication-date ranges (see the sketch after this list). The main concern is not to overload the remote server and create a denial of service, so be careful.
  • Downloaded files can be quite large (e.g. 20+ MB PDFs), so 10,000 files might take 50 GB or more.
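A sketch of date slicing (this assumes the `--startdate`/`--enddate` flags from the pygetpapers documentation; dates, query, and directory names are illustrative). Two non-overlapping slices are run as separate jobs:

```
pygetpapers -q "essential oil" -k 200 -o eo_2020 -x --startdate 2020-01-01 --enddate 2020-12-31 &
pygetpapers -q "essential oil" -k 200 -o eo_2021 -x --startdate 2021-01-01 --enddate 2021-06-30 &
wait
```

Keep the combined request rate polite: parallelism saves your time but multiplies the load on the remote server.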