Commit

update docs
jaanisoe committed Oct 2, 2020
1 parent 120404b commit 4c1b910
Showing 7 changed files with 12 additions and 16 deletions.
2 changes: 1 addition & 1 deletion docs/api.rst
Expand Up @@ -45,7 +45,7 @@ Fetcher contains the public method "getDoc", which is described in :ref:`Getting

The Fetcher methods "initPublication" and "initWebpage" must be used to construct a Publication and Webpage. Then, the methods "getPublication" and "getWebpage" can be used to fetch the Publication and Webpage. But instead of these "init" and "get" methods, the "getPublication", "getWebpage" and "getDoc" methods of class `PubFetcher <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/java/org/edamontology/pubfetcher/core/common/PubFetcher.java>`_ should be used, when possible.

Because executing JavaScript is prone to serious bugs in the used `HtmlUnit <http://htmlunit.sourceforge.net/>`_ library, fetching a HTML document with JavaScript support turned on is done in a separate `JavaScriptThread <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/java/org/edamontology/pubfetcher/core/fetching/JavascriptThread.java>`_, that can be killed if it gets stuck.
Because executing JavaScript is prone to serious bugs in the used `HtmlUnit <https://htmlunit.sourceforge.io/>`_ library, fetching a HTML document with JavaScript support turned on is done in a separate `JavaScriptThread <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/java/org/edamontology/pubfetcher/core/fetching/JavascriptThread.java>`_, that can be killed if it gets stuck.
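The idea of isolating a fetch that may hang can be sketched in a few lines. This is only an illustration in Python, not PubFetcher's Java code: ``run_with_deadline``, ``quick_fetch`` and ``stuck_fetch`` are hypothetical names, and because Python threads cannot be forcibly killed, the sketch abandons a stuck worker instead of terminating it.

```python
import threading
import time


def run_with_deadline(task, timeout_seconds):
    """Run task() in a worker thread; return its result, or None on timeout."""
    result = []

    def worker():
        result.append(task())

    # daemon=True so an abandoned worker cannot keep the process alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)
    if thread.is_alive():
        # Python threads cannot be killed; the stuck worker is abandoned here.
        return None
    return result[0] if result else None


def quick_fetch():
    # Hypothetical stand-in for a fetch that finishes promptly.
    return "<html>ok</html>"


def stuck_fetch():
    # Hypothetical stand-in for a JavaScript engine that does not return in time.
    time.sleep(5)
    return "<html>late</html>"
```

The caller gets ``None`` when the deadline passes and can then fall back to fetching without JavaScript or record a fetching error.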

The `HtmlMeta class <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/java/org/edamontology/pubfetcher/core/fetching/HtmlMeta.java>`_ is explained in :ref:`Meta <meta>` and the `Links class <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/java/org/edamontology/pubfetcher/core/fetching/Links.java>`_ in :ref:`Links <links>`.

4 changes: 2 additions & 2 deletions docs/cli.rst
Expand Up @@ -53,7 +53,7 @@ _`fetchExceptionCooldown` ``1440`` If that many minutes have passe
_`retryLimit` ``3`` How many times can fetching be retried for an entry that is still empty, non-final or has a :ref:`fetchException <fetchexception>` after the initial attempt. Setting to ``0`` will disable retrying, unless the :ref:`retryCounter <retrycounter>` is reset by a cooldown in which case one initial attempt is allowed again. Setting to a negative value will disable this upper limit.
_`titleMinLength` ``4`` ``0`` Minimum length of a :ref:`usable <usable>` :ref:`publication <content_of_publications>` :ref:`title <fetcher_title>`
_`keywordsMinSize` ``2`` ``0`` Minimum size of a :ref:`usable <usable>` :ref:`publication <content_of_publications>` :ref:`keywords <fetcher_keywords>`/:ref:`MeSH <fetcher_mesh>` list
_`minedTermsMinSize` ``1`` ``0`` Minimum size of a :ref:`usable <usable>` :ref:`publication <content_of_publications>` :ref:`EFO <fetcher_efo>`/:ref:`GO <fetcher_go>` terms list
_`minedTermsMinSize` ``1`` ``0`` Minimum size of a :ref:`usable <usable>` :ref:`publication <content_of_publications>` :ref:`EFO <efo>`/:ref:`GO <go>` terms list
_`abstractMinLength` ``200`` ``0`` Minimum length of a :ref:`usable <usable>` :ref:`publication <content_of_publications>` :ref:`abstract <fetcher_theabstract>`
_`fulltextMinLength` ``2000`` ``0`` Minimum length of a :ref:`usable <usable>` :ref:`publication <content_of_publications>` :ref:`fulltext <fetcher_fulltext>`
_`webpageMinLength` ``50`` ``0`` Minimum length of a :ref:`usable webpage <webpage_usable>` combined :ref:`title <webpage_title>` and :ref:`content <webpage_content>`
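Read together, the minimum length and size parameters in the rows above amount to simple threshold checks. The following is a rough sketch under that reading, using the default values listed in the table; the ``MIN_SIZES`` mapping and ``is_usable`` function are hypothetical names, and the real usability logic in PubFetcher may take more into account.

```python
# Default thresholds copied from the table rows above (hypothetical mapping).
MIN_SIZES = {
    "title": 4,           # titleMinLength
    "keywords": 2,        # keywordsMinSize
    "mesh": 2,            # keywordsMinSize also covers MeSH lists
    "efo": 1,             # minedTermsMinSize
    "go": 1,              # minedTermsMinSize
    "theAbstract": 200,   # abstractMinLength
    "fulltext": 2000,     # fulltextMinLength
}


def is_usable(part, content, min_sizes=MIN_SIZES):
    """Return True if a string or list is at least as long as its threshold."""
    return len(content) >= min_sizes[part]
```

For string parts the length is counted in characters and for list parts in elements, so ``is_usable("title", "abc")`` is false while ``is_usable("efo", ["some term"])`` is true.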
Expand Down Expand Up @@ -423,7 +423,7 @@ Conditions that :ref:`publication part <publication_parts>`\ s must meet for the

Each parameter (except ``-part-empty``, ``-not-part-empty``, ``-part-usable``, ``-not-part-usable``, ``-part-final``, ``-not-part-final``) has a corresponding parameter specifying the publication parts that need to meet the condition given by the parameter. For example, ``-part-content`` gives a regular expression and ``-part-content-part`` lists all publication parts that must have a match with the given regular expression. If ``-part-content`` is specified, then ``-part-content-part`` must also be specified (and vice versa).
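The pairing rule for ``-part-content`` and ``-part-content-part`` could be checked along the following lines. This is a minimal sketch, not the actual argument parser: ``check_part_pairing`` is a hypothetical helper and the parsed arguments are assumed to be available as a plain dictionary.

```python
def check_part_pairing(args, condition="-part-content"):
    """Raise ValueError if only one of a condition and its "-part" list is given."""
    part_list = condition + "-part"
    has_condition = condition in args
    has_parts = part_list in args
    if has_condition != has_parts:
        missing = part_list if has_condition else condition
        raise ValueError(f"{missing} must also be specified")
```

Passing both keys, or neither, is accepted; passing only one of them raises an error naming the missing parameter.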

A publication part is any of: :ref:`the pmid <fetcher_pmid>`, :ref:`the pmcid <fetcher_pmcid>`, :ref:`the doi <fetcher_doi>`, :ref:`title <fetcher_title>`, :ref:`keywords <fetcher_keywords>`, :ref:`MeSH <fetcher_mesh>`, :ref:`EFO <fetcher_efo>`, :ref:`GO <fetcher_go>`, :ref:`theAbstract <fetcher_theabstract>`, :ref:`fulltext <fetcher_fulltext>`.
A publication part is any of: :ref:`the pmid <fetcher_pmid>`, :ref:`the pmcid <fetcher_pmcid>`, :ref:`the doi <fetcher_doi>`, :ref:`title <fetcher_title>`, :ref:`keywords <fetcher_keywords>`, :ref:`MeSH <fetcher_mesh>`, :ref:`EFO <efo>`, :ref:`GO <go>`, :ref:`theAbstract <fetcher_theabstract>`, :ref:`fulltext <fetcher_fulltext>`.

======================== ==================================================== ===========
Parameter Parameter args Description
2 changes: 1 addition & 1 deletion docs/conf.py
Expand Up @@ -3,7 +3,7 @@

project = 'PubFetcher'
author = 'Erik Jaaniso'
copyright = '2018-2019, Erik Jaaniso'
copyright = '2018-2020, Erik Jaaniso'
version = '1.0.1-SNAPSHOT'
release = '1.0.1-SNAPSHOT'

6 changes: 3 additions & 3 deletions docs/fetcher.rst
Expand Up @@ -16,7 +16,7 @@ Low-level methods
Getting a HTML document
=======================

Fetching HTML (or XML) resources for both :ref:`publications <publications>` and :ref:`webpages <webpages>`/:ref:`docs <docs>` is done in the same method, where either the `jsoup <https://jsoup.org/>`_ or `HtmlUnit <http://htmlunit.sourceforge.net/>`_ libraries are used for getting the document. The HtmlUnit library has the advantage of supporting JavaScript, which needs to be executed to get the proper output for many sites, and it also works for some sites with problematic SSL certificates. As a disadvantage, it is a lot slower than jsoup, which is why using jsoup is the default and HtmlUnit is used only if JavaScript support is requested (or switched to automatically in case of some SSL exceptions). Also, fetching with JavaScript can get stuck for a few rare sites, in which case the misbehaving HtmlUnit code is terminated.
Fetching HTML (or XML) resources for both :ref:`publications <publications>` and :ref:`webpages <webpages>`/:ref:`docs <docs>` is done in the same method, where either the `jsoup <https://jsoup.org/>`_ or `HtmlUnit <https://htmlunit.sourceforge.io/>`_ libraries are used for getting the document. The HtmlUnit library has the advantage of supporting JavaScript, which needs to be executed to get the proper output for many sites, and it also works for some sites with problematic SSL certificates. As a disadvantage, it is a lot slower than jsoup, which is why using jsoup is the default and HtmlUnit is used only if JavaScript support is requested (or switched to automatically in case of some SSL exceptions). Also, fetching with JavaScript can get stuck for a few rare sites, in which case the misbehaving HtmlUnit code is terminated.

Supplied :ref:`fetching <fetching>` parameters :ref:`timeout <timeout>` and :ref:`userAgent <useragent>` are used for setting the connect timeout and the read timeout and the User-Agent HTTP header of connections. If getting the HTML document for a publication is successful and a list of already fetched links is supplied, then the current URL will be added to that list so that it is not tried again for the current publication. The successfully fetched document is returned to the caller for further processing.

Expand Down Expand Up @@ -123,7 +123,7 @@ The API is primarily meant for getting the fulltext_, but it can also be used to
Europe PMC mined
----------------

Europe PMC has text-mined terms from publication full texts. Such EFO terms can be obtained from https://www.ebi.ac.uk/europepmc/webservices/rest/PMC/{PMCID}/textMinedTerms/EFO or https://www.ebi.ac.uk/europepmc/webservices/rest/MED/{PMID}/textMinedTerms/EFO and GO terms can be obtained from the same URLs where "EFO" is replaced with "GO_TERM". These resources are the only way to fill the `publication parts`_ efo_ and go_ and only those publication parts can be obtained from these resources. Either a PMID_ or a PMCID_ is required to query these resources.
Europe PMC has text-mined terms from publication full texts. These can be fetched from the API endpoint https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds, documentation of the Annotations API is at https://europepmc.org/AnnotationsApi. These resources are the only way to fill the `publication parts`_ efo_ and go_ and only those publication parts can be obtained from these resources (type "Gene Ontology" is used for GO and type "Experimental Methods" for EFO). Either a PMID_ or a PMCID_ is required to query these resources.
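As an illustration of querying the endpoint named above, a request URL could be assembled as follows. The endpoint and the two annotation types come from the text; the query parameter names (``articleIds``, ``type``, ``format``) and the ``SOURCE:ID`` form of the article ID are assumptions that should be checked against the Annotations API documentation, and ``annotations_url`` is a hypothetical helper.

```python
from urllib.parse import urlencode

ANNOTATIONS_URL = (
    "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
)

# Annotation type used for each publication part, as stated in the text.
ANNOTATION_TYPES = {"go": "Gene Ontology", "efo": "Experimental Methods"}


def annotations_url(part, pmid=None, pmcid=None):
    """Build a query URL for the mined terms of one publication part."""
    if part not in ANNOTATION_TYPES:
        raise ValueError("part must be 'efo' or 'go'")
    if pmid:
        article_id = f"MED:{pmid}"    # assumed SOURCE:ID form for a PMID
    elif pmcid:
        article_id = f"PMC:{pmcid}"   # assumed SOURCE:ID form for a PMCID
    else:
        raise ValueError("either a PMID or a PMCID is required")
    # Parameter names below are assumptions, not confirmed by the text.
    query = urlencode(
        {"articleIds": article_id, "type": ANNOTATION_TYPES[part], "format": "JSON"}
    )
    return f"{ANNOTATIONS_URL}?{query}"
```

The sketch only builds the URL; fetching and parsing the response are left out because the response layout is not described here.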

.. _pubmed_xml:

Expand Down Expand Up @@ -320,7 +320,7 @@ _`mesh`
_`efo`
.. _fetcher_efo:

`Experimental factor ontology <https://www.ebi.ac.uk/efo/>`_ terms of the publication. Text-mined by the `Europe PMC <https://europepmc.org/>`_ project from the full text of the article. The :ref:`efo structure <efo>`.
`Experimental factor ontology <https://www.ebi.ac.uk/efo/>`_ terms of the publication (but also experimental methods terms from other ontologies like `Molecular Interactions Controlled Vocabulary <https://github.com/HUPO-PSI/psi-mi-CV>`_ and `Ontology for Biomedical Investigations <http://obi-ontology.org/>`_). Text-mined by the `Europe PMC <https://europepmc.org/>`_ project from the full text of the article. The :ref:`efo structure <efo>`.
_`go`
.. _fetcher_go:

2 changes: 1 addition & 1 deletion docs/intro.rst
Expand Up @@ -16,7 +16,7 @@ Ideally, all scientific literature would be open and easily accessible through o

The speed of downloading, when :ref:`multithreading <multithreaded>` is enabled, is roughly one publication per second. This limitation, along with the desire to not overburden the used APIs and publisher sites, means that PubFetcher is best used for medium-scale processing of publications, where the number of entries is in the thousands and not in the millions, but where the largest amount of completeness for these few thousand publications is desired. If millions of publications are required, then it is better to restrict oneself to the Open Access subset, which can be downloaded in bulk: https://europepmc.org/downloads.

In addition to the main content of a publication (:ref:`title <fetcher_title>`, :ref:`abstract <fetcher_theabstract>` and :ref:`full text <fetcher_fulltext>`), PubFetcher supports getting different keywords about the publication: the :ref:`user-assigned keywords <fetcher_keywords>`, the :ref:`MeSH terms <fetcher_mesh>` as assigned in PubMed and :ref:`EFO terms <fetcher_efo>` and :ref:`GO terms <fetcher_go>` as mined from the full text by Europe PMC. Each publication has up to three identificators: a :ref:`PMID <fetcher_pmid>`, a :ref:`PMCID <fetcher_pmcid>` and a :ref:`DOI <fetcher_doi>`. In addition, different metadata (found from the different :ref:`resources <resources>`) about a publication is saved, like whether the article is :ref:`Open Access <oa>`, the :ref:`journal <journaltitle>` where it was published, the :ref:`publication date <pubdate>`, etc. The :ref:`source <publication_types>` of each :ref:`publication part <publication_parts>` is remembered, with content from a higher confidence resource potentially overwriting the current content. It is possible to fetch only some :ref:`publication parts <publication_parts>` (thus avoiding querying some :ref:`resources <resources>`) and there is :ref:`an algorithm <can_fetch>` to determine if an already existing entry should be refetched or is it complete enough. Fetching and :ref:`extracting <selecting>` of content is done using various Java libraries with support for :ref:`JavaScript <getting_a_html_document>` and :ref:`PDF <getting_a_pdf_document>` files. The downloaded publications can be persisted to disk to a :ref:`key-value store <database>` for later analysis. A number of :ref:`built-in rules <rules_in_yaml>` are included (along with :ref:`tests <testing_of_rules>`) for :ref:`scraping <scraping>` publication parts from publisher sites, but additional rules can also be defined. 
Currently, there is support for around 50 publishers of journals and 25 repositories of tools and tools' metadata and documentation and around 750 test cases for the rules have been defined. If no rules are defined for a given site, then :ref:`automatic cleaning <cleaning>` is applied to get the main content of the page.
In addition to the main content of a publication (:ref:`title <fetcher_title>`, :ref:`abstract <fetcher_theabstract>` and :ref:`full text <fetcher_fulltext>`), PubFetcher supports getting different keywords about the publication: the :ref:`user-assigned keywords <fetcher_keywords>`, the :ref:`MeSH terms <fetcher_mesh>` as assigned in PubMed and :ref:`EFO terms <efo>` and :ref:`GO terms <go>` as mined from the full text by Europe PMC. Each publication has up to three identificators: a :ref:`PMID <fetcher_pmid>`, a :ref:`PMCID <fetcher_pmcid>` and a :ref:`DOI <fetcher_doi>`. In addition, different metadata (found from the different :ref:`resources <resources>`) about a publication is saved, like whether the article is :ref:`Open Access <oa>`, the :ref:`journal <journaltitle>` where it was published, the :ref:`publication date <pubdate>`, etc. The :ref:`source <publication_types>` of each :ref:`publication part <publication_parts>` is remembered, with content from a higher confidence resource potentially overwriting the current content. It is possible to fetch only some :ref:`publication parts <publication_parts>` (thus avoiding querying some :ref:`resources <resources>`) and there is :ref:`an algorithm <can_fetch>` to determine if an already existing entry should be refetched or is it complete enough. Fetching and :ref:`extracting <selecting>` of content is done using various Java libraries with support for :ref:`JavaScript <getting_a_html_document>` and :ref:`PDF <getting_a_pdf_document>` files. The downloaded publications can be persisted to disk to a :ref:`key-value store <database>` for later analysis. A number of :ref:`built-in rules <rules_in_yaml>` are included (along with :ref:`tests <testing_of_rules>`) for :ref:`scraping <scraping>` publication parts from publisher sites, but additional rules can also be defined. 
Currently, there is support for around 50 publishers of journals and 25 repositories of tools and tools' metadata and documentation and around 750 test cases for the rules have been defined. If no rules are defined for a given site, then :ref:`automatic cleaning <cleaning>` is applied to get the main content of the page.

PubFetcher has an extensive :ref:`command-line tool <cli>` to use all of its functionality. It contains a few :ref:`helper operations <simple_one_off_operations>`, but the main use is the construction of a simple :ref:`pipeline <pipeline>` for querying, fetching and outputting of publications and general and documentation web pages: first IDs of interest are specified/loaded and filtered, then corresponding content fetched/loaded and filtered, and last it is possible to output the results or store them to a database. Among other functionality, content and all the metadata can be output in :ref:`HTML or plain text <html_and_plain_text_output>`, but also :ref:`exported <export_to_json>` to :ref:`JSON <json_output>`. All fetching operations can be influenced by a few :ref:`general parameters <general_parameters>`. Progress along with error messages is logged to the console and to a :ref:`log file <log_file>`, if specified. The command-line tool can be :ref:`extended <cli_extended>`, for example to add new ways of loading IDs.

10 changes: 3 additions & 7 deletions docs/output.rst
Expand Up @@ -294,7 +294,7 @@ publications

.. _efo:
efo
Publication part representing publication EFO terms. Structure same as in pmid_, except content_ is replaced with "list" and size_ is number of elements in "list".
Publication part representing publication EFO and other experimental methods terms. Structure same as in pmid_, except content_ is replaced with "list" and size_ is number of elements in "list".

list
Array of objects representing publication EFO terms
Expand All @@ -303,12 +303,8 @@ publications
Term name
count
Number of times the term was mined from full text by :ref:`Europe PMC <europe_pmc>`
altNames
Array of strings representing alternative names for the term
dbName
Database name (e.g., ``efo``, ``GO``)
dbIds
Array of strings representing term IDs in the database
uri
Unique URI to the ontology term

.. _go:
go