From 4c1b910f0bfd1ac03746d1f6577ec14c389165a4 Mon Sep 17 00:00:00 2001
From: Erik Jaaniso
Date: Sat, 3 Oct 2020 01:10:41 +0300
Subject: [PATCH] update docs

---
 docs/api.rst      |  2 +-
 docs/cli.rst      |  4 ++--
 docs/conf.py      |  2 +-
 docs/fetcher.rst  |  6 +++---
 docs/intro.rst    |  2 +-
 docs/output.rst   | 10 +++-------
 docs/scraping.rst |  2 +-
 7 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/docs/api.rst b/docs/api.rst
index 5213857..01805e5 100644
--- a/docs/api.rst
+++ b/docs/api.rst
@@ -45,7 +45,7 @@ Fetcher contains the public method "getDoc", which is described in :ref:`Getting

 The Fetcher methods "initPublication" and "initWebpage" must be used to construct a Publication and Webpage. Then, the methods "getPublication" and "getWebpage" can be used to fetch the Publication and Webpage. But instead of these "init" and "get" methods, the "getPublication", "getWebpage" and "getDoc" methods of class `PubFetcher `_ should be used, when possible.

-Because executing JavaScript is prone to serious bugs in the used `HtmlUnit `_ library, fetching a HTML document with JavaScript support turned on is done in a separate `JavaScriptThread `_, that can be killed if it gets stuck.
+Because executing JavaScript is prone to serious bugs in the used `HtmlUnit `_ library, fetching a HTML document with JavaScript support turned on is done in a separate `JavaScriptThread `_, which can be killed if it gets stuck.

 The `HtmlMeta class `_ is explained in :ref:`Meta ` and the `Links class `_ in :ref:`Links `.

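The separate, killable thread described in the hunk above is, in essence, a task run under a hard timeout. A minimal sketch of that general pattern in Java (JDK classes only; this illustrates the idea, not PubFetcher's actual JavaScriptThread implementation, which may differ)::

  import java.util.concurrent.*;

  public class TimedFetch {
      // Run a potentially never-returning task (e.g. JavaScript execution)
      // under a hard timeout, abandoning and interrupting it if it gets stuck
      public static String fetchWithTimeout(Callable<String> task, long timeoutSeconds)
              throws Exception {
          ExecutorService executor = Executors.newSingleThreadExecutor();
          Future<String> future = executor.submit(task);
          try {
              return future.get(timeoutSeconds, TimeUnit.SECONDS);
          } catch (TimeoutException e) {
              future.cancel(true); // interrupt the stuck thread
              return null;
          } finally {
              executor.shutdownNow();
          }
      }
  }
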
diff --git a/docs/cli.rst b/docs/cli.rst
index bc21ba9..6531d1e 100644
--- a/docs/cli.rst
+++ b/docs/cli.rst
@@ -53,7 +53,7 @@ _`fetchExceptionCooldown`  ``1440``          If that many minutes have passe
 _`retryLimit`              ``3``             How many times can fetching be retried for an entry that is still empty, non-final or has a :ref:`fetchException ` after the initial attempt. Setting to ``0`` will disable retrying, unless the :ref:`retryCounter ` is reset by a cooldown in which case one initial attempt is allowed again. Setting to a negative value will disable this upper limit.
 _`titleMinLength`          ``4``     ``0``   Minimum length of a :ref:`usable ` :ref:`publication ` :ref:`title `
 _`keywordsMinSize`         ``2``     ``0``   Minimum size of a :ref:`usable ` :ref:`publication ` :ref:`keywords `/:ref:`MeSH ` list
-_`minedTermsMinSize`       ``1``     ``0``   Minimum size of a :ref:`usable ` :ref:`publication ` :ref:`EFO `/:ref:`GO ` terms list
+_`minedTermsMinSize`       ``1``     ``0``   Minimum size of a :ref:`usable ` :ref:`publication ` :ref:`EFO `/:ref:`GO ` terms list
 _`abstractMinLength`       ``200``   ``0``   Minimum length of a :ref:`usable ` :ref:`publication ` :ref:`abstract `
 _`fulltextMinLength`       ``2000``  ``0``   Minimum length of a :ref:`usable ` :ref:`publication ` :ref:`fulltext `
 _`webpageMinLength`        ``50``    ``0``   Minimum length of a :ref:`usable webpage ` combined :ref:`title ` and :ref:`content `
@@ -423,7 +423,7 @@ Conditions that :ref:`publication part `\ s must meet for the

 Each parameter (except ``-part-empty``, ``-not-part-empty``, ``-part-usable``, ``-not-part-usable``, ``-part-final``, ``-not-part-final``) has a corresponding parameter specifying the publication parts that need to meet the condition given by the parameter. For example, ``-part-content`` gives a regular expression and ``-part-content-part`` lists all publication parts that must have a match with the given regular expression. If ``-part-content`` is specified, then ``-part-content-part`` must also be specified (and vice versa).

-A publication part is any of: :ref:`the pmid `, :ref:`the pmcid `, :ref:`the doi `, :ref:`title `, :ref:`keywords `, :ref:`MeSH `, :ref:`EFO `, :ref:`GO `, :ref:`theAbstract `, :ref:`fulltext `.
+A publication part is any of: :ref:`the pmid `, :ref:`the pmcid `, :ref:`the doi `, :ref:`title `, :ref:`keywords `, :ref:`MeSH `, :ref:`EFO `, :ref:`GO `, :ref:`theAbstract `, :ref:`fulltext `.

 ======================== ==================================================== ===========
 Parameter                Parameter args                                       Description
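Taken together, retryLimit_, fetchExceptionCooldown_ and the entry's retryCounter imply a refetch decision along the following lines. A rough sketch (the Entry class and its fields are hypothetical stand-ins; PubFetcher's actual decision logic has more inputs, such as the other cooldown parameters)::

  // Hypothetical stand-in for a stored entry; field names are illustrative only
  class Entry {
      boolean empty, finalContent, fetchException;
      int retryCounter;      // fetching attempts made after the initial one
      long lastFetchMinutes; // time of the last fetching attempt, in minutes
  }

  class RefetchDecision {
      // Sketch of the decision implied by retryLimit and fetchExceptionCooldown
      static boolean canFetch(Entry e, long nowMinutes, int retryLimit, int cooldownMinutes) {
          if (!(e.empty || !e.finalContent || e.fetchException)) {
              return false; // nothing left to improve
          }
          if (e.fetchException && nowMinutes - e.lastFetchMinutes >= cooldownMinutes) {
              e.retryCounter = 0; // cooldown elapsed: one initial attempt is allowed again
              return true;
          }
          if (retryLimit < 0) {
              return true; // a negative value disables the upper limit
          }
          return e.retryCounter < retryLimit; // with 0, plain retrying is disabled
      }
  }
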
diff --git a/docs/conf.py b/docs/conf.py
index 26d1545..085847f 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -3,7 +3,7 @@
 project = 'PubFetcher'
 author = 'Erik Jaaniso'

-copyright = '2018-2019, Erik Jaaniso'
+copyright = '2018-2020, Erik Jaaniso'

 version = '1.0.1-SNAPSHOT'
 release = '1.0.1-SNAPSHOT'
diff --git a/docs/fetcher.rst b/docs/fetcher.rst
index 15fc1c9..fb8af97 100644
--- a/docs/fetcher.rst
+++ b/docs/fetcher.rst
@@ -16,7 +16,7 @@ Low-level methods

 Getting a HTML document
 =======================

-Fetching HTML (or XML) resources for both :ref:`publications ` and :ref:`webpages `/:ref:`docs ` is done in the same method, where either the `jsoup `_ or `HtmlUnit `_ libraries are used for getting the document. The HtmlUnit library has the advantage of supporting JavaScript, which needs to be executed to get the proper output for many sites, and it also works for some sites with problematic SSL certificates. As a disadvantage, it is a lot slower than jsoup, which is why using jsoup is the default and HtmlUnit is used only if JavaScript support is requested (or switched to automatically in case of some SSL exceptions). Also, fetching with JavaScript can get stuck for a few rare sites, in which case the misbehaving HtmlUnit code is terminated.
+Fetching HTML (or XML) resources for both :ref:`publications ` and :ref:`webpages `/:ref:`docs ` is done in the same method, where either the `jsoup `_ or `HtmlUnit `_ libraries are used for getting the document. The HtmlUnit library has the advantage of supporting JavaScript, which needs to be executed to get the proper output for many sites, and it also works for some sites with problematic SSL certificates. As a disadvantage, it is a lot slower than jsoup, which is why using jsoup is the default and HtmlUnit is used only if JavaScript support is requested (or switched to automatically in case of some SSL exceptions). Also, fetching with JavaScript can get stuck for a few rare sites, in which case the misbehaving HtmlUnit code is terminated.

 Supplied :ref:`fetching ` parameters :ref:`timeout ` and :ref:`userAgent ` are used for setting the connect timeout and the read timeout and the User-Agent HTTP header of connections. If getting the HTML document for a publication is successful and a list of already fetched links is supplied, then the current URL will be added to that list so that it is not tried again for the current publication. The successfully fetched document is returned to the caller for further processing.
@@ -123,7 +123,7 @@ The API is primarily meant for getting the fulltext_, but it can also be used to

 Europe PMC mined
 ----------------

-Europe PMC has text-mined terms from publication full texts. Such EFO terms can be obtained from https://www.ebi.ac.uk/europepmc/webservices/rest/PMC/{PMCID}/textMinedTerms/EFO or https://www.ebi.ac.uk/europepmc/webservices/rest/MED/{PMID}/textMinedTerms/EFO and GO terms can be obtained from the same URLs where "EFO" is replaced with "GO_TERM". These resources are the only way to fill the `publication parts`_ efo_ and go_ and only those publication parts can be obtained from these resources. Either a PMID_ or a PMCID_ is required to query these resources.
+Europe PMC has text-mined terms from publication full texts. These can be fetched from the API endpoint https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds; documentation of the Annotations API is at https://europepmc.org/AnnotationsApi. These resources are the only way to fill the `publication parts`_ efo_ and go_, and only those publication parts can be obtained from these resources (type "Gene Ontology" is used for GO and type "Experimental Methods" for EFO). Either a PMID_ or a PMCID_ is required to query these resources.

 .. _pubmed_xml:
@@ -320,7 +320,7 @@ _`mesh`
 _`efo`
   .. _fetcher_efo:

-  `Experimental factor ontology `_ terms of the publication. Text-mined by the `Europe PMC `_ project from the full text of the article. The :ref:`efo structure `.
+  `Experimental factor ontology `_ terms of the publication (but also experimental methods terms from other ontologies like `Molecular Interactions Controlled Vocabulary `_ and `Ontology for Biomedical Investigations `_). Text-mined by the `Europe PMC `_ project from the full text of the article. The :ref:`efo structure `.

 _`go`
   .. _fetcher_go:
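As an aside, the new Annotations API endpoint introduced above can be queried directly. A minimal Java sketch (the article ID is an arbitrary example; the query parameters used here are those described in the Annotations API documentation linked above)::

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;

  public class AnnotationsApiExample {
      public static void main(String[] args) throws Exception {
          // Ask for text-mined GO terms of one article; the "MED:" prefix means
          // a PMID follows (the concrete ID here is just an arbitrary example).
          // For EFO, type would be "Experimental Methods", as noted above.
          String url = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
                  + "?articleIds=MED%3A28585529"
                  + "&type=Gene%20Ontology"
                  + "&format=JSON";
          HttpResponse<String> response = HttpClient.newHttpClient().send(
                  HttpRequest.newBuilder(URI.create(url)).GET().build(),
                  HttpResponse.BodyHandlers.ofString());
          System.out.println(response.body()); // JSON containing the mined terms
      }
  }
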
diff --git a/docs/intro.rst b/docs/intro.rst
index 6e74d4a..8f1385e 100644
--- a/docs/intro.rst
+++ b/docs/intro.rst
@@ -16,7 +16,7 @@ Ideally, all scientific literature would be open and easily accessible through o

 The speed of downloading, when :ref:`multithreading ` is enabled, is roughly one publication per second. This limitation, along with the desire to not overburden the used APIs and publisher sites, means that PubFetcher is best used for medium-scale processing of publications, where the number of entries is in the thousands and not in the millions, but where the largest amount of completeness for these few thousand publications is desired. If millions of publications are required, then it is better to restrict oneself to the Open Access subset, which can be downloaded in bulk: https://europepmc.org/downloads.

-In addition to the main content of a publication (:ref:`title `, :ref:`abstract ` and :ref:`full text `), PubFetcher supports getting different keywords about the publication: the :ref:`user-assigned keywords `, the :ref:`MeSH terms ` as assigned in PubMed and :ref:`EFO terms ` and :ref:`GO terms ` as mined from the full text by Europe PMC. Each publication has up to three identificators: a :ref:`PMID `, a :ref:`PMCID ` and a :ref:`DOI `. In addition, different metadata (found from the different :ref:`resources `) about a publication is saved, like whether the article is :ref:`Open Access `, the :ref:`journal ` where it was published, the :ref:`publication date `, etc. The :ref:`source ` of each :ref:`publication part ` is remembered, with content from a higher confidence resource potentially overwriting the current content. It is possible to fetch only some :ref:`publication parts ` (thus avoiding querying some :ref:`resources `) and there is :ref:`an algorithm ` to determine if an already existing entry should be refetched or is it complete enough. Fetching and :ref:`extracting ` of content is done using various Java libraries with support for :ref:`JavaScript ` and :ref:`PDF ` files. The downloaded publications can be persisted to disk to a :ref:`key-value store ` for later analysis. A number of :ref:`built-in rules ` are included (along with :ref:`tests `) for :ref:`scraping ` publication parts from publisher sites, but additional rules can also be defined. Currently, there is support for around 50 publishers of journals and 25 repositories of tools and tools' metadata and documentation and around 750 test cases for the rules have been defined. If no rules are defined for a given site, then :ref:`automatic cleaning ` is applied to get the main content of the page.
+In addition to the main content of a publication (:ref:`title `, :ref:`abstract ` and :ref:`full text `), PubFetcher supports getting different keywords about the publication: the :ref:`user-assigned keywords `, the :ref:`MeSH terms ` as assigned in PubMed and :ref:`EFO terms ` and :ref:`GO terms ` as mined from the full text by Europe PMC. Each publication has up to three identifiers: a :ref:`PMID `, a :ref:`PMCID ` and a :ref:`DOI `. In addition, different metadata (found from the different :ref:`resources `) about a publication is saved, like whether the article is :ref:`Open Access `, the :ref:`journal ` where it was published, the :ref:`publication date `, etc. The :ref:`source ` of each :ref:`publication part ` is remembered, with content from a higher confidence resource potentially overwriting the current content. It is possible to fetch only some :ref:`publication parts ` (thus avoiding querying some :ref:`resources `) and there is :ref:`an algorithm ` to determine whether an already existing entry should be refetched or is already complete enough. Fetching and :ref:`extracting ` of content is done using various Java libraries with support for :ref:`JavaScript ` and :ref:`PDF ` files. The downloaded publications can be persisted to a :ref:`key-value store ` on disk for later analysis. A number of :ref:`built-in rules ` are included (along with :ref:`tests `) for :ref:`scraping ` publication parts from publisher sites, but additional rules can also be defined. Currently, there is support for around 50 journal publishers and 25 repositories of tools and tools' metadata and documentation, and around 750 test cases have been defined for the rules. If no rules are defined for a given site, then :ref:`automatic cleaning ` is applied to get the main content of the page.

 PubFetcher has an extensive :ref:`command-line tool ` to use all of its functionality. It contains a few :ref:`helper operations `, but the main use is the construction of a simple :ref:`pipeline ` for querying, fetching and outputting of publications and general and documentation web pages: first IDs of interest are specified/loaded and filtered, then corresponding content fetched/loaded and filtered, and last it is possible to output the results or store them to a database. Among other functionality, content and all the metadata can be output in :ref:`HTML or plain text `, but also :ref:`exported ` to :ref:`JSON `. All fetching operations can be influenced by a few :ref:`general parameters `. Progress along with error messages is logged to the console and to a :ref:`log file `, if specified. The command-line tool can be :ref:`extended `, for example to add new ways of loading IDs.

diff --git a/docs/output.rst b/docs/output.rst
index a21b730..a377b16 100644
--- a/docs/output.rst
+++ b/docs/output.rst
@@ -294,7 +294,7 @@ publications

 .. _efo:
 efo
-  Publication part representing publication EFO terms. Structure same as in pmid_, except content_ is replaced with "list" and size_ is number of elements in "list".
+  Publication part representing publication EFO and other experimental methods terms. Structure same as in pmid_, except content_ is replaced with "list" and size_ is the number of elements in "list".

   list
     Array of objects representing publication EFO terms
@@ -303,12 +303,8 @@ publications
       Term name
     count
       Number of times the term was mined from full text by :ref:`Europe PMC `
-    altNames
-      Array of strings representing alternative names for the term
-    dbName
-      Database name (e.g., ``efo``, ``GO``)
-    dbIds
-      Array of strings representing term IDs in the database
+    uri
+      Unique URI to the ontology term

 .. _go:
 go
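To make the new structure above concrete, a hypothetical efo part might look roughly like the following in the JSON output (all values are invented for illustration; the other fields inherited from the pmid_ structure are omitted)::

  "efo": {
    "list": [
      {"name": "surface plasmon resonance", "count": 5, "uri": "http://www.ebi.ac.uk/efo/EFO_0000000"},
      {"name": "x-ray crystallography", "count": 2, "uri": "http://www.ebi.ac.uk/efo/EFO_0000001"}
    ],
    "size": 2
  }
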
diff --git a/docs/scraping.rst b/docs/scraping.rst
index 56a3c04..426b2dd 100644
--- a/docs/scraping.rst
+++ b/docs/scraping.rst
@@ -160,7 +160,7 @@ The test files are in a simplified CSV format. The very first line is always ski

 One field must be the publication ID (pmid, pmcid or doi), or URL in case of webpages.csv, defining the entry to be fetched. The other fields are mostly numbers specifying the lengths and sizes that the different entry parts must have. Only comparing the sizes of contents (instead of the content itself or instead of using checksums) is rather simplistic, but easy to specify and probably enough for detecting changes in resources that need correcting. What fields (besides the ID) are present in a concrete test depend on what can be obtained from the corresponding resource.

-Possible fields for publications are the following: length of publication parts :ref:`pmid `, :ref:`pmcid `, :ref:`doi `, :ref:`title `, :ref:`theAbstract ` and :ref:`fulltext `; size (i.e., number of keywords) of publication parts :ref:`keywords `, :ref:`mesh `, :ref:`efo ` and :ref:`go `; length of the entire :ref:`correspAuthor ` string (containing all corresponding authors separated by ";") and length of the :ref:`journalTitle `; number of :ref:`visitedSites `; value of the string :ref:`pubDate `; value of the Boolean :ref:`oa ` (``1`` for ``true`` and ``0`` for ``false``). Every field is a number, except :ref:`pubDate ` where the actual date string must be specified (e.g., ``2018-08-24``). Also, in the tests, the number of :ref:`visitedSites ` is not the actual number of sites visited, but the number of links that were found on the tested page and added manually to the publication by the test routine. For webpages.csv, the fields (beside the ID/URL) are the following: length of the :ref:`webpage title `, the :ref:`webpage content `, the :ref:`software license ` name and length of the :ref:`programming language ` name.
+Possible fields for publications are the following: length of publication parts :ref:`pmid `, :ref:`pmcid `, :ref:`doi `, :ref:`title `, :ref:`theAbstract ` and :ref:`fulltext `; size (i.e., number of keywords) of publication parts :ref:`keywords `, :ref:`mesh `, :ref:`efo ` and :ref:`go `; length of the entire :ref:`correspAuthor ` string (containing all corresponding authors separated by ";") and length of the :ref:`journalTitle `; number of :ref:`visitedSites `; value of the string :ref:`pubDate `; value of the Boolean :ref:`oa ` (``1`` for ``true`` and ``0`` for ``false``). Every field is a number, except :ref:`pubDate ` where the actual date string must be specified (e.g., ``2018-08-24``). Also, in the tests, the number of :ref:`visitedSites ` is not the actual number of sites visited, but the number of links that were found on the tested page and added manually to the publication by the test routine. For webpages.csv, the fields (besides the ID/URL) are the following: lengths of the :ref:`webpage title `, the :ref:`webpage content `, the :ref:`software license ` name and the :ref:`programming language ` name.

 The progress of running tests of a CSV is logged. If all tests pass, then the very last log message will be "OK". Otherwise, the last message will be the number of mismatches, i.e. number of times an actual value was not equal to the value in the corresponding field of a test. The concrete failed tests can be found by searching for "ERROR" level messages in the log.

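The pass/fail accounting described in the last paragraph above boils down to counting unequal expected/actual pairs. A self-contained sketch (not PubFetcher's actual test code, which reads the CSV files and fetches the entries itself; here both sides are reduced to plain string maps)::

  import java.util.List;
  import java.util.Map;

  public class CsvTestCheck {
      // Schematic pass/fail logic: expected values come from the rows of one
      // test CSV, actual values from the corresponding fetched entries
      public static int countMismatches(List<Map<String, String>> expectedRows,
              List<Map<String, String>> actualRows) {
          int mismatches = 0;
          for (int i = 0; i < expectedRows.size(); i++) {
              for (Map.Entry<String, String> field : expectedRows.get(i).entrySet()) {
                  if (!field.getValue().equals(actualRows.get(i).get(field.getKey()))) {
                      System.err.println("ERROR: mismatch for " + field.getKey() + " in row " + i);
                      mismatches++;
                  }
              }
          }
          // last message is "OK" on success, otherwise the number of mismatches
          System.out.println(mismatches == 0 ? "OK" : String.valueOf(mismatches));
          return mismatches;
      }
  }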