Home
Welcome to the CEVOpen wiki! This page outlines the key components of the project and is intentionally kept short. To learn more, browse the wiki pages of this repository and of openVirus (https://github.com/petermr/openVirus/wiki). Please feel free to address any of your questions to us!
- pygetpapers is a scraper developed in Python by Ayush Garg. It is based on getpapers (https://github.com/ContentMine/getpapers), which was written in Node.js. pygetpapers downloads scientific papers, primarily from the Europe PMC repository. You can read more about it here.
- pyami (needs more documentation; still a prototype) is currently being developed by Peter Murray-Rust. The software annotates papers, updates dictionaries and much more. It is central to our workflow. Source code can be found here.
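pygetpapers, described above, retrieves papers mainly from Europe PMC. As a rough illustration of what such a request looks like, the sketch below builds a search URL for the public Europe PMC REST search endpoint. It is not pygetpapers' own code: the helper name `build_epmc_search_url` and the example query are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_epmc_search_url(query, page_size=10):
    """Return a search URL asking Europe PMC for JSON results.

    Hypothetical helper for illustration only; pygetpapers may
    build its requests differently.
    """
    params = {"query": query, "format": "json", "pageSize": page_size}
    return EPMC_SEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    # The query text is an assumption, not taken from the project.
    print(build_epmc_search_url("essential oil", page_size=5))
```

Opening the printed URL in a browser should show the raw JSON that a scraper would then page through and save.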
Currently, our projects are based on building dictionaries. Each intern maintains their own dictionary, which is usually relevant to essential oils (EOs). The current list is:
- Radhu - oil-producing plants
- Radhu - biological activities of EOs
- Kanishka - Invasive plant species
- Talha - EO compounds
- Vasant - Plant parts
- Shweata - Plant Genera
- Shweata - organizations (e.g. Research Funders, Universities)
- Ambreen - Countries
- Most dictionaries are created from Wikidata SPARQL queries. You can take a look at individual dictionary wiki pages to know more.
- You can also refer to this slide deck to understand the basics.
- chemotype
- genotype
- activities (medicinal)
- phenotype - invasive species integration - how these fit together - an atlas
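Since most of the dictionaries listed above are created from Wikidata SPARQL queries, a small sketch of that step may help. It builds a request URL for the Wikidata Query Service and turns a SPARQL JSON result into an identifier-to-label mapping. The query (instances of "sovereign state", Q3624078, as a stand-in for a countries dictionary) and the helper names are assumptions for illustration; the project's real queries live on the individual dictionary pages.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: items that are an instance of (P31)
# sovereign state (Q3624078), with English labels.
COUNTRY_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q3624078 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""


def build_sparql_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def bindings_to_terms(result):
    """Map Wikidata IDs to labels from a SPARQL JSON result dict."""
    terms = {}
    for row in result["results"]["bindings"]:
        # The item value is a URI ending in the Q-identifier.
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        terms[qid] = row["itemLabel"]["value"]
    return terms
```

The dictionary returned by `bindings_to_terms` is only an in-memory stand-in; the actual dictionary file format used by the project is not shown here.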
Python is essential to run all of our software. Ensure you've installed it before proceeding further.
2.1.1. pygetpapers
(https://github.com/petermr/pygetpapers)
Run the following command on your command line to install pygetpapers (GitHub no longer serves the unauthenticated git:// protocol, so the https:// form is used):
pip install git+https://github.com/petermr/pygetpapers
If you have trouble installing using this method, you can find alternatives here.
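After installing, it can be useful to confirm that Python can actually see the package. The following is a minimal check, assuming that the package is importable as `pygetpapers` and that it installs a command of the same name; both names are assumptions to adjust if your installation differs.

```python
import importlib.util
import shutil


def check_install(package="pygetpapers", command="pygetpapers"):
    """Report whether a package is importable and its command is on PATH."""
    return {
        "importable": importlib.util.find_spec(package) is not None,
        "on_path": shutil.which(command) is not None,
    }


if __name__ == "__main__":
    status = check_install()
    for name, ok in status.items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If either line prints "no", the installation probably went into a different Python environment from the one you are running.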
2.1.2. ami_gui.py
git clone https://github.com/petermr/openDiagram.git
- Though ami_gui.py runs on the command line, you will have to make some changes to the source code to point the software to where the projects outlined below lie on your local machine. PyCharm is recommended for editing the source code.
The project has gradually expanded and branched out into different research areas, so our work is dispersed across several repositories. These repositories hold the latest dictionaries, mini-corpora and software. To run ami_gui.py, you will have to clone (i.e., download to your local machine) the following repositories:
- openVirus (https://github.com/petermr/openVirus.git)
- dictionary (https://github.com/petermr/dictionary.git)
- CEVOpen (https://github.com/petermr/CEVOpen.git)
- openDiagram (https://github.com/petermr/openDiagram.git)
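Cloning the four repositories listed above one at a time is easy to get wrong, so here is a small sketch that prepares the `git clone` commands in one place. The helper names and the dry-run default are assumptions for illustration; it presumes `git` is installed and clones into the current directory.

```python
import subprocess

# Repository URLs as listed on this page.
REPOSITORIES = [
    "https://github.com/petermr/openVirus.git",
    "https://github.com/petermr/dictionary.git",
    "https://github.com/petermr/CEVOpen.git",
    "https://github.com/petermr/openDiagram.git",
]


def clone_commands(urls):
    """Return one `git clone` argument list per repository URL."""
    return [["git", "clone", url] for url in urls]


def clone_all(urls, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in clone_commands(urls):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises if git exits with an error.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    clone_all(REPOSITORIES)  # dry run: only prints the commands
```

Calling `clone_all(REPOSITORIES, dry_run=False)` from the directory where you keep your projects performs the actual clones.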
To build a multilingual semantic Atlas of Volatile Phytochemistry.[1]
To build Open Source multiplatform tools which can discover, aggregate, clean, and semantify scholarly documents containing significant amounts of phytochemical VOCs[2]. Documents will describe the extraction and assay of oils, optionally with properties and activities.
Phytochemistry is the key component of this project and in the main, we will be analysing:
- compounds (mainly VOC). Includes synonyms, structures, images
- plants that create VOC/essential oils, again many synonyms, includes images
- locations where the plant was harvested
- activities reported for the oils
- organizations involved
We will be analysing corpora for instances of the above, first manually, to validate the process, and then automatically.
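To make "instances of the above" concrete, the sketch below counts whole-word, case-insensitive matches of dictionary terms in a piece of text. It is a simplified stand-in, not how pyami annotates papers; the function name, the example sentence and the term list are all hypothetical.

```python
import re
from collections import Counter


def count_terms(text, terms):
    """Count whole-word, case-insensitive matches of each term."""
    counts = Counter()
    for term in terms:
        # re.escape keeps punctuation in a term from acting as regex syntax.
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[term] = hits
    return counts


if __name__ == "__main__":
    sentence = (
        "Limonene and linalool were the major compounds; "
        "limonene dominated the leaf oil."
    )
    print(count_terms(sentence, ["limonene", "linalool", "leaf", "flower"]))
```

A manual pass over the same sentences is what lets such automatic counts be validated before they are trusted on a whole corpus.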
- APIs for repositories such as EPMC, bioRxiv preprints, and thesis collections.
- Scrapers for semi-structured sites such as journals
- standardised metadata (e.g. JATS)
- PDF and HTML readers => XML or JSON
- article sectioning (e.g. into JATS categories)
- extraction of floats (tables, maps, images, diagrams, chemistry, maths*)
- display and navigation of sections in a paper
- aggregated statistics and machine learning
- multilingual annotation (using Wikidata)
- linking to the Wikidata knowledge graph
[*] not included in CEVOpen but extensible in future
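As an illustration of the article-sectioning item in the list above, the following sketch lists the sections of a JATS-style XML document using Python's standard library. The sample document is invented and heavily simplified, and the helper name is hypothetical; real JATS files carry far more structure, and the project's own sectioning code may work differently.

```python
import xml.etree.ElementTree as ET

# Invented, minimal JATS-like article for demonstration only.
SAMPLE_JATS = """<article>
  <body>
    <sec sec-type="intro"><title>Introduction</title><p>Text.</p></sec>
    <sec sec-type="methods"><title>Materials and Methods</title><p>Text.</p></sec>
    <sec><title>Acknowledged Limitations</title><p>Text.</p></sec>
  </body>
</article>"""


def list_sections(jats_xml):
    """Return (sec-type, title) pairs for every <sec> element."""
    root = ET.fromstring(jats_xml)
    sections = []
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="").strip()
        # Sections without a sec-type attribute are reported as "unknown".
        sections.append((sec.get("sec-type", "unknown"), title))
    return sections


if __name__ == "__main__":
    for sec_type, title in list_sections(SAMPLE_JATS):
        print(f"{sec_type}: {title}")
```

Grouping text by `sec-type` in this way is what allows later steps, such as dictionary annotation, to be restricted to a chosen part of a paper.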
[1] we need an engaging title. "Atlas" is often extended beyond maps (e.g. Atlas of The Human Body). For example, plantPart is an atlas of the plant. It works for me but may confuse others. Here are some ideas:
- "Compendium of ..."
- "Semantic Essence of phytochemistry". Essence == central meaning, and also volatiles
But please think creatively.
[2] Volatile Organic Compound
- Coordination of EO-related and general dictionaries - conformance to a common standard.
- Validation of gold-standard minicorpora (e.g. for training and validating machine learning)
- If you are interested in contributing to the project on the Machine Learning front, you can take a look at the Our-Project-and-Machine-Learning page.
We've presented our work (mostly openVirus) at various venues, including WikiCite, COAR and BarCamp; see our Outreach page. If you're new to the project, our presentations are probably the best way to start understanding the pipeline.
All interns, volunteers and contributors should adhere to the code of conduct, outlined here. Basically, it says "be respectful and helpful towards everyone".