- TODO: what is the latest update on this?
- "we propose five strategies to address fundamental issues in the annotation of sequence databases: (i) to clearly separate experimentally verified and unverified sequence entries; (ii) to enable a system for tracing the origins of annotations; (iii) to separate entries with high‐quality, informative annotation from less useful ones; (iv) to integrate automated quality‐control software whenever such tools exist; and (v) to facilitate postsubmission editing of annotations and metadata associated with sequences. "
- TODO: citations
- "The problem in the reliability of the data is the possibility of misannotations. The misannotations are some time introduced due to the process of automation of annotation process which are carried out extensively with the help of computers. Misannotations, if introduced, multiplies in subsequent additions and may accumulate to an unbelievable extent and create confusion. A possible solution to prevent this from happening is to flag the protein sequence which has been annotated by sequence comparison but whose function has not been validated by experimental methods. "
Automated Detection of Records in Biological Sequence Databases that are Inconsistent with the Literature
- Then, we use these quality indicators to train an anomaly detection algorithm to classify records as “confident” or “suspicious”.
- Broadly speaking, a data cleaning process is composed of a detection step in which the existing conflicts are identified, and a resolution step in which the inconsistencies are solved.
- Data Cleaning: Problems and Existing Solutions: conflicts in data model, etc
- difference between equality and similarity
- Dataset:
- implementation link?
- contributions:
- "Table pattern definition and discovery."
- experiments?:
- With the assistance of the PubTator text-mining tool, we tagged more than 10 000 articles to assess the ratio of papers relevant for curation.
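The detection and resolution steps noted above, together with the equality versus similarity distinction, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the example values, the 0.85 threshold, the helper names, and the majority-vote resolution are not taken from the papers in these notes.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Similarity test: case-insensitive string ratio at or above the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def detect_conflicts(values, threshold=0.85):
    """Detection step: pairs of distinct values that are similar but not equal."""
    distinct = list(dict.fromkeys(values))  # keep first-seen order, drop repeats
    conflicts = []
    for i in range(len(distinct)):
        for j in range(i + 1, len(distinct)):
            if similar(distinct[i], distinct[j], threshold):
                conflicts.append((distinct[i], distinct[j]))
    return conflicts


def resolve(values, threshold=0.85):
    """Resolution step (assumed policy): map each value to its most frequent similar value."""
    counts = Counter(values)
    return [
        max((c for c in counts if similar(v, c, threshold)), key=counts.get)
        for v in values
    ]


# Hypothetical annotation strings; the third one contains a typo.
annotations = ["DNA polymerase", "DNA polymerase", "DNA polymerse", "ATP synthase"]
conflicts = detect_conflicts(annotations)
cleaned = resolve(annotations)
```

An equality check alone would treat "DNA polymerse" as a separate, valid value. The similarity test flags it as a likely variant of "DNA polymerase", which the resolution step then uses to replace it.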
Metaxa2 Database Builder: enabling taxonomic identification from metagenomic or metabarcoding data using any genetic marker
- "correct taxonomic identification of DNA sequences is central to studies of biodiversity using both shotgun metagenomic and metabarcoding approaches. "
- Despite improvements in bioinformatics methods, millions of sequences in databanks are not assigned reliable functions.
- The curation of protein functions in the context of biological processes is a way to evaluate and improve their annotation.
- BioPortal: http://bioportal.bioontology.org/
- This paper proposes a greatly improved translation-based graph embedding method that helps ontology population by way of relation prediction.
- Extensive experiments on four data sets show promising capability of On2Vec on predicting and verifying relation facts.
- TODO: Energy function
- implementation: https://github.com/bio-ontology-research-group/phenogocon
- Problem: Despite improvements in bioinformatics methods, millions of sequences in databanks are not assigned reliable functions.
- Use Apache Jena
- TODO: reproduce the results: https://github.com/bio-ontology-research-group/uniprot2owl
- "Expressing functions in biomedical ontologies currently uses formal representation patterns that renders basic reasoning tasks to fall in complexity classes beyond polynomial time, thereby limiting the potential of using knowledge-based methods for data integration, querying or quality control."
- future works?
- limitations?
- TODO: look at citations
- GitHub repo: https://github.com/bio-ontology-research-group/Onto2Graph
- Ontology is a DAG
- application: semantic similarity
- "While the taxonomy of an ontology can be inferred directly from the axioms of an ontology as one of the standard OWL reasoning tasks, creating general graph structures from OWL ontologies that exploit the ontologies’ semantic content remains a challenge."
- Ontologies are widely applied in biology and biomedicine for annotation and integration of data [3].
- OWL is a formal language based on Description Logics [8] and offers a formal, model-theoretic semantics.
- create an Abstract network from OWL
- implemented in Python 2.7
- use for modularization or partitioning of Ontology
- website link
- Repository
- contributions: "We propose the Ontological Pathfinding algorithm (OP) that scales to web-scale knowledge bases via a series of parallelization and optimization techniques: a relational knowledge base model to apply inference rules in batches, a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm to break the mining tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. "
- rule mining on top of Spark
- Our primary focus of this paper is scalable mining.
- we model a knowledge base as a collection of (subject, predicate, object) facts Γ = {(s, p, o)}
- Each predicate specifies relationships among its subjects and objects. We call each relationship a fact. For example, the fact (Barack Obama, wasBornIn, USA) specifies that Barack Obama was born in the USA.
- Limitations?
- inference rules in batches using join queries
- Future works?
- challenges of embedding: scalability, choice of dimensionality, and features to be preserved
- GitHub repo: https://github.com/hamid58b/GEM
- One application of graph analysis is clustering
- TODO: related works on pubmed: https://www.ncbi.nlm.nih.gov/pubmed/29584811
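The knowledge base model noted above for Ontological Pathfinding, a set of (subject, predicate, object) facts with inference rules applied in batches through join queries, can be sketched in plain Python. This is only an illustration under stated assumptions: the `isLocatedIn` predicate, the extra facts, the rule itself, and the helper `apply_rule` are hypothetical, and the paper's actual implementation runs on Spark rather than in-memory sets.

```python
from collections import defaultdict

# A knowledge base as a set of (subject, predicate, object) facts.
# Facts other than the Obama example are made up for illustration.
facts = {
    ("Barack Obama", "wasBornIn", "Honolulu"),
    ("Honolulu", "isLocatedIn", "USA"),
    ("Ada Lovelace", "wasBornIn", "London"),
    ("London", "isLocatedIn", "UK"),
}


def apply_rule(facts, body1, body2, head):
    """Apply body1(x, y) AND body2(y, z) => head(x, z) to all facts in one batch.

    The rule body is evaluated as a hash join on the shared variable y.
    """
    # Index the second body atom by its subject, which is the join key y.
    by_subject = defaultdict(set)
    for s, p, o in facts:
        if p == body2:
            by_subject[s].add(o)

    inferred = set()
    for s, p, o in facts:
        if p == body1:
            for z in by_subject.get(o, ()):
                inferred.add((s, head, z))
    return inferred - facts  # keep only facts that are new


# Hypothetical rule: wasBornIn(x, y) AND isLocatedIn(y, z) => wasBornIn(x, z)
new_facts = apply_rule(facts, "wasBornIn", "isLocatedIn", "wasBornIn")
```

Here the join derives (Barack Obama, wasBornIn, USA) for every matching pair of facts at once, rather than checking the rule one subject at a time.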
MetaMLP: A fast word embedding based classifier to profile target gene databases in metagenomic samples
- Source code: https://bitbucket.org/dimkart/ms-lstm/src/master/
Linking entities through an ontology using word embeddings and syntactic re-ranking, BMC Bioinformatics 2019
- unsupervised learning
- Linking each identified entity mention in text to an ontology/dictionary concept is an essential task to make sense of the identified entities.
- This paper presents an unsupervised approach for the linking of named entities to concepts in an ontology/dictionary.
- "Word embedding models, which learn distributed representations of words from large unlabeled corpora, are promising approaches for capturing semantic information [34]."
- [3, 8, 35, 42] applying word embedding in biomedical domain
- Online link: https://eztag.bioqrator.org/
- "To support annotating a wide variety of biological concepts with or without pre-existing training data, we developed ezTag, a web-based annotation tool that allows curators to perform annotation and provide training data with humans in the loop. ezTag supports both abstracts in PubMed and full-text articles in PubMed Central."
- PubTator Central (https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for viewing and retrieving bioconcept annotations in full text biomedical articles.
- A query processing engine that decomposes an input query to search, list, and infer on the Vectorized Knowledge Graph VKG structure
Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations
- implementation: https://github.com/bio-ontology-research-group/opa2vec/
OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction
- list of projects: http://snap.stanford.edu/projects.html
- Deep Learning for Network Biology
- weblog: node2vec: Embeddings for Graph Data
- awesome-network-embedding
- Python 3 implementations: https://github.com/eliorc/node2vec ✔️
- Implementation: https://github.com/SmartDataAnalytics/PyKEEN
- we focus on data storage techniques, indexing strategies, and query execution mechanisms.
- 5.2 Hadoop-Based RDF Systems
- "For rules learned from incomplete Knowledge Graphs (KGs), confidence and other measures may be misleading, as they do not reflect the patterns in the missing facts. This might lead to the extraction of erroneous rules from incomplete and biased KGs."
- "Such techniques enable the identification of a module of an ontology, i.e., a subset of the ontology’s axioms which then can be used for querying and classifying in, usually, less time than the time that would be required for querying or classifying the whole ontology."
- limitation of these techniques: modularization techniques that reduce the signature do not enable queries of the whole ontology
- provide RDF as a representation format as well as a public SPARQL endpoint to facilitate querying.
- Automatic annotations by UniProt
- UniRule
- SAAS
- GitHub repo: GOATOOLS: A Python library for Gene Ontology analyses
- Process the obo-formatted file from Gene Ontology website. The data structure is a directed acyclic graph (DAG) that allows easy traversal from leaf to root.
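The OBO-to-DAG processing noted above can be sketched with the standard library alone. This is not the GOATOOLS API: the parser, the function names, and the `EX:` term IDs are hypothetical, and only `[Term]` stanzas with `id:` and `is_a:` lines are handled.

```python
from collections import defaultdict

# Tiny hand-written OBO fragment with made-up IDs (not real GO terms).
OBO_TEXT = """\
[Term]
id: EX:0001
name: root process

[Term]
id: EX:0002
name: middle process
is_a: EX:0001 ! root process

[Term]
id: EX:0003
name: leaf process
is_a: EX:0002 ! middle process
is_a: EX:0001 ! root process
"""


def parse_parents(obo_text):
    """Return {term_id: set(parent_ids)} built from the is_a lines of [Term] stanzas."""
    parents = defaultdict(set)
    in_term = False
    current = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            in_term = line == "[Term]"  # ignore other stanza types
            current = None
        elif in_term and line.startswith("id: "):
            current = line[len("id: "):]
            parents[current]  # register the term even if it has no parents
        elif in_term and current is not None and line.startswith("is_a: "):
            # Drop the trailing "! name" comment after the parent ID.
            parents[current].add(line[len("is_a: "):].split("!")[0].strip())
    return dict(parents)


def ancestors(term, parents):
    """Walk is_a edges upward from a term and collect every ancestor."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


parents = parse_parents(OBO_TEXT)
leaf_ancestors = ancestors("EX:0003", parents)
```

Because a term can have several `is_a` parents, the structure is a DAG rather than a tree. The `seen` set keeps the traversal from visiting a shared ancestor twice.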
Apache Jena https://jena.apache.org/index.html
- Q: Could it be run on top of Hadoop?
- A: https://jena.apache.org/documentation/hadoop/
- https://jena.apache.org/documentation/ontology/
- An Introduction to RDF and the Jena RDF API
- Apache Jena - Examples
RDF Data Storage Techniques for Efficient SPARQL Query Processing Using Distributed Computation Engines
- Hadoop has been adopted for RDF data management systems.
- we introduce distributed RDF data stores, namely VPExp and 3CStore, based on the existing vertical partitioning (VP) approach
- we propose an incremental reasoning algorithm which can effectively avoid re-reasoning over the entire knowledge graph
-
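The vertical partitioning (VP) approach mentioned above stores RDF triples as one two-column (subject, object) table per predicate. A minimal in-memory sketch follows. It illustrates generic VP only, not VPExp or 3CStore, and the `ex:` identifiers and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical RDF triples (subject, predicate, object).
triples = [
    ("ex:protein1", "rdf:type", "ex:Protein"),
    ("ex:protein1", "ex:hasFunction", "ex:dnaBinding"),
    ("ex:protein2", "rdf:type", "ex:Protein"),
    ("ex:protein2", "ex:hasFunction", "ex:atpBinding"),
]


def vertical_partition(triples):
    """Group triples into one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def match(tables, predicate, obj=None):
    """Answer the pattern (?s, predicate, obj) by scanning only that predicate's table."""
    return [s for s, o in tables.get(predicate, []) if obj is None or o == obj]


tables = vertical_partition(triples)
proteins = match(tables, "rdf:type", "ex:Protein")
```

A triple pattern with a bound predicate reads only the matching table, so it avoids scanning the whole triple set. In a distributed setting each table would typically be stored as a separate file or partition.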
CD-HIT clustering
-
TODO: **re-implement this approach: what computational and heuristic approaches could improve this work?**
-
Future work?
-
ISU Biological Ontologies
- Applying Ontology
- Update ontologies