Related Works

Mis-annotations in the public databases

TODO: what is the latest update on this?

Strategies to improve usability and preserve accuracy in biological sequence databases

"we propose five strategies to address fundamental issues in the annotation of sequence databases: (i) to clearly separate experimentally verified and unverified sequence entries; (ii) to enable a system for tracing the origins of annotations; (iii) to separate entries with high‐quality, informative annotation from less useful ones; (iv) to integrate automated quality‐control software whenever such tools exist; and (v) to facilitate postsubmission editing of annotations and metadata associated with sequences. "
TODO: citations

ISU PhD theses '13 on misannotations

Discovering meaning from biological sequences: focus on predicting misannotated proteins, binding patterns, and G4-quadruplex secondary
Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach

Biological Databases and Protein Sequence Analysis

"The problem in the reliability of the data is the possibility of misannotations. The misannotations are some time introduced due to the process of automation of annotation process which are carried out extensively with the help of computers. Misannotations, if introduced, multiplies in subsequent additions and may accumulate to an unbelievable extent and create confusion. A possible solution to prevent this from happening is to flag the protein sequence which has been annotated by sequence comparison but whose function has not been validated by experimental methods. "

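The flagging idea in the quote above can be sketched in a few lines. This is a minimal sketch, assuming each annotation carries a Gene Ontology evidence code; the record layout (`protein`, `function`, `evidence`) and the function name are hypothetical, and the set of experimental codes is not exhaustive.

```python
# GO evidence codes that denote direct experimental support (not exhaustive).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def flag_unverified(annotations):
    """Return copies of the records with a boolean 'verified' flag added.

    `annotations` is a list of dicts with the hypothetical keys
    'protein', 'function', and 'evidence'.
    """
    flagged = []
    for record in annotations:
        verified = record["evidence"] in EXPERIMENTAL_CODES
        flagged.append({**record, "verified": verified})
    return flagged


records = [
    {"protein": "P1", "function": "kinase activity", "evidence": "IDA"},
    {"protein": "P2", "function": "kinase activity", "evidence": "IEA"},
]
flagged = flag_unverified(records)
```

Here `P2` is annotated electronically (`IEA`), so it is the kind of entry the quote suggests flagging until it is validated experimentally.
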
Automated Detection of Records in Biological Sequence Databases that are Inconsistent with the Literature

Then, we use these quality indicators to train an anomaly detection algorithm to classify records as “confident” or “suspicious”.

Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies

Data cleaning/curation

Ontology-Based Data Cleaning

Broadly speaking, a data cleaning process is composed of a detection step in which the existing conflicts are identified, and a resolution step in which the inconsistencies are solved.
Data Cleaning: Problems and Existing Solutions: conflicts in data model, etc
Difference between equality and similarity when comparing values
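
The detection and resolution steps, and the equality versus similarity distinction, can be illustrated together. This is a minimal sketch under stated assumptions: records are hypothetical `(key, value)` pairs, similarity is approximated with `difflib.SequenceMatcher`, and resolution uses a simple majority vote. None of this is taken from the cited papers.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Similarity match: tolerate case and small spelling differences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def detect_conflicts(records):
    """Detection step: keys with a value that is not similar to the first one."""
    by_key = defaultdict(list)
    for key, value in records:
        by_key[key].append(value)
    conflicts = {}
    for key, values in by_key.items():
        if any(not similar(values[0], v) for v in values[1:]):
            conflicts[key] = values
    return conflicts


def resolve(conflicts):
    """Resolution step: keep the most frequent value for each key."""
    return {k: Counter(v).most_common(1)[0][0] for k, v in conflicts.items()}


records = [
    ("P1", "protein kinase"),
    ("P1", "Protein kinase"),  # unequal strings, but similar
    ("P2", "hydrolase"),
    ("P2", "transferase"),     # a real conflict
    ("P2", "hydrolase"),
]
conflicts = detect_conflicts(records)
resolved = resolve(conflicts)
```

With strict equality, `P1` would be reported as a conflict; with the similarity test only `P2` is.
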

KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing

Dataset:
implementation link?
contributions:

"Table pattern definition and discovery."

experiments?:

On expert curation and scalability: UniProtKB/Swiss-Prot as a case study

With the assistance of the PubTator text-mining tool, we tagged more than 10 000 articles to assess the ratio of papers relevant for curation.

Metaxa2 Database Builder: enabling taxonomic identification from metagenomic or metabarcoding data using any genetic marker

"correct taxonomic identification of DNA sequences is central to studies of biodiversity using both shotgun metagenomic and metabarcoding approaches. "

GROOLS: reactive graph reasoning for genome annotation through biological processes

Despite improvements in bioinformatics methods, millions of sequences in databanks are not assigned reliable functions.
The curation of protein functions in the context of biological processes is a way to evaluate and improve their annotation.

Data validation/database integration

Ontologies:

BioPortal: http://bioportal.bioontology.org/

On2Vec: Embedding-based Relation Prediction for Ontology Population

This paper proposes a greatly improved translation-based graph embedding method that helps ontology population by way of relation prediction.
Extensive experiments on four data sets show promising capability of On2Vec on predicting and verifying relation facts.
TODO: Energy function
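
As a starting point for the TODO above, the basic translation-based energy (the TransE form, where a fact `(h, r, t)` is plausible when `h + r` lies close to `t`) can be sketched as below. This is an assumption-labelled baseline only: it is not On2Vec's own energy function, which still needs to be looked up in the paper. The vectors are made-up toy values.

```python
import math


def translation_energy(head, relation, tail):
    """L2 norm of (head + relation - tail); lower means more plausible."""
    return math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


# Toy 2-dimensional embeddings (hypothetical values).
head = [1.0, 0.0]
relation = [0.0, 1.0]
good_tail = [1.0, 1.0]  # exactly head + relation
bad_tail = [4.0, 5.0]

good_energy = translation_energy(head, relation, good_tail)
bad_energy = translation_energy(head, relation, bad_tail)
```
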

Detection of Relation Assertion Errors in Knowledge Graphs

Enhanced ontology-based indexing and searching

Ontology-based validation and identification of regulatory phenotypes

implementation: https://github.com/bio-ontology-research-group/phenogocon

ONTOFUSION: ontology-based integration of genomic and clinical databases.

GROOLS: reactive graph reasoning for genome annotation through biological processes

Problem: Despite improvements in bioinformatics methods, millions of sequences in databanks are not assigned reliable functions.

MOEKA - Memoranda in Ontological Engineering and Knowledge Analytics

Python scripts to find enrichment of GO terms
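
GO term enrichment is commonly tested with a one-sided hypergeometric test. The sketch below shows that calculation with the standard library only; it is an illustration of the general technique, not the code of the scripts referenced above, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample).

    population: number of genes in the background set
    annotated:  background genes annotated with the GO term
    sample:     number of genes in the study set
    hits:       study genes annotated with the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# All 3 study genes carry a term that only 3 of 10 background genes have.
p_value = enrichment_p_value(population=10, annotated=3, sample=3, hits=3)
```

In practice the p-values for many terms would also need a multiple-testing correction, which this sketch leaves out.
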

Large-scale reasoning over functions in biomedical ontologies

Use Apache Jena
TODO: reproduce the results: https://github.com/bio-ontology-research-group/uniprot2owl
"Expressing functions in biomedical ontologies currently uses formal representation patterns that renders basic reasoning tasks to fall in complexity classes beyond polynomial time, thereby limiting the potential of using knowledge-based methods for data integration, querying or quality control."
future works?
limitations?

Inferring ontology graph structures using OWL reasoning

TODO: look at citations
Github repo: https://github.com/bio-ontology-research-group/Onto2Graph
Ontology is a DAG
application: semantic similarity
"While the taxonomy of an ontology can be inferred directly from the axioms of an ontology as one of the standard OWL reasoning tasks, creating general graph structures from OWL ontologies that exploit the ontologies’ semantic content remains a challenge."
Ontologies are widely applied in biology and biomedicine for annotation and integration of data [3].
OWL is a formal language based on Description Logics [8] and offers a formal, model-theoretic semantics.
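
The "ontology is a DAG" and "semantic similarity" notes above can be made concrete with a small example. This is a minimal sketch, assuming the ontology has already been reduced to a child-to-parents mapping; the term names are a toy hierarchy, and Jaccard overlap of ancestor sets is just one simple similarity measure, not the one used by Onto2Graph.

```python
def ancestors(term, parents):
    """All terms reachable from `term` via parent edges, including itself."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def jaccard_similarity(term_a, term_b, parents):
    """Shared ancestors divided by all ancestors of the two terms."""
    anc_a = ancestors(term_a, parents)
    anc_b = ancestors(term_b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


# Toy child -> parents mapping (hypothetical terms).
parents = {
    "protein kinase": ["kinase"],
    "kinase": ["transferase"],
    "phosphatase": ["hydrolase"],
    "transferase": ["enzyme"],
    "hydrolase": ["enzyme"],
}

close = jaccard_similarity("kinase", "protein kinase", parents)
distant = jaccard_similarity("kinase", "phosphatase", parents)
```
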

OWL-NETS: Transforming OWL Representations for Improved Network Inference

Creates an abstract network from an OWL representation
Implemented in Python 2.7
Could be used for modularization or partitioning of an ontology

Ontological Pathfinding (OP)

website link
Repository
contributions: "We propose the Ontological Pathfinding algorithm (OP) that scales to web-scale knowledge bases via a series of parallelization and optimization techniques: a relational knowledge base model to apply inference rules in batches, a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm to break the mining tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. "
rule mining on top of Spark
Our primary focus of this paper is scalable mining.
we model a knowledge base as a collection of (subject, predicate, object) facts Γ = {(s, p, o)}
- Each predicate specifies relationships among its subjects and objects. We call each relationship a fact. For example, the fact (Barack Obama, wasBornIn, USA) specifies that Barack Obama was born in the USA.
Limitations?

inference rules in batches using join queries
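
The idea of applying an inference rule to a whole batch of `(s, p, o)` facts with a join can be sketched in plain Python. This is a minimal single-machine sketch, not the OP algorithm or its Spark implementation; the rule shape `body1(x, y) AND body2(y, z) -> head(x, z)` and the predicates `wasBornInCity` and `isLocatedIn` are assumptions chosen to reproduce the `wasBornIn` example above.

```python
from collections import defaultdict


def apply_rule(facts, body1, body2, head):
    """Apply body1(x, y) AND body2(y, z) -> head(x, z) to all facts at once."""
    # Index body2 facts by subject, like the build side of a hash join.
    index = defaultdict(set)
    for s, p, o in facts:
        if p == body2:
            index[s].add(o)
    # Probe the index with every body1 fact.
    inferred = set()
    for s, p, o in facts:
        if p == body1:
            for z in index.get(o, ()):
                inferred.add((s, head, z))
    # Report only facts that were not already known.
    return inferred - set(facts)


facts = {
    ("Barack Obama", "wasBornInCity", "Honolulu"),
    ("Honolulu", "isLocatedIn", "USA"),
}
new_facts = apply_rule(facts, "wasBornInCity", "isLocatedIn", "wasBornIn")
```
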

Future works?

node2vec, word embedding, sequence embedding

Graph embedding techniques, applications, and performance: A survey

challenges of embedding: scalability, choice of dimensionality, and features to be preserved
Github repo: https://github.com/hamid58b/GEM
One application of graph analysis is clustering

Sequence Embedding for Clustering and Classification

https://github.com/cran2367/sgt

Learned protein embeddings for machine learning

TODO: related works on pubmed: https://www.ncbi.nlm.nih.gov/pubmed/29584811

Learning protein sequence embeddings using information from structure

Source code: https://github.com/tbepler/protein-sequence-embedding-iclr2019

MetaMLP: A fast word embedding based classifier to profile target gene databases in metagenomic samples

Source code: https://bitbucket.org/gaarangoa/metamlp/src/master/src/

Translate words to Ontology

Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs

Source code: https://bitbucket.org/dimkart/ms-lstm/src/master/

Linking entities through an ontology using word embeddings and syntactic re-ranking, BMC Bioinformatics 2019

unsupervised learning
Linking each identified entity mention in text to an ontology/dictionary concept is an essential task to make sense of the identified entities.
This paper presents an unsupervised approach for the linking of named entities to concepts in an ontology/dictionary.
"Word embedding models, which learn distributed representations of words from large unlabeled corpora, are promising approaches for capturing semantic information [34]."

related works:

[3, 8, 35, 42] applying word embedding in biomedical domain

ezTag: tagging biomedical concepts via interactive learning

Online link: https://eztag.bioqrator.org/
"To support annotating a wide variety of biological concepts with or without pre-existing training data, we developed ezTag, a web-based annotation tool that allows curators to perform annotation and provide training data with humans in the loop. ezTag supports both abstracts in PubMed and full-text articles in PubMed Central."

PubTator central: automated concept annotation for biomedical full text articles

PubTator Central (https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for viewing and retrieving bioconcept annotations in full text biomedical articles.

Thinking, Fast and Slow: Combining Vector Spaces and Knowledge Graphs

A query processing engine that decomposes an input query to search, list, and infer on the Vectorized Knowledge Graph VKG structure

Short Text Similarity with Word Embeddings

Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations

implementation: https://github.com/bio-ontology-research-group/opa2vec/

OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction

node2vec

list of projects: http://snap.stanford.edu/projects.html
Deep Learning for Network Biology
weblog: node2vec: Embeddings for Graph Data
awesome-network-embedding
Python 3 implementations: https://github.com/eliorc/node2vec ✔️

Network embedding in biomedical data science

BioKEEN: a library for learning and evaluating biological knowledge graph embeddings

Implementation : https://github.com/SmartDataAnalytics/PyKEEN

Text similarity

Text Similarities : Estimate the degree of similarity between two texts
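
One simple way to estimate the degree of similarity between two texts is cosine similarity over bag-of-words counts. The sketch below is a baseline illustration using only the standard library; it is an assumption about what a first experiment might look like, not the method of the linked article, and it ignores word order and synonyms.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


same = cosine_similarity("protein kinase activity", "Protein kinase activity")
partial = cosine_similarity("protein kinase", "protein phosphatase")
unrelated = cosine_similarity("protein kinase", "hydrolase activity")
```
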

Review Papers:

RDF Data Storage and Query Processing Schemes: A Survey

we focus on data storage techniques, indexing strategies, and query execution mechanisms.
5.2 Hadoop-Based RDF Systems

Rule Induction and Reasoning over Knowledge Graphs

"For rules learned from incomplete KGs (Knowledge Graphs), confidence and other measures may be misleading, as they do not reflect the patterns in the missing facts. This might lead to the extraction of erroneous rules from incomplete and biased KGs."

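The point of the quote can be seen with a toy calculation. This is a minimal sketch, assuming the usual closed-world definition of rule confidence (matches of body and head divided by matches of the body) for a hypothetical rule `livesIn(x, y) -> worksIn(x, y)`; the facts are invented for illustration.

```python
def rule_confidence(facts, body, head):
    """Confidence of body(x, y) -> head(x, y) under a closed-world reading."""
    body_pairs = {(s, o) for s, p, o in facts if p == body}
    head_pairs = {(s, o) for s, p, o in facts if p == head}
    if not body_pairs:
        return 0.0
    return len(body_pairs & head_pairs) / len(body_pairs)


people = ["ann", "bob", "cat", "dan"]
complete_kg = {(p, "livesIn", "ames") for p in people} | {
    (p, "worksIn", "ames") for p in people
}
# The same world, but two worksIn facts were never recorded.
incomplete_kg = complete_kg - {
    ("cat", "worksIn", "ames"),
    ("dan", "worksIn", "ames"),
}

true_confidence = rule_confidence(complete_kg, "livesIn", "worksIn")
observed_confidence = rule_confidence(incomplete_kg, "livesIn", "worksIn")
```

The rule holds for everyone in the complete graph, yet the observed confidence drops to one half purely because of missing facts.
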
Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods

Modularization

OWL Reasoning: Subsumption Test Hardness and Modularity

Modular reuse of ontologies: Theory and practice

"Such techniques enable the identification of a module of an ontology, i.e., a subset of the ontology’s axioms which then can be used for querying and classifying in, usually, less time than the time that would be required for querying or classifying the whole ontology."
Limitation of these techniques: modularization techniques that reduce the signature do not enable queries over the whole ontology.

UniProt

Provides RDF as a representation format, as well as a public SPARQL endpoint to facilitate querying.
Automatic annotations by UniProt
- UniRule
- SAAS
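
Querying the public SPARQL endpoint mentioned above could look like the sketch below. It assumes the endpoint URL `https://sparql.uniprot.org/sparql` and the `up:` core namespace; both should be checked against the UniProt documentation before relying on them. Only the standard library is used, and `run_query` needs network access.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URL; verify against the UniProt SPARQL documentation.
ENDPOINT = "https://sparql.uniprot.org/sparql"

QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein
WHERE { ?protein a up:Protein . }
LIMIT 5
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query):
    """Send the query and return the list of result bindings."""
    with urllib.request.urlopen(build_request(query)) as response:
        return json.load(response)["results"]["bindings"]


request = build_request(QUERY)
```

Calling `run_query(QUERY)` would return up to five bindings, each a dict keyed by the variable name `protein`.
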

Tools and implementations

GOATOOLS: A Python library for Gene Ontology analyses

GitHub repo: GOATOOLS: A Python library for Gene Ontology analyses
Processes the OBO-formatted file from the Gene Ontology website. The data structure is a directed acyclic graph (DAG) that allows easy traversal from leaf to root.
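
To show what "traversal from leaf to root" over an OBO file means, here is a minimal standard-library sketch. It deliberately does not use the GOATOOLS API: it reads only `id:` and `is_a:` lines of `[Term]` stanzas, and the identifiers in the sample text are invented.

```python
def parse_obo(text):
    """Map each term id to the list of its is_a parent ids."""
    parents = {}
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id: "):
            current = line[4:].strip()
            parents.setdefault(current, [])
        elif in_term and current and line.startswith("is_a: "):
            # Drop the trailing "! name" comment after the parent id.
            parents[current].append(line[6:].split("!")[0].strip())
    return parents


def path_to_root(term, parents):
    """Follow the first is_a parent of each term until a root is reached."""
    path = [term]
    while parents.get(path[-1]):
        path.append(parents[path[-1]][0])
    return path


SAMPLE_OBO = """
[Term]
id: EX:0000001
name: root process

[Term]
id: EX:0000002
name: metabolic example
is_a: EX:0000001 ! root process

[Term]
id: EX:0000003
name: leaf example
is_a: EX:0000002 ! metabolic example
"""

parents = parse_obo(SAMPLE_OBO)
path = path_to_root("EX:0000003", parents)
```
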

Apache Jena https://jena.apache.org/index.html

Q: Could it be run on top of Hadoop?
A: https://jena.apache.org/documentation/hadoop/
https://jena.apache.org/documentation/ontology/
An Introduction to RDF and the Jena RDF API
Apache Jena - Examples

Neo4j for graph Processing

SPARQL: RDF Query language.

Implementation: Semantic Search Based on Domain Ontology Using Apache Spark and Jena:

link

Implementation by MapReduce and Spark:

RDF Data Storage Techniques for Efficient SPARQL Query Processing Using Distributed Computation Engines

Hadoop has been adopted for RDF data management systems.
we introduce distributed RDF data stores, namely VPExp and 3CStore, based on the existing vertical partitioning (VP) approach
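
The vertical partitioning (VP) idea that these stores build on can be sketched in memory: triples are split into one two-column `(subject, object)` table per predicate, so a triple pattern with a bound predicate touches only one table. This is an illustration of plain VP only, not of VPExp or 3CStore, and the sample triples are invented.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Split triples into one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def match(tables, predicate, subject=None, obj=None):
    """Answer one triple pattern by scanning only that predicate's table."""
    return [
        (s, o)
        for s, o in tables.get(predicate, [])
        if (subject is None or s == subject) and (obj is None or o == obj)
    ]


triples = [
    ("P1", "hasFunction", "kinase activity"),
    ("P2", "hasFunction", "hydrolase activity"),
    ("P1", "foundIn", "E. coli"),
]
tables = vertical_partition(triples)
p1_functions = match(tables, "hasFunction", subject="P1")
```
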

Scalable OWL Ontology Reasoning using Cloud Computing

Scalable visualization for DBpedia ontology analysis using Hadoop

OWL Reasoning Framework over Big Biological Knowledge Network

SPOWL: Spark-based OWL 2 Reasoning Materialisation

Scalable distributed reasoning using MapReduce

Incremental reasoning

An Incremental Reasoning Algorithm for Large Scale Knowledge Graph

we propose an incremental reasoning algorithm which can effectively avoid re-reasoning over the entire knowledge graph
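
What "avoiding re-reasoning over the entire graph" means can be shown on the simplest case, a transitive relation such as subclass-of. The sketch below is an assumption-labelled illustration, not the algorithm from the paper: when one edge is added, only pairs that pass through the new edge are derived.

```python
def add_edge_incremental(closure, edge):
    """Update a transitive closure in place after adding one edge (a, b).

    `closure` is a set of (x, y) pairs that is already transitively closed.
    Returns only the newly derived pairs.
    """
    a, b = edge
    sources = {x for x, y in closure if y == a} | {a}  # everything reaching a
    targets = {y for x, y in closure if x == b} | {b}  # everything b reaches
    new_pairs = {(x, y) for x in sources for y in targets} - closure
    closure |= new_pairs
    return new_pairs


closure = {("A", "B")}
first = add_edge_incremental(closure, ("B", "C"))
second = add_edge_incremental(closure, ("C", "D"))
```

Each update scans the existing closure once instead of recomputing it from the raw edges.
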

Ontology reasoner

elk: https://github.com/liveontologies/elk-reasoner

Clustering

Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases

CD-HIT clustering
TODO: re-implement this approach. What computational and heuristic approaches could extend this work?
Future work?
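
As a first step towards the re-implementation TODO, the greedy incremental clustering scheme that CD-HIT is known for can be sketched as follows. This is a toy under stated assumptions: `difflib.SequenceMatcher.ratio()` stands in for alignment-based sequence identity, the threshold and sequences are invented, and none of CD-HIT's short-word filtering is reproduced.

```python
from difflib import SequenceMatcher


def greedy_cluster(sequences, threshold=0.9):
    """Greedy incremental clustering: longest sequences are processed first.

    Each sequence joins the first cluster whose representative is similar
    enough, otherwise it becomes the representative of a new cluster.
    """
    clusters = []  # list of (representative, members)
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if SequenceMatcher(None, representative, seq).ratio() >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


sequences = [
    "MKTAYIAKQRQISFVK",
    "MKTAYIAKQRQISFVR",  # differs from the first in one residue
    "GGGGSSSSPPPPLLLL",
]
clusters = greedy_cluster(sequences)
```

For deduplication, the representatives would be kept and the other members of each cluster treated as candidate duplicates.

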
ISU Biological Ontologies

https://digital.ag.iastate.edu/isu-biological-ontologies-fall-seminar-series-september-17

Biomolecular Data Resources: Bioinformatics Infrastructure for Biomedical Data Science

Applying Ontology
Update ontologies

PhD thesis '14

Provenance, propagation and quality of biological annotation
