Related Works

Mis-annotations in the public databases

  • TODO: what is the latest update on this?
  • "we propose five strategies to address fundamental issues in the annotation of sequence databases: (i) to clearly separate experimentally verified and unverified sequence entries; (ii) to enable a system for tracing the origins of annotations; (iii) to separate entries with high‐quality, informative annotation from less useful ones; (iv) to integrate automated quality‐control software whenever such tools exist; and (v) to facilitate postsubmission editing of annotations and metadata associated with sequences. "
  • TODO: citations
ISU PhD thesis '13 on misannotations
  • "The problem in the reliability of the data is the possibility of misannotations. The misannotations are some time introduced due to the process of automation of annotation process which are carried out extensively with the help of computers. Misannotations, if introduced, multiplies in subsequent additions and may accumulate to an unbelievable extent and create confusion. A possible solution to prevent this from happening is to flag the protein sequence which has been annotated by sequence comparison but whose function has not been validated by experimental methods. "
  • Then, we use these quality indicators to train an anomaly detection algorithm to classify records as “confident” or “suspicious”.

Data cleaning/curation

  • Broadly speaking, a data cleaning process is composed of a detection step in which the existing conflicts are identified, and a resolution step in which the inconsistencies are solved.
  • Data Cleaning: Problems and Existing Solutions: conflicts in data model, etc
  • difference between equality and similarity
  • Dataset:
  • implementation link?
  • contributions:
  • "Table pattern definition and discovery."
  • experiments?:
  • With the assistance of the PubTator text-mining tool, we tagged more than 10 000 articles to assess the ratio of papers relevant for curation.
  • "correct taxonomic identification of DNA sequences is central to studies of biodiversity using both shotgun metagenomic and metabarcoding approaches. "
  • Despite improvements in bioinformatics methods, millions of sequences in databanks are not assigned reliable functions.
  • The curation of protein functions in the context of biological processes is a way to evaluate and improve their annotation.
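
The detection/resolution split noted at the top of this list can be sketched as follows. This is a minimal illustration, not the method of any cited paper: the record layout, the field names (`accession`, `organism`), and the majority-vote resolution policy are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def detect_conflicts(records, key, field):
    """Detection step: find keys whose records disagree on `field`."""
    groups = defaultdict(set)
    for rec in records:
        groups[rec[key]].add(rec[field])
    return {k: vals for k, vals in groups.items() if len(vals) > 1}


def resolve_conflicts(records, key, field, conflicts):
    """Resolution step: overwrite conflicting values with the most
    frequent one (majority vote is an assumed policy)."""
    votes = defaultdict(Counter)
    for rec in records:
        if rec[key] in conflicts:
            votes[rec[key]][rec[field]] += 1
    winners = {k: c.most_common(1)[0][0] for k, c in votes.items()}
    return [{**rec, field: winners.get(rec[key], rec[field])} for rec in records]


# Hypothetical records: the third entry has an inconsistent spelling.
records = [
    {"accession": "P1", "organism": "E. coli"},
    {"accession": "P1", "organism": "E. coli"},
    {"accession": "P1", "organism": "E.coli"},
    {"accession": "P2", "organism": "H. sapiens"},
]
conflicts = detect_conflicts(records, "accession", "organism")
cleaned = resolve_conflicts(records, "accession", "organism", conflicts)
```

Keeping the two steps as separate functions means the resolution policy can be swapped (for example, for manual review) without touching detection.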

Data validation/database integration

Ontologies:
On2Vec: Embedding-based Relation Prediction for Ontology Population
  • This paper proposes a greatly improved translation-based graph embedding method that helps ontology population by way of relation prediction.
  • Extensive experiments on four data sets show promising capability of On2Vec on predicting and verifying relation facts.
  • TODO: Energy function
Ontology-based validation and identification of regulatory phenotypes
  • Problem: Despite improvements in bioinformatics methods, millions of sequences in databanks are not assigned reliable functions.
  • Use Apache Jena
  • TODO: reproduce the results: https://github.com/bio-ontology-research-group/uniprot2owl
  • "Expressing functions in biomedical ontologies currently uses formal representation patterns that renders basic reasoning tasks to fall in complexity classes beyond polynomial time, thereby limiting the potential of using knowledge-based methods for data integration, querying or quality control."
  • future works?
  • limitations?
  • TODO: look at citations
  • Github repo: https://github.com/bio-ontology-research-group/Onto2Graph
  • Ontology is a DAG
  • application: semantic similarity
  • "While the taxonomy of an ontology can be inferred directly from the axioms of an ontology as one of the standard OWL reasoning tasks, creating general graph structures from OWL ontologies that exploit the ontologies’ semantic content remains a challenge."
  • Ontologies are widely applied in biology and biomedicine for annotation and integration of data [3].
  • OWL is a formal language based on Description Logics [8] and offers a formal, model-theoretic semantics.
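
Since the notes above treat an ontology as a DAG and name semantic similarity as an application, here is a small sketch of one simple graph-based measure: the Jaccard overlap of two terms' ancestor sets. It is a stand-in chosen for brevity, not the measure used by Onto2Graph or any other cited work, and the toy ontology is hypothetical.

```python
def ancestors(term, parents):
    """Return `term` plus every term reachable by following is-a edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def jaccard_similarity(a, b, parents):
    """Shared ancestors divided by all ancestors of either term."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


# Hypothetical toy ontology: child term -> list of direct parents.
parents = {
    "kinase activity": ["catalytic activity"],
    "hydrolase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "binding": ["molecular function"],
}
close = jaccard_similarity("kinase activity", "hydrolase activity", parents)
far = jaccard_similarity("kinase activity", "binding", parents)
```

Terms that share a nearby parent score higher than terms that only meet at the root.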
OWL-NETS: Transforming OWL Representations for Improved Network Inference
  • creates an abstract network from OWL
  • implemented in Python 2.7
  • used for modularization or partitioning of an ontology
  • website link
  • Repository
  • contributions: "We propose the Ontological Pathfinding algorithm (OP) that scales to web-scale knowledge bases via a series of parallelization and optimization techniques: a relational knowledge base model to apply inference rules in batches, a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm to break the mining tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. "
  • rule mining on top of Spark
  • Our primary focus of this paper is scalable mining.
  • we model a knowledge base as a collection of (subject, predicate, object) facts Γ = {(s, p, o)}
    • Each predicate specifies relationships among its subjects and objects. We call each relationship a fact. For example, the fact (Barack Obama, wasBornIn, USA) specifies that Barack Obama was born in the USA.
  • Limitations?
  • inference rules in batches using join queries
  • Future works?
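
The fact model Γ = {(s, p, o)} and the idea of applying an inference rule in a batch via a join can be sketched as below. This is a single-machine illustration only, not the Ontological Pathfinding algorithm or its Spark implementation; the rule shape, the `isLocatedIn` predicate, and the Honolulu facts are assumptions added for the example.

```python
from collections import defaultdict


def apply_rule(facts, body1, body2, head):
    """Apply head(x, z) <- body1(x, y), body2(y, z) to all facts at once.

    The two body atoms are joined on the shared variable y with a hash
    join; only facts not already in the knowledge base are returned.
    """
    # Build side: index body2 facts by subject (the join variable y).
    by_subject = defaultdict(set)
    for s, p, o in facts:
        if p == body2:
            by_subject[s].add(o)
    # Probe side: look up each body1 object in the index.
    inferred = set()
    for s, p, o in facts:
        if p == body1:
            for z in by_subject.get(o, ()):
                inferred.add((s, head, z))
    return inferred - facts


# Knowledge base as a set of (subject, predicate, object) facts.
facts = {
    ("Barack Obama", "wasBornIn", "Honolulu"),
    ("Honolulu", "isLocatedIn", "USA"),
}
new_facts = apply_rule(facts, "wasBornIn", "isLocatedIn", "wasBornIn")
```

Here the rule derives the fact (Barack Obama, wasBornIn, USA) used as the example in the notes above.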

node2vec, word embedding, sequence embedding

  • challenges of embedding: scalability, choice of dimensionality, and features to be preserved
  • Github repo: https://github.com/hamid58b/GEM
  • One application of graph analysis is clustering

Translate words to Ontology

  • unsupervised learning
  • Linking each identified entity mention in text to an ontology/dictionary concept is an essential task to make sense of the identified entities.
  • This paper presents an unsupervised approach for the linking of named entities to concepts in an ontology/dictionary.
  • "Word embedding models, which learn distributed representations of words from large unlabeled corpora, are promising approaches for capturing semantic information [34]."
related works:
  • [3, 8, 35, 42] applying word embedding in biomedical domain
  • Online link: https://eztag.bioqrator.org/
  • "To support annotating a wide variety of biological concepts with or without pre-existing training data, we developed ezTag, a web-based annotation tool that allows curators to perform annotation and provide training data with humans in the loop. ezTag supports both abstracts in PubMed and full-text articles in PubMed Central."
  • A query processing engine that decomposes an input query to search, list, and infer on the Vectorized Knowledge Graph (VKG) structure

Text similarity

Review Papers:

  • we focus on data storage techniques, indexing strategies, and query execution mechanisms.
  • 5.2 Hadoop-Based RDF Systems
Rule Induction and Reasoning over Knowledge Graphs
  • "For rules learned from incomplete knowledge graphs (KGs), confidence and other measures may be misleading, as they do not reflect the patterns in the missing facts. This might lead to the extraction of erroneous rules from incomplete and biased KGs."
Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods

Modularization

  • "Such techniques enable the identification of a module of an ontology, i.e., a subset of the ontology’s axioms which then can be used for querying and classifying in, usually, less time than the time that would be required for querying or classifying the whole ontology."
  • Limitations of these techniques: modularization techniques that reduce the signature do not enable queries of the whole ontology.
UniProt

Tools and implementations

GOATOOLS: A Python library for Gene Ontology analyses
  • [Github repo](GOATOOLS: A Python library for Gene Ontology analyses)
  • Processes the OBO-formatted file from the Gene Ontology website. The data structure is a directed acyclic graph (DAG) that allows easy traversal from leaf to root.
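
To make the leaf-to-root traversal concrete, here is a minimal sketch that reads `is_a` links from OBO-formatted text and enumerates every path from a term to a root. It deliberately does not use the GOATOOLS API; the parser handles only `[Term]` stanzas with `id:` and `is_a:` tags, and the `EX:` identifiers are made up for the example.

```python
def parse_obo(text):
    """Map each term id to the list of its direct is_a parent ids."""
    parents = {}
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id: "):
            current = line[len("id: "):]
            parents[current] = []
        elif in_term and current and line.startswith("is_a: "):
            # Drop the trailing "! name" comment after the parent id.
            parent = line[len("is_a: "):].split("!")[0].strip()
            parents[current].append(parent)
    return parents


def paths_to_root(term, parents):
    """Enumerate every is_a path from `term` up to a root term."""
    if not parents.get(term):
        return [[term]]
    return [[term] + path
            for parent in parents[term]
            for path in paths_to_root(parent, parents)]


# Hypothetical ontology fragment: one leaf with two parents.
OBO_TEXT = """
format-version: 1.2

[Term]
id: EX:0000001
name: root

[Term]
id: EX:0000002
name: middle a
is_a: EX:0000001 ! root

[Term]
id: EX:0000003
name: middle b
is_a: EX:0000001 ! root

[Term]
id: EX:0000004
name: leaf
is_a: EX:0000002 ! middle a
is_a: EX:0000003 ! middle b
"""
dag = parse_obo(OBO_TEXT)
paths = paths_to_root("EX:0000004", dag)
```

Because a term may have several parents, a single leaf can yield more than one path to the root, which is why the structure is a DAG rather than a tree.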
Neo4j for graph processing
SPARQL: RDF query language.
Implementation: Semantic Search Based on Domain Ontology Using Apache Spark and Jena:

Implementation by MapReduce and Spark:

  • Hadoop has been adopted for RDF data management systems.
  • we introduce distributed RDF data stores, namely VPExp and 3CStore, based on the existing vertical partitioning (VP) approach
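
The basic vertical partitioning (VP) idea can be sketched in a few lines: every predicate gets its own two-column (subject, object) table, so a triple pattern with a fixed predicate touches only one table. This shows plain VP on a single machine; it does not reproduce the VPExp or 3CStore extensions, and the `ex:` triples are hypothetical.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Split triples into one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def match(tables, predicate, subject=None, obj=None):
    """Answer one triple pattern by scanning a single predicate table.

    `None` acts as a wildcard for the subject or object position.
    """
    return [
        (s, o)
        for s, o in tables.get(predicate, [])
        if (subject is None or s == subject) and (obj is None or o == obj)
    ]


# Hypothetical RDF-style triples.
triples = [
    ("ex:P1", "ex:hasFunction", "ex:kinase"),
    ("ex:P2", "ex:hasFunction", "ex:hydrolase"),
    ("ex:P1", "ex:foundIn", "ex:ecoli"),
]
tables = vertical_partition(triples)
functions_of_p1 = match(tables, "ex:hasFunction", subject="ex:P1")
```

In a distributed store each predicate table would typically be a separate file or partition; the dictionary here only stands in for that layout.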

Incremental reasoning

  • we propose an incremental reasoning algorithm which can effectively avoid re-reasoning over the entire knowledge graph
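
One way to picture avoiding re-reasoning over the whole graph is an incrementally maintained transitive closure of a subclass relation: when an edge is added, only pairs that involve the new edge are derived. This is a generic textbook-style sketch under that assumption, not the algorithm proposed in the cited paper, and the class and method names are made up for the example.

```python
class IncrementalClosure:
    """Maintain the transitive closure of is-a edges incrementally."""

    def __init__(self):
        self.up = {}    # term -> all inferred superclasses
        self.down = {}  # term -> all inferred subclasses

    def add(self, sub, sup):
        """Add the edge sub is-a sup; return only newly inferred pairs.

        Everything at or below `sub` becomes a subclass of everything
        at or above `sup`; no other pair can change, so the rest of
        the graph is never revisited.
        """
        lower = {sub} | self.down.get(sub, set())
        upper = {sup} | self.up.get(sup, set())
        new_pairs = set()
        for x in lower:
            for y in upper:
                if y not in self.up.setdefault(x, set()):
                    self.up[x].add(y)
                    self.down.setdefault(y, set()).add(x)
                    new_pairs.add((x, y))
        return new_pairs


closure = IncrementalClosure()
first = closure.add("B", "C")
second = closure.add("A", "B")
third = closure.add("C", "D")
```

Each call does work proportional to the number of affected pairs rather than to the size of the whole graph.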

Ontology reasoner

Clustering

  • CD-HIT clustering

  • TODO: **re-implement this approach: what computational and heuristic approach would improve on this work?**

  • Future work?

  • ISU Biological Ontologies

  • Applying Ontology
  • Update ontologies
PhD thesis '14