Commit

edit paper and some figures
KaiyanM committed Nov 20, 2023
1 parent 4a339b3 commit 46e5aa5
Showing 11 changed files with 85 additions and 9 deletions.
Binary file modified .DS_Store
Binary file not shown.
4 changes: 2 additions & 2 deletions README.md
@@ -1,9 +1,9 @@
# MolPad <img src="https://github.com/KaiyanM/MolPad/blob/main/man/figures/logo.png" align="right" height="130" /></a>
An R-Shiny Package for Cluster Co-Expression Analysis in Longitudinal Multi-Omics
An R-Shiny Package for Cluster Co-Expression Analysis in Longitudinal Microbiomics

## Overview

MolPad offers a visualization dashboard tool designed to enhance our understanding of how molecular co-expression works in the context of multi-omics microbiome data. The approach involves using a cluster network to provide an initial overview of relationships across multiple omics, with the added functionality to interactively zoom in on specific areas of interest. To facilitate this analysis, we've developed a focus-plus-context strategy that seamlessly connects to online curated annotations.
MolPad offers a visualization dashboard tool designed to enhance our understanding of how molecular co-expression works in the context of microbiome data. The approach involves using a cluster network to provide an initial overview of relationships across multiple omics, with the added functionality to interactively zoom in on specific areas of interest. To facilitate this analysis, we've developed a focus-plus-context strategy that seamlessly connects to online curated annotations.

Additionally, our package simplifies the entire pipeline for creating the dashboard. This user-friendly design makes it accessible even to students with limited R programming experience.

Binary file added Screen Shot 2023-11-20 at 3.01.57 AM.png
4 changes: 2 additions & 2 deletions index.md
@@ -1,10 +1,10 @@
MolPad: An R-Shiny Package for Cluster Co-Expression Analysis in Longitudinal Multi-Omics
MolPad: An R-Shiny Package for Cluster Co-Expression Analysis in Longitudinal Microbiomics
================
2023-05-31

## Overview

MolPad offers a visualization dashboard tool designed to enhance our understanding of how molecular co-expression works in the context of multi-omics microbiome data. The approach involves using a cluster network to provide an initial overview of relationships across multiple omics, with the added functionality to interactively zoom in on specific areas of interest. To facilitate this analysis, we've developed a focus-plus-context strategy that seamlessly connects to online curated annotations.
MolPad offers a visualization dashboard tool designed to enhance our understanding of how molecular co-expression works in the context of microbiome data. The approach involves using a cluster network to provide an initial overview of relationships across multiple omics, with the added functionality to interactively zoom in on specific areas of interest. To facilitate this analysis, we've developed a focus-plus-context strategy that seamlessly connects to online curated annotations.

The dashboard itself comprises several components, including a cluster-level network, a bar plot illustrating taxonomic composition, a line plot displaying data modalities, and a table for each pathway. You can see an illustration of these features in this screenshot.

Binary file modified man/figures/.DS_Store
Binary file not shown.
Binary file modified man/figures/dashboard.png
Binary file modified paper/cheesecase.png
Binary file modified paper/dashboard.png
77 changes: 75 additions & 2 deletions paper/paper.bib
@@ -26,7 +26,80 @@ @article{doi:10.1128/msystems.00701-22
pages = {e00701-22},
year = {2023},
doi = {10.1128/msystems.00701-22},

URL = {https://journals.asm.org/doi/abs/10.1128/msystems.00701-22},
eprint = {https://journals.asm.org/doi/pdf/10.1128/msystems.00701-22}}
}
}
@article{BOKULICH20204048,
title = {Measuring the microbiome: Best practices for developing and benchmarking microbiomics methods},
journal = {Computational and Structural Biotechnology Journal},
volume = {18},
pages = {4048-4062},
year = {2020},
issn = {2001-0370},
doi = {https://doi.org/10.1016/j.csbj.2020.11.049},
url = {https://www.sciencedirect.com/science/article/pii/S200103702030516X},
author = {Nicholas A. Bokulich and Michal Ziemski and Michael S. Robeson and Benjamin D. Kaehler},
keywords = {Microbiome, Benchmarking, Software development, Best practices, Metagenomics, Amplicon sequencing, Marker-gene sequencing},
abstract = {Microbiomes are integral components of diverse ecosystems, and increasingly recognized for their roles in the health of humans, animals, plants, and other hosts. Given their complexity (both in composition and function), the effective study of microbiomes (microbiomics) relies on the development, optimization, and validation of computational methods for analyzing microbial datasets, such as from marker-gene (e.g., 16S rRNA gene) and metagenome data. This review describes best practices for benchmarking and implementing computational methods (and software) for studying microbiomes, with particular focus on unique characteristics of microbiomes and microbiomics data that should be taken into account when designing and testing microbiomics methods.}
}

@article{corel:hal-01300043,
TITLE = {{Network-Thinking: Graphs to Analyze Microbial Complexity and Evolution}},
AUTHOR = {Corel, Eduardo and Lopez, Philippe and M{\'e}heust, Rapha{\"e}l and Bapteste, Eric},
URL = {https://hal.sorbonne-universite.fr/hal-01300043},
JOURNAL = {{Trends in Microbiology}},
PUBLISHER = {{Elsevier}},
VOLUME = {24},
NUMBER = {3},
PAGES = {224-237},
YEAR = {2016},
MONTH = Mar,
DOI = {10.1016/j.tim.2015.12.003},
KEYWORDS = {introgression ; gene transfer ; graph theory ; bipartite graph ; symbiosis ; evolution},
PDF = {https://hal.sorbonne-universite.fr/hal-01300043/file/Network-Thinking.pdf},
HAL_ID = {hal-01300043},
HAL_VERSION = {v1},
}

@article{https://doi.org/10.1002/pro.3711,
author = {Kanehisa, Minoru and Sato, Yoko},
title = {KEGG Mapper for inferring cellular functions from protein sequences},
journal = {Protein Science},
volume = {29},
number = {1},
pages = {28-35},
keywords = {genome annotation, KEGG, KEGG Mapper, KEGG module, KEGG Orthology, pathway analysis},
doi = {https://doi.org/10.1002/pro.3711},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/pro.3711},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/pro.3711},
abstract = {Abstract KEGG is a reference knowledge base for biological interpretation of large-scale molecular datasets, such as genome and metagenome sequences. It accumulates experimental knowledge about high-level functions of the cell and the organism represented in terms of KEGG molecular networks, including KEGG pathway maps, BRITE hierarchies, and KEGG modules. By the process called KEGG mapping, a set of protein coding genes in the genome, for example, can be converted to KEGG molecular networks enabling interpretation of cellular functions and other high-level features. Here we report a new version of KEGG Mapper, a suite of KEGG mapping tools available at the KEGG website (https://www.kegg.jp/ or https://www.genome.jp/kegg/), together with the KOALA family tools for automatic assignment of KO (KEGG Orthology) identifiers used in the mapping.},
year = {2020}
}

@INPROCEEDINGS{6094057,
author={Fernstad, Sara Johansson and Johansson, Jimmy and Adams, Suzi and Shaw, Jane and Taylor, David},
booktitle={2011 IEEE Symposium on Biological Data Visualization (BioVis).},
title={Visual exploration of microbial populations},
year={2011},
volume={},
number={},
pages={127-134},
doi={10.1109/BioVis.2011.6094057}}


@article{GENIE3,
doi = {10.1371/journal.pone.0012776},
author = {Huynh-Thu, Vân Anh AND Irrthum, Alexandre AND Wehenkel, Louis AND Geurts, Pierre},
journal = {PLOS ONE},
publisher = {Public Library of Science},
title = {Inferring Regulatory Networks from Expression Data Using Tree-Based Methods},
year = {2010},
month = {09},
volume = {5},
url = {https://doi.org/10.1371/journal.pone.0012776},
pages = {1-10},
abstract = {One of the pressing open problems of computational systems biology is the elucidation of the topology of genetic regulatory networks (GRNs) using high throughput genomic data, in particular microarray gene expression data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge aims to evaluate the success of GRN inference algorithms on benchmarks of simulated data. In this article, we present GENIE3, a new algorithm for the inference of GRNs that was best performer in the DREAM4 In Silico Multifactorial challenge. GENIE3 decomposes the prediction of a regulatory network between p genes into p different regression problems. In each of the regression problems, the expression pattern of one of the genes (target gene) is predicted from the expression patterns of all the other genes (input genes), using tree-based ensemble methods Random Forests or Extra-Trees. The importance of an input gene in the prediction of the target gene expression pattern is taken as an indication of a putative regulatory link. Putative regulatory links are then aggregated over all genes to provide a ranking of interactions from which the whole network is reconstructed. In addition to performing well on the DREAM4 In Silico Multifactorial challenge simulated data, we show that GENIE3 compares favorably with existing algorithms to decipher the genetic regulatory network of Escherichia coli. It doesn't make any assumption about the nature of gene regulation, can deal with combinatorial and non-linear interactions, produces directed GRNs, and is fast and scalable. In conclusion, we propose a new algorithm for GRN inference that performs well on both synthetic and real gene expression data. The algorithm, based on feature selection with tree-based ensemble methods, is simple and generic, making it adaptable to other types of genomic data and interactions.},
number = {9},

}
9 changes: 6 additions & 3 deletions paper/paper.md
@@ -38,13 +38,16 @@ The R-Shiny package MolPad provides an interactive dashboard for understanding t

# Statement of need

The realm of microbiomics is expanding rapidly, with numerous new studies and methodologies emerging. This highlights the need for visualizations that can account for differences across modalities. It’s also important to enable interpretations of dynamics and network structure because these have specific meanings in the genomic context. Another issue is the annotation. The special modality characteristic of microbiomics determines that each identical feature can be classified with various taxons and could have several IDs in different databases. Therefore, although the annotation is available online, it can be tedious to search for parts manually. Moreover, most present visualizations poorly evaluate longitudinal change across microbiomes. In longitudinal data, we need to gain insight into the functioning of how individual features change and how they may influence related features. Thus, it depends upon analysis within one table and across tables. All of these have posed a challenge for unified visualization and interpretation.
The realm of microbiomics is expanding rapidly, with numerous new studies and methodologies emerging [@BOKULICH20204048]. This highlights the need for visual exploration tools that can account for interaction across biological modalities [@6094057]. It’s also important to enable interpretations of dynamics and network structure because these have specific meanings in the genomic context [@corel:hal-01300043]. Another issue is the annotation. The special modality characteristic of microbiomics determines that each identical feature can be classified with various taxons and could have several IDs in different databases [@https://doi.org/10.1002/pro.3711]. Although the annotation is available online, it can be tedious to search for parts manually. Moreover, most present visualizations poorly evaluate longitudinal change across microbiomes. In longitudinal data, we need to gain insight into the functioning of how individual features change and how they may influence related features. Thus, it depends upon analysis within one table and across tables. All of these have posed a challenge for unified visualization and interpretation.

In response to the above issues, previous studies on interactive visualization tools have designed methods to work on such data. `microViz` [@microviz] provides a Shiny app for interactive exploration by pairing ordination plots and composition circular bar charts to show each taxon's prevalence and abundance. `GWENA` [@Lemoine_Scott-Boyer_Ambroise_Périn_Droit_2021] applies a network in conducting gene co‑expression analysis and extended module characterization in a single package to understand the underlying processes contributing to a disease or a phenotype. `NeVOmics` [@Zúñiga-León_Carrasco-Navarro_Fierro_2018] improved compatibility with a dynamic dashboard and facilitated the functional characterization of data from omics technologies. It also integrates Over-representation analysis methodology and network-based visualization to show the enrichment results. These methods suggest the mechanisms that improve the utility of microbiomics visualization tools under analysis.

# Methods

To depict the longitudinal changes, we first scale and cluster trajectories across all molecular features and then reorganize the clusters into a network graph. We use K-means and a built-in elbow method to choose the optimal number of clusters. Then, we take the centers of each cluster and run a random forest regression for each centroid with all the other centroids as predictors. We pick the top n predictors to build a cluster network with the Mean Decrease Accuracy as the feature importance. Based on the random forest prediction, if two groups of features are highly linked according to the network, they will have strongly related longitudinal patterns, as shown in Fig \ref{fig:pattern}.
We first scale and cluster the trajectories across all molecular features to depict the longitudinal changes. For clustering, we use K-means and a built-in elbow method to choose the optimal number. Then, we predict a co-expression network for the extracted patterns, similar to what GENIE3 [@GENIE3] does to create a genetic regulatory network. We also divide the prediction process into individual regression tasks. Each central pattern of a cluster is predicted from the expression patterns of all the other central patterns, using tree-based ensemble methods Random Forests. It is chosen because of its potential to deal with interacting features and non-linearity without making any extra assumptions. The Mean Decrease Accuracy of a subset of top predictors whose expression directly influences the expression of the target cluster is taken as an indication of a putative link. That is to say, based on the random forest prediction, if two groups of features are highly linked according to the network, they will have strongly related longitudinal patterns, as shown in Fig \ref{fig:pattern}.
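
The scaling and clustering step described in this paragraph can be sketched roughly as follows. This is an illustration only, written in plain Python rather than the package's R, and it is not MolPad's implementation. The function names (`zscore`, `kmeans`, `choose_k`), the farthest-point initialisation, and the particular elbow rule (largest second difference of the within-cluster sum of squares) are all assumptions made for the example; the paper says only that K-means and a built-in elbow method are used. The toy trajectories are made up.

```python
import math

def zscore(traj):
    """Scale one trajectory to mean 0 and standard deviation 1."""
    mean = sum(traj) / len(traj)
    sd = math.sqrt(sum((x - mean) ** 2 for x in traj) / len(traj))
    return [(x - mean) / sd if sd > 0 else 0.0 for x in traj]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(row, centroids):
    """Index of the centroid closest to row."""
    return min(range(len(centroids)), key=lambda c: sq_dist(row, centroids[c]))

def kmeans(rows, k, iters=100):
    """Lloyd's algorithm; returns (centroids, labels, within-cluster SS)."""
    # Assumption: deterministic farthest-point initialisation, chosen here
    # only so the sketch is reproducible without a random seed.
    centroids = [list(rows[0])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        labels = [nearest(r, centroids) for r in rows]
        updated = []
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep an emptied cluster's centre
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(r, centroids) for r in rows]
    wss = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return centroids, labels, wss

def choose_k(rows, k_max):
    """Assumed elbow rule: the k with the sharpest bend in the WSS curve."""
    wss = {k: kmeans(rows, k)[2] for k in range(1, k_max + 1)}
    bend = lambda k: wss[k - 1] - 2 * wss[k] + wss[k + 1]
    return max(range(2, k_max), key=bend), wss

# Toy data: three trajectory shapes, four slightly perturbed copies of each.
shapes = [
    [0, 1, 2, 3, 4, 5, 6, 7],  # steadily increasing
    [0, 1, 2, 3, 3, 2, 1, 0],  # single peak
    [2, 0, 2, 0, 0, 2, 0, 2],  # w-like oscillation
]
raw = [[v + 0.05 * ((3 * i + 2 * t) % 5 - 2) for t, v in enumerate(shape)]
       for shape in shapes for i in range(4)]
scaled = [zscore(r) for r in raw]
best_k, wss = choose_k(scaled, k_max=6)
centroids, labels, _ = kmeans(scaled, best_k)
print(best_k, labels)
```

The resulting `centroids` play the role of the "central patterns" that the paragraph then feeds into the per-cluster regressions.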

Navigating the network in the MolPad dashboard follows three steps: First, choose a primary functional annotation. Adjustment options for fine-tuning include network layout and importance threshold for edge density. Nodes that turn bright green (Fig \ref{fig:dashboard}.A) represent clusters containing the most features in the chosen functional annotation. Second, brushing on the network reveals patterns of taxonomic composition (Fig \ref{fig:dashboard}.B) and typical trajectories (Fig \ref{fig:dashboard}.C). The user can also zoom into specific taxonomic annotations by filtering. Third, view the feature table (Fig \ref{fig:dashboard}.D) , examine the drop-down options for other related function annotations, and click the link for online information on the interested items. The interface is designed to support iterative exploration, encouraging the use of several steps to answer specific questions, like comparing the pattern distribution between two functions or finding functionally important community members metabolizing a feature of interest. Overall, this aggregation adopted the focus-plus-context approach to address the low interoperability of the network graph, facilitating the examination of high-level details for individual features while providing contextual information about cluster interactions among microbiome data.


# Case Study: Cheese Data

@@ -62,7 +65,7 @@ The source code for `MolPad` is stored on [Github](https://github.com/KaiyanM/Mo

![Dashboard Overview: `A`: cluster-level network, `B`: taxonomic-level bar plot, `C`: a type-level line plot, and `D`: a feature-level table. \label{fig:dashboard}](dashboard.png)

![Example of discovering related patterns with network plot. For `a`, the two linked nodes are in the dashed box and have a closer inverse pattern than the other. For `b`, these groups are both less volatile on average and have similar inverse patterns.\label{fig:pattern}](pattern.png){ width=60% }
![Example of discovering related patterns with network plot. For Groups 1, 7, and 8, the patterns are w-shape with an evident peak at the same time section. For Groups 1 and 2, although Group 1 has higher volatility, they both follow highly overlapped increasing trends.\label{fig:pattern}](pattern.png){ width=60% }

![Dashboard showing Groups 10, 7, 4, and 3 for the bacterial (a.) and Group 4 for the eukaryotic (b.) community. Groups 10 and 4 have decreasing trends for both cheeses, and they all include largely Proteobacteria and Firmicutes. While Groups 3 and 7 have the opposite increasing trends, which include more Actinobacteria and Bacteroidetes. Among these, Groups 7 and 4 have the strongest periodicity, suggesting a more reproducible tendency for the corresponding main components. For the eukaryote community, most of the features followed the same stable pattern as in Group 4. \label{fig:cheesecase}](cheesecase.png){ width=80% }

Binary file modified paper/pattern.png
