diff --git a/.DS_Store b/.DS_Store
index 52dd4d2..7a464d8 100644
Binary files a/.DS_Store and b/.DS_Store differ
diff --git a/README.md b/README.md
index 3de934f..0b97e48 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,9 @@
 # MolPad
-An R-Shiny Package for Cluster Co-Expression Analysis in Longitudinal Multi-Omics
+An R-Shiny Package for Cluster Co-Expression Analysis in Longitudinal Microbiomics
 
 ## Overview
 
-MolPad offers a visualization dashboard tool designed to enhance our understanding of how molecular co-expression works in the context of multi-omics microbiome data. The approach involves using a cluster network to provide an initial overview of relationships across multiple omics, with the added functionality to interactively zoom in on specific areas of interest. To facilitate this analysis, we've developed a focus-plus-context strategy that seamlessly connects to online curated annotations.
+MolPad offers a visualization dashboard tool designed to enhance our understanding of how molecular co-expression works in the context of microbiome data. The approach involves using a cluster network to provide an initial overview of relationships across multiple omics, with the added functionality to interactively zoom in on specific areas of interest. To facilitate this analysis, we've developed a focus-plus-context strategy that seamlessly connects to online curated annotations.
 
 Additionally, our package simplifies the entire pipeline for creating the dashboard. This user-friendly design makes it accessible even to students with limited R programming experience.
 
diff --git a/Screen Shot 2023-11-20 at 3.01.57 AM.png b/Screen Shot 2023-11-20 at 3.01.57 AM.png
new file mode 100644
index 0000000..58ace63
Binary files /dev/null and b/Screen Shot 2023-11-20 at 3.01.57 AM.png differ
diff --git a/index.md b/index.md
index 1ee69c1..0530d04 100644
--- a/index.md
+++ b/index.md
@@ -1,10 +1,10 @@
-MolPad: An R-Shiny Package for Cluster Co-Expression Analysis in Longitudinal Multi-Omics
+MolPad: An R-Shiny Package for Cluster Co-Expression Analysis in Longitudinal Microbiomics
 ================
 2023-05-31
 
 ## Overview
 
-MolPad offers a visualization dashboard tool designed to enhance our understanding of how molecular co-expression works in the context of multi-omics microbiome data. The approach involves using a cluster network to provide an initial overview of relationships across multiple omics, with the added functionality to interactively zoom in on specific areas of interest. To facilitate this analysis, we've developed a focus-plus-context strategy that seamlessly connects to online curated annotations.
+MolPad offers a visualization dashboard tool designed to enhance our understanding of how molecular co-expression works in the context of microbiome data. The approach involves using a cluster network to provide an initial overview of relationships across multiple omics, with the added functionality to interactively zoom in on specific areas of interest. To facilitate this analysis, we've developed a focus-plus-context strategy that seamlessly connects to online curated annotations.
 
 The dashboard itself comprises several components, including a cluster-level network, a bar plot illustrating taxonomic composition, a line plot displaying data modalities, and a table for each pathway. You can see an illustration of these features in this screenshot.
 
diff --git a/man/figures/.DS_Store b/man/figures/.DS_Store
index 4c858b5..a4d756f 100644
Binary files a/man/figures/.DS_Store and b/man/figures/.DS_Store differ
diff --git a/man/figures/dashboard.png b/man/figures/dashboard.png
index ccdbea7..d9a048b 100644
Binary files a/man/figures/dashboard.png and b/man/figures/dashboard.png differ
diff --git a/paper/cheesecase.png b/paper/cheesecase.png
index d91f265..84398f4 100644
Binary files a/paper/cheesecase.png and b/paper/cheesecase.png differ
diff --git a/paper/dashboard.png b/paper/dashboard.png
index ccdbea7..d9a048b 100644
Binary files a/paper/dashboard.png and b/paper/dashboard.png differ
diff --git a/paper/paper.bib b/paper/paper.bib
index 603c297..39b17ef 100644
--- a/paper/paper.bib
+++ b/paper/paper.bib
@@ -26,7 +26,80 @@ @article{doi:10.1128/msystems.00701-22
 pages = {e00701-22},
 year = {2023},
 doi = {10.1128/msystems.00701-22},
-
 URL = {https://journals.asm.org/doi/abs/10.1128/msystems.00701-22},
 eprint = {https://journals.asm.org/doi/pdf/10.1128/msystems.00701-22}}
-}
\ No newline at end of file
+}
+
+@article{BOKULICH20204048,
+title = {Measuring the microbiome: Best practices for developing and benchmarking microbiomics methods},
+journal = {Computational and Structural Biotechnology Journal},
+volume = {18},
+pages = {4048-4062},
+year = {2020},
+issn = {2001-0370},
+doi = {https://doi.org/10.1016/j.csbj.2020.11.049},
+url = {https://www.sciencedirect.com/science/article/pii/S200103702030516X},
+author = {Nicholas A. Bokulich and Michal Ziemski and Michael S. Robeson and Benjamin D. Kaehler},
+keywords = {Microbiome, Benchmarking, Software development, Best practices, Metagenomics, Amplicon sequencing, Marker-gene sequencing},
+abstract = {Microbiomes are integral components of diverse ecosystems, and increasingly recognized for their roles in the health of humans, animals, plants, and other hosts.
Given their complexity (both in composition and function), the effective study of microbiomes (microbiomics) relies on the development, optimization, and validation of computational methods for analyzing microbial datasets, such as from marker-gene (e.g., 16S rRNA gene) and metagenome data. This review describes best practices for benchmarking and implementing computational methods (and software) for studying microbiomes, with particular focus on unique characteristics of microbiomes and microbiomics data that should be taken into account when designing and testing microbiomics methods.}
+}
+
+@article{corel:hal-01300043,
+ TITLE = {{Network-Thinking: Graphs to Analyze Microbial Complexity and Evolution}},
+ AUTHOR = {Corel, Eduardo and Lopez, Philippe and M{\'e}heust, Rapha{\"e}l and Bapteste, Eric},
+ URL = {https://hal.sorbonne-universite.fr/hal-01300043},
+ JOURNAL = {{Trends in Microbiology}},
+ PUBLISHER = {{Elsevier}},
+ VOLUME = {24},
+ NUMBER = {3},
+ PAGES = {224-237},
+ YEAR = {2016},
+ MONTH = Mar,
+ DOI = {10.1016/j.tim.2015.12.003},
+ KEYWORDS = {introgression ; gene transfer ; graph theory ; bipartite graph ; symbiosis ; evolution},
+ PDF = {https://hal.sorbonne-universite.fr/hal-01300043/file/Network-Thinking.pdf},
+ HAL_ID = {hal-01300043},
+ HAL_VERSION = {v1},
+}
+
+@article{https://doi.org/10.1002/pro.3711,
+author = {Kanehisa, Minoru and Sato, Yoko},
+title = {KEGG Mapper for inferring cellular functions from protein sequences},
+journal = {Protein Science},
+volume = {29},
+number = {1},
+pages = {28-35},
+keywords = {genome annotation, KEGG, KEGG Mapper, KEGG module, KEGG Orthology, pathway analysis},
+doi = {https://doi.org/10.1002/pro.3711},
+url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/pro.3711},
+eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/pro.3711},
+abstract = {Abstract KEGG is a reference knowledge base for biological interpretation of large-scale molecular datasets, such as genome and metagenome sequences.
It accumulates experimental knowledge about high-level functions of the cell and the organism represented in terms of KEGG molecular networks, including KEGG pathway maps, BRITE hierarchies, and KEGG modules. By the process called KEGG mapping, a set of protein coding genes in the genome, for example, can be converted to KEGG molecular networks enabling interpretation of cellular functions and other high-level features. Here we report a new version of KEGG Mapper, a suite of KEGG mapping tools available at the KEGG website (https://www.kegg.jp/ or https://www.genome.jp/kegg/), together with the KOALA family tools for automatic assignment of KO (KEGG Orthology) identifiers used in the mapping.},
+year = {2020}
+}
+
+@INPROCEEDINGS{6094057,
+ author={Fernstad, Sara Johansson and Johansson, Jimmy and Adams, Suzi and Shaw, Jane and Taylor, David},
+ booktitle={2011 IEEE Symposium on Biological Data Visualization (BioVis).},
+ title={Visual exploration of microbial populations},
+ year={2011},
+ volume={},
+ number={},
+ pages={127-134},
+ doi={10.1109/BioVis.2011.6094057}}
+
+
+@article{GENIE3,
+ doi = {10.1371/journal.pone.0012776},
+ author = {Huynh-Thu, Vân Anh AND Irrthum, Alexandre AND Wehenkel, Louis AND Geurts, Pierre},
+ journal = {PLOS ONE},
+ publisher = {Public Library of Science},
+ title = {Inferring Regulatory Networks from Expression Data Using Tree-Based Methods},
+ year = {2010},
+ month = {09},
+ volume = {5},
+ url = {https://doi.org/10.1371/journal.pone.0012776},
+ pages = {1-10},
+ abstract = {One of the pressing open problems of computational systems biology is the elucidation of the topology of genetic regulatory networks (GRNs) using high throughput genomic data, in particular microarray gene expression data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge aims to evaluate the success of GRN inference algorithms on benchmarks of simulated data.
In this article, we present GENIE3, a new algorithm for the inference of GRNs that was best performer in the DREAM4 In Silico Multifactorial challenge. GENIE3 decomposes the prediction of a regulatory network between p genes into p different regression problems. In each of the regression problems, the expression pattern of one of the genes (target gene) is predicted from the expression patterns of all the other genes (input genes), using tree-based ensemble methods Random Forests or Extra-Trees. The importance of an input gene in the prediction of the target gene expression pattern is taken as an indication of a putative regulatory link. Putative regulatory links are then aggregated over all genes to provide a ranking of interactions from which the whole network is reconstructed. In addition to performing well on the DREAM4 In Silico Multifactorial challenge simulated data, we show that GENIE3 compares favorably with existing algorithms to decipher the genetic regulatory network of Escherichia coli. It doesn't make any assumption about the nature of gene regulation, can deal with combinatorial and non-linear interactions, produces directed GRNs, and is fast and scalable. In conclusion, we propose a new algorithm for GRN inference that performs well on both synthetic and real gene expression data. The algorithm, based on feature selection with tree-based ensemble methods, is simple and generic, making it adaptable to other types of genomic data and interactions.},
+ number = {9},
+
+}
diff --git a/paper/paper.md b/paper/paper.md
index ea101af..9f5134a 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -38,13 +38,16 @@ The R-Shiny package MolPad provides an interactive dashboard for understanding t
 
 # Statement of need
 
-The realm of microbiomics is expanding rapidly, with numerous new studies and methodologies emerging. This highlights the need for visualizations that can account for differences across modalities.
It’s also important to enable interpretations of dynamics and network structure because these have specific meanings in the genomic context. Another issue is the annotation. The special modality characteristic of microbiomics determines that each identical feature can be classified with various taxons and could have several IDs in different databases. Therefore, although the annotation is available online, it can be tedious to search for parts manually. Moreover, most present visualizations poorly evaluate longitudinal change across microbiomes. In longitudinal data, we need to gain insight into the functioning of how individual features change and how they may influence related features. Thus, it depends upon analysis within one table and across tables. All of these have posed a challenge for unified visualization and interpretation.
+The realm of microbiomics is expanding rapidly, with numerous new studies and methodologies emerging [@BOKULICH20204048]. This highlights the need for visual exploration tools that can account for interaction across biological modalities [@6094057]. It’s also important to enable interpretations of dynamics and network structure because these have specific meanings in the genomic context [@corel:hal-01300043]. Another issue is the annotation. The special modality characteristic of microbiomics determines that each identical feature can be classified with various taxons and could have several IDs in different databases [@https://doi.org/10.1002/pro.3711]. Although the annotation is available online, it can be tedious to search for parts manually. Moreover, most present visualizations poorly evaluate longitudinal change across microbiomes. In longitudinal data, we need to gain insight into the functioning of how individual features change and how they may influence related features. Thus, it depends upon analysis within one table and across tables. All of these have posed a challenge for unified visualization and interpretation.
 
 In response to the above issues, previous studies on interactive visualization tools have designed methods to work on such data. `microViz` [@microviz] provides a Shiny app for interactive exploration by pairing ordination plots and composition circular bar charts to show each taxon's prevalence and abundance. `GWENA` [@Lemoine_Scott-Boyer_Ambroise_Périn_Droit_2021] applies a network in conducting gene co‑expression analysis and extended module characterization in a single package to understand the underlying processes contributing to a disease or a phenotype. `NeVOmics` [@Zúñiga-León_Carrasco-Navarro_Fierro_2018] improved compatibility with a dynamic dashboard and facilitated the functional characterization of data from omics technologies. It also integrates Over-representation analysis methodology and network-based visualization to show the enrichment results. These methods suggest the mechanisms that improve the utility of microbiomics visualization tools under analysis.
 
 # Methods
 
-To depict the longitudinal changes, we first scale and cluster trajectories across all molecular features and then reorganize the clusters into a network graph. We use K-means and a built-in elbow method to choose the optimal number of clusters. Then, we take the centers of each cluster and run a random forest regression for each centroid with all the other centroids as predictors. We pick the top n predictors to build a cluster network with the Mean Decrease Accuracy as the feature importance. Based on the random forest prediction, if two groups of features are highly linked according to the network, they will have strongly related longitudinal patterns, as shown in Fig \ref{fig:pattern}.
+We first scale and cluster the trajectories across all molecular features to depict the longitudinal changes. For clustering, we use K-means and a built-in elbow method to choose the optimal number.
Then, we predict a co-expression network for the extracted patterns, similar to what GENIE3 [@GENIE3] does to create a genetic regulatory network. We also divide the prediction process into individual regression tasks. Each central pattern of a cluster is predicted from the expression patterns of all the other central patterns, using tree-based ensemble methods Random Forests. It is chosen because of its potential to deal with interacting features and non-linearity without making any extra assumptions. The Mean Decrease Accuracy of a subset of top predictors whose expression directly influences the expression of the target cluster is taken as an indication of a putative link. That is to say, based on the random forest prediction, if two groups of features are highly linked according to the network, they will have strongly related longitudinal patterns, as shown in Fig \ref{fig:pattern}.
+
+Navigating the network in the MolPad dashboard follows three steps: First, choose a primary functional annotation. Adjustment options for fine-tuning include network layout and importance threshold for edge density. Nodes that turn bright green (Fig \ref{fig:dashboard}.A) represent clusters containing the most features in the chosen functional annotation. Second, brushing on the network reveals patterns of taxonomic composition (Fig \ref{fig:dashboard}.B) and typical trajectories (Fig \ref{fig:dashboard}.C). The user can also zoom into specific taxonomic annotations by filtering. Third, view the feature table (Fig \ref{fig:dashboard}.D) , examine the drop-down options for other related function annotations, and click the link for online information on the interested items. The interface is designed to support iterative exploration, encouraging the use of several steps to answer specific questions, like comparing the pattern distribution between two functions or finding functionally important community members metabolizing a feature of interest.
Overall, this aggregation adopted the focus-plus-context approach to address the low interoperability of the network graph, facilitating the examination of high-level details for individual features while providing contextual information about cluster interactions among microbiome data.
+
 
 # Case Study: Cheese Data
 
@@ -62,7 +65,7 @@ The source code for `MolPad` is stored on [Github](https://github.com/KaiyanM/Mo
 
 ![Dashboard Overview: `A`: cluster-level network, `B`: taxonomic-level bar plot, `C`: a type-level line plot, and `D`: a feature-level table. \label{fig:dashboard}](dashboard.png)
 
-![Example of discovering related patterns with network plot. For `a`, the two linked nodes are in the dashed box and have a closer inverse pattern than the other. For `b`, these groups are both less volatile on average and have similar inverse patterns.\label{fig:pattern}](pattern.png){ width=60% }
+![Example of discovering related patterns with network plot. For Groups 1, 7, and 8, the patterns are w-shape with an evident peak at the same time section. For Groups 1 and 2, although Group 1 has higher volatility, they both follow highly overlapped increasing trends.\label{fig:pattern}](pattern.png){ width=60% }
 
 ![Dashboard showing Groups 10, 7, 4, and 3 for the bacterial (a.) and Group 4 for the eukaryotic (b.) community. Groups 10 and 4 have decreasing trends for both cheeses, and they all include largely Proteobacteria and Firmicutes. While Groups 3 and 7 have the opposite increasing trends, which include more Actinobacteria and Bacteroidetes. Among these, Groups 7 and 4 have the strongest periodicity, suggesting a more reproducible tendency for the corresponding main components. For the eukaryote community, most of the features followed the same stable pattern as in Group 4. \label{fig:cheesecase}](cheesecase.png){ width=80% }
 
diff --git a/paper/pattern.png b/paper/pattern.png
index 7522159..128c157 100644
Binary files a/paper/pattern.png and b/paper/pattern.png differ