From 47190f73813c08519eca0e3ad1ac32f79f225814 Mon Sep 17 00:00:00 2001 From: Luiz Irber Date: Tue, 1 Sep 2020 17:53:33 -0700 Subject: [PATCH] upd --- thesis/00-intro.Rmd | 24 +++++---- thesis/01-scaled.Rmd | 2 - thesis/02-index.Rmd | 118 ++++++++++++++++++++++++++---------------- thesis/bib/thesis.bib | 15 ++++++ 4 files changed, 100 insertions(+), 59 deletions(-) diff --git a/thesis/00-intro.Rmd b/thesis/00-intro.Rmd index ec52305..45f2c45 100644 --- a/thesis/00-intro.Rmd +++ b/thesis/00-intro.Rmd @@ -12,7 +12,7 @@ the requirements for having indexes with sizes of the same magnitude of the orig For example, NCBI provides BLAST search as a service on their website, but it uses specially prepared databases with a subset of the data stored in GenBank or similar databases. -While NCBI does offer a similar service for each dataset in the SRA (sequence read archive), +While NCBI does offer a similar service for each dataset in the SRA (Sequence Read Archive), there is no service to search across every dataset at once because of its size, which is on the order of petabytes of data and growing exponentially. @@ -28,7 +28,7 @@ k-mers can be hashed and stored in integer datatypes, allowing for fast comparison and many opportunities for compression. Solomon and Kingsford's solution for the problem, the Sequence Bloom Tree, -use these properties to define and store the k-mer composition of a dataset in a Bloom Filter [@bloom_spacetime_1970], +uses these properties to define and store the k-mer composition of a dataset in a Bloom Filter [@bloom_spacetime_1970], a probabilistic data structure that allows insertion and checking if a value might be present. Bloom Filters can be tuned to reach a predefined false positive bound, trading off memory for accuracy. @@ -50,17 +50,18 @@ The downside is the false positive increase, especially if both original filters are already reaching saturation. 
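The Bloom Filter operations discussed above (insertion, membership checks, and merging two filters by bitwise OR of their bit arrays) can be sketched in a few lines of Python. This is only an illustration under assumed names and parameters, not any existing implementation:

```python
import hashlib


class BloomFilter:
    # Illustrative Bloom Filter: size and num_hashes are example parameters
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions from a stable hash of the item
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # May report false positives, never false negatives
        return all(self.bits[pos] for pos in self._positions(item))

    def union(self, other):
        # Merging two filters is a bitwise OR; parameters must match,
        # and the merged filter saturates faster than either input
        assert self.size == other.size and self.num_hashes == other.num_hashes
        merged = BloomFilter(self.size, self.num_hashes)
        merged.bits = [a or b for a, b in zip(self.bits, other.bits)]
        return merged
```

The union operation is what an SBT uses to build internal nodes, and it makes the saturation trade-off visible: every OR can only set more bits, never clear them.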
To account for that,
Bloom Filters in an SBT need to be initialized with a size proportional to the cardinality of the combined datasets,
-which can be quite large for big collections.
+which can be quite large for large collections.
Since Bloom Filters only generate false positives,
and not false negatives,
-in the worst case there is degradation of the computational performance,
-(because more internal nodes need to be checked),
+in the worst case there is degradation of the computational performance because more internal nodes need to be checked,
but the final results are unchanged.
 
While Bloom Filters can be used to calculate similarity of datasets,
-there are data structures more efficient probabilistic data structures for this use case.
-A MinHash sketch [@broder_resemblance_1997] is a representation of a dataset that allows estimating the Jaccard similarity of the original dataset without requiring the original data to be available.
-The Jaccard similarity of two datasets is the size of the intersection of elements in both datasets divided by the size of the union of elements in both datasets.
+there are more efficient probabilistic data structures for this use case.
+A MinHash sketch [@broder_resemblance_1997] is a representation of a dataset allowing
+estimation of the Jaccard similarity between datasets without requiring the original data to be available.
+The Jaccard similarity of two datasets is the size of the intersection of elements in both datasets divided by the size of the union of elements in both datasets:
+$J(A, B)=\frac{\vert A \cap B \vert}{\vert A \cup B \vert}$.
The MinHash sketch uses a subset of the original data as a proxy for the data --
in this case,
hashing each element and taking the smallest values for each dataset.
Broder defines two approaches for taking the smallest values:
@@ -74,7 +75,8 @@ The ModHash approach also allows calculating the containment of two datasets,
how much of a dataset is present in another.
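Broder's bottom-$k$ MinHash and the corresponding Jaccard estimate can be sketched as follows, assuming a stable 64-bit hash; the function names are illustrative, not taken from any library:

```python
import hashlib


def hash_kmer(kmer):
    # Stable 64-bit hash of a k-mer (any well-mixed hash would do)
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_k_sketch(kmers, k=100):
    # MinHash(S): the k smallest distinct hash values of the dataset
    return set(sorted({hash_kmer(km) for km in kmers})[:k])


def estimate_jaccard(sketch_a, sketch_b, k=100):
    # Bottom-k estimator: sketch the union of the two sketches,
    # then count how many of its members appear in both
    union_sketch = set(sorted(sketch_a | sketch_b)[:k])
    shared = union_sketch & sketch_a & sketch_b
    return len(shared) / len(union_sketch)
```

Note that the sketch of the union can be computed from the two sketches alone, which is why the original data never needs to be available at comparison time.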
It is defined as the size of the intersection divided by the size of the dataset,
and so is asymmetrical
-(unlike the Jaccard similarity).
+(unlike the Jaccard similarity):
+$C(A, B)=\frac{\vert A \cap B \vert}{\vert A \vert}$.
While the MinHash can also calculate containment,
if the datasets are of distinct cardinalities the errors accumulate quickly.
This is relevant for biological use cases,
@@ -117,7 +119,7 @@ and a new approach for containment estimation using Scaled MinHash sketches.
 
**Chapter 2** describes indexing methods for sketches,
focusing on a hierarchical approach optimized for storage access and low memory consumption (`MHBT`)
-and a fast inverted index optimized for fast retrieval but with larger memory consumption (`Revindex`).
+and an inverted index optimized for fast retrieval but with larger memory consumption (`LCA index`).
It also introduces `sourmash`,
software implementing these indices and optimized Scaled MinHash sketches,
as well as extended functionality for iterative and exploratory biological data analysis.
@@ -128,7 +130,7 @@ Comparisons with current taxonomic profiling methods using community-developed b
assessments show that `gather` paired with taxonomic information outperforms other approaches,
using a fraction of the computational resources and allowing analysis in platforms accessible to regular
users (like laptops).
 
-**Chapter 4** describes wort,
+**Chapter 4** describes `wort`,
a framework for distributed signature calculation,
including discussions about performance and cost trade-offs for sketching public genomic databases,
as well as distributed systems architectures allowing large scale processing of petabytes of data.
diff --git a/thesis/01-scaled.Rmd b/thesis/01-scaled.Rmd
index de1276e..957b6d0 100644
--- a/thesis/01-scaled.Rmd
+++ b/thesis/01-scaled.Rmd
@@ -8,8 +8,6 @@ The {#rmd-basics} text after the chapter declaration will allow us to link throu
 
## Introduction
 
-...
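The containment definition, paired with a ModHash-flavored sketch that keeps a fixed fraction of the hash space, can be illustrated as follows; the `scaled` parameter and function names are assumptions for this example rather than an existing API:

```python
MAX_HASH = 2**64


def scaled_sketch(hashes, scaled=1000):
    # Keep a fixed fraction (1/scaled) of the hash space, so sketches of
    # datasets with very different cardinalities remain comparable
    threshold = MAX_HASH // scaled
    return {h for h in hashes if h < threshold}


def containment(sketch_a, sketch_b):
    # C(A, B) = |A intersection B| / |A|; asymmetric, unlike Jaccard
    if not sketch_a:
        return 0.0
    return len(sketch_a & sketch_b) / len(sketch_a)
```

Because the threshold is fixed rather than the sketch size, a small genome contained in a large metagenome still has all of its retained hashes present in the larger sketch, which is what makes the containment estimate usable across distinct cardinalities.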
-
-Searching for matches in large collection of datasets is challenging when hundreds of thousands of
-them are available,
+Searching for matches in large collections of datasets is challenging when hundreds of thousands of them are available,
especially if they are partitioned and the data is not all present at the same place,
or too large to even be stored in a single system.
+
+Efficient search methods for sequencing datasets use exact $k$-mer matching instead of relying on sequence alignment,
+but sensitivity is reduced since they cannot handle sequencing errors and biological variation as alignment-based methods can.
+
-Bloofi [@crainiceanu_bloofi:_2015] is a hierarchical index structure that
+Bloofi [@crainiceanu_bloofi:_2015] is an example of a hierarchical index structure that
extends the Bloom Filter basic query to collections of Bloom Filters.
Instead of calculating the union of all Bloom Filters in the collection
(which would allow answering if an element is present in any of them)
@@ -78,17 +81,18 @@ Bloofi can also be partitioned in a network,
with network nodes containing a subtree of the original tree and only being
accessed if the search requires it.
 
-The Sequence Bloom Tree [@solomon_fast_2016] adapts Bloofi for genomic contexts,
-rephrasing the problem as experiment discovery:
+For genomic contexts,
+a hierarchical index is a $k$-mer aggregative method,
+with each dataset represented by its $k$-mer composition and stored in a data structure that allows querying for $k$-mer presence.
+The Sequence Bloom Tree [@solomon_fast_2016] adapts Bloofi for genomics, rephrasing the search problem as experiment discovery:
given a query sequence $Q$ and a threshold $\theta$,
which experiments contain at least $\theta$ of the original query $Q$?
Experiments are encoded in Bloom Filters containing the $k$-mer composition of transcriptomes,
and queries are transcripts.
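The threshold search over a hierarchical index can be sketched with plain sets standing in for the Bloom Filters at each node; this is an illustrative simplification of the pruning idea, not the SBT implementation:

```python
class Node:
    # A node stores the union of k-mers in its subtree; leaves are experiments
    def __init__(self, name, kmers, children=()):
        self.name = name
        self.kmers = set(kmers)
        self.children = list(children)


def make_internal(name, children):
    # An internal node's k-mer set is the union of its children's sets
    union = set()
    for child in children:
        union |= child.kmers
    return Node(name, union, children)


def search(node, query, theta):
    # Prune: skip any subtree that cannot contain >= theta of the query k-mers
    if len(query & node.kmers) / len(query) < theta:
        return []
    if not node.children:  # leaf: report the experiment
        return [node.name]
    hits = []
    for child in node.children:
        hits.extend(search(child, query, theta))
    return hits
```

With Bloom Filters at the internal nodes the membership checks become approximate, so pruning can only be too lax (extra nodes visited), never too aggressive, matching the no-false-negatives guarantee discussed earlier.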
-Further developments focused on clustering similar datasets to prune search
+Further developments of the SBT approach focused on clustering similar datasets to prune search
early [@sun_allsome_2017] and developing more efficient representations for the
-internal nodes [@solomon_improved_2017] [@harris_improved_2018] to use less
-storage space and memory.
+internal nodes [@solomon_improved_2017; @harris_improved_2018] to use less storage space and memory.
 
-## Results
-
-### Index construction and updating
-
-
### Efficient similarity and containment queries
@@ -219,18 +239,18 @@ but it simplifies implementation and provides better correctness guarantees.
### Choosing an index
 
The Linear index is appropriate for operations that must check every signature,
-since they don't have any indexing overhead.
-An example is building a distance matrix for comparing signatures all-against-all,
-but search operations greatly benefit from extra indexing structure.
-The MHBT index and $k$-mer aggregative methods in general are appropriate for threshold queries,
+since it doesn't have any indexing overhead.
+An example is building a distance matrix for comparing signatures all-against-all.
+Search operations, by contrast, greatly benefit from extra indexing structure.
+The MHBT index and $k$-mer aggregative methods in general are appropriate for searches with query thresholds,
like searching for similarity or containment of a query in a collection of datasets.
The LCA index and color aggregative methods are appropriate for querying which datasets contain a specific query $k$-mer.
 
As implemented in sourmash,
the MHBT index is more memory efficient because the data can stay in external memory and only the tree structure for the index
-need to be loaded in memory,
+needs to be loaded in main memory,
and data for the datasets and internal nodes can be loaded and unloaded on demand.
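An inverted index of the LCA flavor reduces the "which datasets contain this $k$-mer?" query to dictionary lookups. A minimal sketch, with illustrative names rather than the sourmash API:

```python
from collections import defaultdict


def build_inverted_index(datasets):
    # datasets: mapping of name -> set of hashed k-mers;
    # invert it to map each hash -> names of the datasets containing it
    index = defaultdict(set)
    for name, hashes in datasets.items():
        for h in hashes:
            index[h].add(name)
    return index


def containing_datasets(index, query_hashes):
    # Count, per dataset, how many query hashes are present,
    # without ever touching the original sketches
    counts = defaultdict(int)
    for h in query_hashes:
        for name in index.get(h, ()):
            counts[name] += 1
    return dict(counts)
```

The whole mapping must be resident in memory, but each query hash is then answered in a single lookup, which is the trade-off against the MHBT described in the text.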
-The LCA index must be loaded in memory before it can be used,
+The LCA index must be loaded in main memory before it can be used,
but once it is loaded it is faster,
especially for operations that need to summarize $k$-mer assignments or require repeated searches.
@@ -252,6 +272,9 @@ This allows trade-offs between storage efficiency,
distribution,
updating and query performance.
 
+Because both are able to return the original sketch collection,
+it is also possible to convert one index into the other.
+
### Limitations and future directions
 
## Conclusion
diff --git a/thesis/bib/thesis.bib b/thesis/bib/thesis.bib
index ab64082..018fdde 100644
--- a/thesis/bib/thesis.bib
+++ b/thesis/bib/thesis.bib
@@ -3131,3 +3131,24 @@ @article{li_minimap2_2018
  date = {2018},
  note = {Publisher: Oxford University Press},
}
+
+@article{noauthor_p1185-zhupdf_nodate,
+  title = {{LSH} Ensemble: Internet-Scale Domain Search},
+  author = {Zhu, Erkang and Nargesian, Fatemeh and Pu, Ken Q. and Miller, Renée J.},
+  journaltitle = {Proceedings of the {VLDB} Endowment},
+  volume = {9},
+  number = {12},
+  pages = {1185--1196},
+  date = {2016},
+  url = {http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf},
+  urldate = {2020-07-20},
+}
+
+@inproceedings{pandey_general-purpose_2017,
+  title = {A general-purpose counting filter: {Making} every bit count},
+  shorttitle = {A general-purpose counting filter},
+  pages = {775--787},
+  booktitle = {Proceedings of the 2017 {ACM} International Conference on Management of Data},
+  author = {Pandey, Prashant and Bender, Michael A. and Johnson, Rob and Patro, Rob},
+  date = {2017},
+}