Merge pull request #1 from PacificBiosciences/initial_pr
Initial file sync
holtjma authored Sep 10, 2024
2 parents d26b992 + 0c065ba commit 7997a59
Showing 28 changed files with 4,929 additions and 0 deletions.
14 changes: 14 additions & 0 deletions .github/workflows/test-ci.yml
@@ -0,0 +1,14 @@
name: Continuous integration
on: [push, pull_request]

jobs:
  test:
    name: cargo test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
        with:
          components: clippy
      - run: cargo test --all-features --release
      - run: cargo clippy -- -D warnings
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
Cargo.lock
target
28 changes: 28 additions & 0 deletions Cargo.toml
@@ -0,0 +1,28 @@
[package]
name = "waffle_con"
version = "0.4.2"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
derive_builder = "0.13.0"
env_logger = "0.11.1"
itertools = "0.12.1"
log = "0.4.20"
priority-queue = "1.3.2"
rustc-hash = "1.1.0"
simple-error = "0.3.0"
rand = "0.8.5"

[dev-dependencies]
criterion = "0.5.1"
csv = "1.3.0"
serde = "1.0.197"

[[bench]]
name = "consensus_bench"
harness = false

[profile.release]
lto = true
15 changes: 15 additions & 0 deletions LICENSE.md
@@ -0,0 +1,15 @@
# Pacific Biosciences Software License Agreement
1. **Introduction and Acceptance.** This Software License Agreement (this “**Agreement**”) is a legal agreement between you (either an individual or an entity) and Pacific Biosciences of California, Inc. (“**PacBio**”) regarding the use of the PacBio software accompanying this Agreement, which includes documentation provided in “online” or electronic form (together, the “**Software**”). PACBIO PROVIDES THE SOFTWARE SOLELY ON THE TERMS AND CONDITIONS SET FORTH IN THIS AGREEMENT AND ON THE CONDITION THAT YOU ACCEPT AND COMPLY WITH THEM. BY DOWNLOADING, DISTRIBUTING, MODIFYING OR OTHERWISE USING THE SOFTWARE, YOU (A) ACCEPT THIS AGREEMENT AND AGREE THAT YOU ARE LEGALLY BOUND BY ITS TERMS; AND (B) REPRESENT AND WARRANT THAT: (I) YOU ARE OF LEGAL AGE TO ENTER INTO A BINDING AGREEMENT; AND (II) IF YOU REPRESENT A CORPORATION, GOVERNMENTAL ORGANIZATION OR OTHER LEGAL ENTITY, YOU HAVE THE RIGHT, POWER AND AUTHORITY TO ENTER INTO THIS AGREEMENT ON BEHALF OF SUCH ENTITY AND BIND SUCH ENTITY TO THESE TERMS. IF YOU DO NOT AGREE TO THE TERMS OF THIS AGREEMENT, PACBIO WILL NOT AND DOES NOT LICENSE THE SOFTWARE TO YOU AND YOU MUST NOT DOWNLOAD, INSTALL OR OTHERWISE USE THE SOFTWARE OR DOCUMENTATION.
2. **Grant of License.** Subject to your compliance with the restrictions set forth in this Agreement, PacBio hereby grants to you a non-exclusive, non-transferable license during the Term to install, copy, use, distribute in binary form only, and host the Software. If you received the Software from PacBio in source code format, you may also modify and/or compile the Software.
3. **License Restrictions.** You may not remove or destroy any copyright notices or other proprietary markings. You may only use the Software to process or analyze data generated on a PacBio instrument or otherwise provided to you by PacBio. Any use, modification, translation, or compilation of the Software not expressly authorized in Section 2 is prohibited. You may not use, modify, host, or distribute the Software so that any part of the Software becomes subject to any license that requires, as a condition of use, modification, hosting, or distribution, that (a) the Software, in whole or in part, be disclosed or distributed in source code form or (b) any third party have the right to modify the Software, in whole or in part.
4. **Ownership.** The license granted to you in Section 2 is not a transfer or sale of PacBio’s ownership rights in or to the Software. Except for the license granted in Section 2, PacBio retains all right, title and interest (including all intellectual property rights) in and to the Software. The Software is protected by applicable intellectual property laws, including United States copyright laws and international treaties.
5. **Third Party Materials.** The Software may include software, content, data or other materials, including related documentation and open source software, that are owned by one or more third parties and that are subject to separate licensee terms (“**Third-Party Licenses**”). A list of all materials, if any, can be found the documentation for the Software. You acknowledge and agree that such third party materials subject to Third-Party Licenses are not licensed to you pursuant to the provisions of this Agreement and that this Agreement shall not be construed to grant any such right and/or license. You shall have only such rights and/or licenses, if any, to use such third party materials as set forth in the applicable Third-Party Licenses.
6. **Feedback.** If you provide any feedback to PacBio concerning the functionality and performance of the Software, including identifying potential errors and improvements (“**Feedback**”), such Feedback shall be owned by PacBio. You hereby assign to PacBio all right, title, and interest in and to the Feedback, and PacBio is free to use the Feedback without any payment or restriction.
7. **Confidentiality.** You must hold in the strictest confidence the Software and any related materials or information including, but not limited to, any Feedback, technical data, research, product plans, or know-how provided by PacBio to you, directly or indirectly in writing, orally or by inspection of tangible objects (“**Confidential Information**”). You will not disclose any Confidential Information to third parties, including any of your employees who do not have a need to know such information, and you will take reasonable measures to protect the secrecy of, and to avoid disclosure and unauthorized use of, the Confidential Information. You will immediately notify the PacBio in the event of any unauthorized or suspected use or disclosure of the Confidential Information. To protect the Confidential Information contained in the Software, you may not reverse engineer, decompile, or disassemble the Software, except to the extent the foregoing restriction is expressly prohibited by applicable law.
8. **Termination.** This Agreement will terminate upon the earlier of: (a) your failure to comply with any term of this Agreement; or (b) return, destruction, or deletion of all copies of the Software in your possession. PacBio’s rights and your obligations will survive the termination of this Agreement. The “**Term**” means the period beginning on when this Agreement becomes effective until the termination of this Agreement. Upon termination of this Agreement for any reason, you will delete from all of your computer libraries or storage devices or otherwise destroy all copies of the Software and derivatives thereof.
9. **NO OTHER WARRANTIES.** THE SOFTWARE IS PROVIDED ON AN “AS IS” BASIS. YOU ASSUME ALL RESPONSIBILITIES FOR SELECTION OF THE SOFTWARE TO ACHIEVE YOUR INTENDED RESULTS, AND FOR THE INSTALLATION OF, USE OF, AND RESULTS OBTAINED FROM THE SOFTWARE. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, PACBIO DISCLAIMS ALL WARRANTIES, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, ACCURACY, TITLE, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE SOFTWARE AND THE ACCOMPANYING WRITTEN MATERIALS. THERE IS NO WARRANTY AGAINST INTERFERENCE WITH THE ENJOYMENT OF THE SOFTWARE OR AGAINST INFRINGEMENT. THERE IS NO WARRANTY THAT THE SOFTWARE OR PACBIO’S EFFORTS WILL FULFILL ANY OF YOUR PARTICULAR PURPOSES OR NEEDS.
10. **LIMITATION OF LIABILITY.** UNDER NO CIRCUMSTANCES WILL PACBIO BE LIABLE FOR ANY CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE DAMAGES WHATSOEVER (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, LOSS OF DATA OR OTHER SUCH PECUNIARY LOSS) ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE, EVEN IF PACBIO HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. IN NO EVENT WILL PACBIO’S AGGREGATE LIABILITY FOR DAMAGES ARISING OUT OF THIS AGREEMENT EXCEED $5. THE FOREGOING EXCLUSIONS AND LIMITATIONS OF LIABILITY AND DAMAGES WILL NOT APPLY TO CONSEQUENTIAL DAMAGES FOR PERSONAL INJURY.
11. **Indemnification.** You will indemnify, hold harmless, and defend PacBio (including all of its officers, employees, directors, subsidiaries, representatives, affiliates, and agents) and PacBio’s suppliers from and against any damages (including attorney’s fees and expenses), claims, and lawsuits that arise or result from your use of the Software.
12. **Trademarks.** Certain of the product and PacBio names used in this Agreement, the Software may constitute trademarks of PacBio or third parties. You are not authorized to use any such trademarks.
13. **Export Restrictions.** YOU UNDERSTAND AND AGREE THAT THE SOFTWARE IS SUBJECT TO UNITED STATES AND OTHER APPLICABLE EXPORT-RELATED LAWS AND REGULATIONS AND THAT YOU MAY NOT EXPORT, RE-EXPORT OR TRANSFER THE SOFTWARE OR ANY DIRECT PRODUCT OF THE SOFTWARE EXCEPT AS PERMITTED UNDER THOSE LAWS. WITHOUT LIMITING THE FOREGOING, EXPORT, RE-EXPORT, OR TRANSFER OF THE SOFTWARE TO CUBA, IRAN, NORTH KOREA, SYRIA, RUSSIA, BELARUS, AND THE REGIONS OF CRIMEA, LNR, AND DNR OF UKRAINE IS PROHIBITED.
14. **General.** This Agreement is governed by the laws of the State of California, without reference to its conflict of laws principles. This Agreement is the entire agreement between you and PacBio and supersedes any other communications with respect to the Software. If any provision of this Agreement is held invalid or unenforceable, the remainder of this Agreement will continue in full force and effect.
54 changes: 54 additions & 0 deletions README.md
@@ -1 +1,55 @@
[![Build status](https://github.com/PacificBiosciences/waffle_con/actions/workflows/test-ci.yml/badge.svg)](https://github.com/PacificBiosciences/waffle_con/actions)


# waffle_con
This crate contains our implementation of the Dynamic WFA (DWFA) consensus algorithms, or `waffle_con`.
The algorithms contained within were built to support the consensus steps of [pb-StarPhase](https://github.com/PacificBiosciences/pb-StarPhase).
Any issues related to StarPhase should be reported on the StarPhase GitHub page.

This crate contains functionality for:

* ConsensusDWFA - One input string per sequence, one output sequence
* DualConsensusDWFA - One input string per sequence, 1 or 2 output sequences
* PriorityConsensusDWFA - Multiple input strings per sequence (priority), 1+ output sequences

## Full documentation
`waffle_con` provides extensive in-line documentation.
A user-friendly HTML version can be generated via `cargo doc`.

## Methods
At a high level, this project provides several consensus algorithms that build on one another, starting from a baseline single-consensus method.
The single-consensus method is based on the idea of cost-based exploration of an assembly (or consensus) space.
"Cost" in this context is essentially the total edit distance between the assembled sequence and the set of inputs.
We use a dynamic WFA algorithm to both nominate and score the assembled sequences.
These sequences are then explored using a Dijkstra-like approach (least-cost-first).
Thus the core loop is:

* Pop a candidate from the priority queue - check if this candidate is finished and compare to current best results
* If not finished, each dynamic WFA nominates one or more extensions to the candidate
* Each candidate extension is added to the dynamic WFA for each sequence
* The total combined edit distance is used to score that candidate
* The candidate is placed into the min-cost priority queue
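
The loop above can be illustrated with a small, self-contained toy.
This is only a sketch under stated assumptions, not the crate's implementation: the dynamic WFA is replaced by a plain edit-distance table, every symbol of a caller-supplied alphabet is nominated as an extension, and the names `last_row`, `score`, and `toy_consensus` are hypothetical.
An unfinished candidate is scored against the best-matching prefix of each read, which is a lower bound on the cost of any of its extensions, so the first finished candidate popped from the queue has minimal total edit distance.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Edit distances between `candidate` and every prefix of `read`
/// (the last row of the classic dynamic-programming table).
fn last_row(candidate: &[u8], read: &[u8]) -> Vec<usize> {
    let mut row: Vec<usize> = (0..=read.len()).collect();
    for (i, &c) in candidate.iter().enumerate() {
        let mut prev_diag = row[0];
        row[0] = i + 1;
        for (j, &r) in read.iter().enumerate() {
            let substitution = prev_diag + usize::from(c != r);
            prev_diag = row[j + 1];
            row[j + 1] = substitution.min(row[j] + 1).min(prev_diag + 1);
        }
    }
    row
}

/// Total cost of a candidate over all reads. A finished candidate must match
/// each full read; an unfinished one may match any prefix of each read.
fn score(candidate: &[u8], reads: &[&[u8]], finished: bool) -> usize {
    reads
        .iter()
        .map(|read| {
            let row = last_row(candidate, read);
            if finished {
                row[read.len()]
            } else {
                *row.iter().min().unwrap()
            }
        })
        .sum()
}

/// Least-cost-first search for a sequence minimizing the total edit distance.
pub fn toy_consensus(reads: &[&[u8]], alphabet: &[u8]) -> Option<(Vec<u8>, usize)> {
    if reads.is_empty() {
        return None;
    }
    // Min-cost priority queue of (cost, finished flag, candidate).
    let mut queue = BinaryHeap::new();
    queue.push(Reverse((0usize, false, Vec::<u8>::new())));
    while let Some(Reverse((cost, finished, candidate))) = queue.pop() {
        if finished {
            // Every remaining entry costs at least as much, so this is a best result.
            return Some((candidate, cost));
        }
        // Option 1: stop here and pay the full edit distance to every read.
        queue.push(Reverse((score(&candidate, reads, true), true, candidate.clone())));
        // Option 2: extend the candidate by one symbol and re-score it.
        for &symbol in alphabet {
            let mut extended = candidate.clone();
            extended.push(symbol);
            let extended_cost = score(&extended, reads, false);
            queue.push(Reverse((extended_cost, false, extended)));
        }
    }
    None
}
```

For the reads `ACGT`, `ACGT`, and `AGGT` over the alphabet `ACGT`, this toy returns `ACGT` with a total cost of 1.
Re-scoring from scratch and nominating every symbol keeps the sketch short but is far slower than incrementally extending a wavefront per read.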

The algorithm can be extended to the "dual" option by allowing it to split into two candidate sequences when sufficient evidence is present (e.g., sufficient minor allele frequency and/or number of sequences).
Then the best score is used for each input sequence when calculating the edit distance cost.
This can be further extended into a multi-consensus by repeatedly running dual-consensus in a binary-tree-like fashion (e.g., repeatedly split the sequences into groups until they do not want to split further).
Finally, priority consensus is just multi-consensus on a chain of sequence inputs instead of a single sequence input.
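
As a rough illustration of the dual scoring rule, the sketch below takes the edit distance of each input sequence to two candidate consensuses and charges each input only its better score.
The function names, the tie-breaking rule, and the way the two split criteria are combined are assumptions made for this example and are not taken from the crate.

```rust
/// Given (distance to consensus 0, distance to consensus 1) for each input,
/// return the total dual cost and the consensus index each input is assigned to.
/// Ties are assigned to consensus 0 (an arbitrary choice for this sketch).
pub fn dual_cost(distances: &[(usize, usize)]) -> (usize, Vec<usize>) {
    let mut total = 0;
    let mut assignment = Vec::with_capacity(distances.len());
    for &(d0, d1) in distances {
        if d0 <= d1 {
            total += d0;
            assignment.push(0);
        } else {
            total += d1;
            assignment.push(1);
        }
    }
    (total, assignment)
}

/// Hypothetical split check: the smaller group must contain at least `min_count`
/// inputs and make up at least `min_af` of all inputs.
pub fn split_is_supported(assignment: &[usize], min_count: usize, min_af: f64) -> bool {
    if assignment.is_empty() {
        return false;
    }
    let ones = assignment.iter().filter(|&&a| a == 1).count();
    let minor = ones.min(assignment.len() - ones);
    minor >= min_count && (minor as f64) / (assignment.len() as f64) >= min_af
}
```

With distances `[(0, 3), (1, 4), (5, 0), (2, 2)]`, the total dual cost is 3 and the third input is the only one assigned to the second consensus.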

## Limitations
`waffle_con` has been designed for the specific purpose of PGx consensus in StarPhase using long, accurate HiFi reads.
The underlying algorithms rely on a basic edit-distance wavefront algorithm, which scales with the total edit distance between two sequences.
Thus, high error or high divergence sequences are more likely to lead to longer run times.
Additionally, high error may cause the traversal algorithm to "get lost" in the search space due to weak consensus, which may ultimately lead to lower quality consensus sequences.
Best results are obtained with full-length input sequences.
Partial sequences require estimating start/stop positions, which injects more opportunities for error.
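
To make the scaling behavior concrete, here is a minimal static edit-distance wavefront, written for illustration only.
It is not the dynamic variant used by this crate, and the function name is hypothetical.
The outer loop runs once per unit of edit distance, and the wavefront widens by at most one diagonal on each side per iteration, which is why similar sequences finish quickly and divergent ones take longer.

```rust
/// Unit-cost edit distance computed with a furthest-reaching wavefront.
/// Diagonal `k` holds cells where (index in b) - (index in a) == k.
pub fn wavefront_edit_distance(a: &[u8], b: &[u8]) -> usize {
    let n = a.len() as isize;
    let m = b.len() as isize;
    // Diagonals range over -n..=m; shift them into vector indices.
    let idx = |k: isize| (k + n) as usize;
    // Slide along diagonal `k` for free while characters match.
    let extend = |mut i: isize, k: isize| -> isize {
        while i < n && i + k < m && a[i as usize] == b[(i + k) as usize] {
            i += 1;
        }
        i
    };

    // wf[idx(k)] = furthest position in `a` reached on diagonal k (-1 = unreached).
    let mut wf = vec![-1isize; (n + m + 1) as usize];
    wf[idx(0)] = extend(0, 0);
    let mut score: isize = 0;
    loop {
        // Done once the diagonal through (n, m) has consumed all of `a`.
        if wf[idx(m - n)] >= n {
            return score as usize;
        }
        score += 1;
        let mut next = vec![-1isize; wf.len()];
        for k in (-score).max(-n)..=score.min(m) {
            let mut best: isize = -1;
            // Substitution: stay on the diagonal, advance in both sequences.
            if wf[idx(k)] >= 0 {
                best = best.max(wf[idx(k)] + 1);
            }
            // Deletion from `a`: arrive from diagonal k + 1, advance in `a` only.
            if k + 1 <= m && wf[idx(k + 1)] >= 0 {
                best = best.max(wf[idx(k + 1)] + 1);
            }
            // Insertion from `b`: arrive from diagonal k - 1, advance in `b` only.
            if k - 1 >= -n && wf[idx(k - 1)] >= 0 {
                best = best.max(wf[idx(k - 1)]);
            }
            if best >= 0 {
                // Clamp to the end of the diagonal, then take free matches.
                next[idx(k)] = extend(best.min(n).min(m - k), k);
            }
        }
        wf = next;
    }
}
```

For example, `wavefront_edit_distance(b"kitten", b"sitting")` needs three rounds of the outer loop, one per edit.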

## Support information
The `waffle_con` crate is pre-release software intended for research use **only** and not for use in diagnostic procedures.
While efforts were made to ensure that `waffle_con` lives up to the quality that PacBio strives for, we make no warranty regarding this software.

As `waffle_con` is **not** covered by any service level agreement or the like, please do not contact a PacBio Field Applications Scientist or PacBio Customer Service for assistance with any `waffle_con` release.
Please report all issues through GitHub instead.
We make no warranty that any such issue will be addressed, to any extent or within any time frame.

### DISCLAIMER
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
55 changes: 55 additions & 0 deletions benches/consensus_bench.rs
@@ -0,0 +1,55 @@

use criterion::{black_box, criterion_group, criterion_main, Criterion};

use waffle_con::cdwfa_config::CdwfaConfigBuilder;
use waffle_con::consensus::ConsensusDWFA;
use waffle_con::example_gen::generate_test;

pub fn bench_consensus(c: &mut Criterion) {
    let alphabet_size = 4;
    let seq_lens = [1000, 10000];
    let num_samples = [8, 30];
    let error_rates = [0.0, 0.01, 0.02];

    let mut benchmark_group = c.benchmark_group("consensus-group");
    benchmark_group.sample_size(10);

    for &sl in seq_lens.iter() {
        for &ns in num_samples.iter() {
            // require 25% of reads to go forth
            let config = CdwfaConfigBuilder::default()
                .min_count((ns as u64) / 4)
                .build().unwrap();
            for &er in error_rates.iter() {
                let (_consensus, dataset) = generate_test(alphabet_size, sl, ns, er);
                /*
                // uncomment to print out the strings, mostly for initial testing
                // (rename `_consensus` to `consensus` above first)
                println!("{consensus:?}");
                for d in dataset.iter() {
                    println!("{d:?}");
                }
                */
                let test_label = format!("consensus_{alphabet_size}x{sl}x{ns}_{er}");
                benchmark_group.bench_function(&test_label, |b| b.iter(|| {
                    black_box({
                        let mut consensus_dwfa = ConsensusDWFA::with_config(config.clone()).unwrap();
                        for s in dataset.iter() {
                            consensus_dwfa.add_sequence(s).unwrap();
                        }
                        let resolved_consensus = consensus_dwfa.consensus().unwrap();

                        // this was an initial sanity check we did as a basic test
                        // assert_eq!(resolved_consensus[0].sequence(), &consensus);

                        resolved_consensus
                    });
                }));
            }
        }
    }

    benchmark_group.finish();
}

criterion_group!(benches, bench_consensus);
criterion_main!(benches);