initial copy of relevant crate files #1

Merged
merged 1 commit into from
Jan 16, 2025
14 changes: 14 additions & 0 deletions .github/workflows/test-ci.yml
@@ -0,0 +1,14 @@
name: Continuous integration
on: [push, pull_request]

jobs:
  test:
    name: cargo test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
        with:
          components: clippy
      - run: RUST_LOG=trace cargo test --all-features --release
      - run: cargo clippy -- -D warnings
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,3 @@

# v0.1.0
Initial release - exact copy of HiPhase implementation, but with some slight comment updates
12 changes: 12 additions & 0 deletions Cargo.toml
@@ -0,0 +1,12 @@
[package]
name = "waffle_graph"
version = "0.1.0"
edition = "2021"

[dependencies]
bit-vec = "0.8.0"
log = "0.4.25"
priority-queue = "2.1.1"
rustc-hash = "2.1.0"
simple-error = "0.3.1"
thiserror = "2.0.11"
50 changes: 50 additions & 0 deletions README.md
@@ -0,0 +1,50 @@
[![Build status](https://github.com/PacificBiosciences/waffle_graph/actions/workflows/test-ci.yml/badge.svg)](https://github.com/PacificBiosciences/waffle_graph/actions)

# waffle_graph
This crate contains functions to align sequences to a graph using a WFA-based (wavefront algorithm) approach.
The approach was designed with genomics in mind, so the terms are genomics-oriented.

Core approach:
* Identify a backbone sequence (e.g., reference genome)
* Identify and define alternate paths (i.e., variants to the genome)
* Build the graph from those definitions
* Align to the graph, noting which paths are traversed on the minimum edit distance path(s)
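The steps above can be illustrated with a small, self-contained sketch. It is not the `waffle_graph` API: the `Variant` struct, `apply_variants`, `edit_distance`, and `best_path` are hypothetical names introduced only for this example. Instead of building a graph and running WFA, it brute-forces every combination of variants and scores each resulting sequence with a plain dynamic-programming edit distance, which is only practical for a handful of variants. It assumes variants are sorted by position and do not overlap.

```rust
/// Hypothetical variant for illustration: replace `ref_len` backbone bases
/// starting at `pos` with the bases in `alt`.
#[derive(Debug, Clone)]
struct Variant {
    pos: usize,
    ref_len: usize,
    alt: Vec<u8>,
}

/// Spell out the sequence of one path: the backbone with the selected variants applied.
/// Assumes `variants` is sorted by `pos` and the variants do not overlap.
fn apply_variants(backbone: &[u8], variants: &[Variant], selected: &[bool]) -> Vec<u8> {
    let mut seq = Vec::with_capacity(backbone.len());
    let mut cursor = 0;
    for (v, &take) in variants.iter().zip(selected) {
        if !take {
            continue;
        }
        seq.extend_from_slice(&backbone[cursor..v.pos]);
        seq.extend_from_slice(&v.alt);
        cursor = v.pos + v.ref_len;
    }
    seq.extend_from_slice(&backbone[cursor..]);
    seq
}

/// Plain dynamic-programming edit distance, used here as a simple stand-in for WFA.
fn edit_distance(a: &[u8], b: &[u8]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let sub = prev[j] + usize::from(ca != cb);
            let del = prev[j + 1] + 1;
            let ins = curr[j] + 1;
            curr.push(sub.min(del).min(ins));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Try every variant combination (fewer than 32 variants assumed) and return the
/// minimum edit distance together with the variants traversed on that path.
fn best_path(backbone: &[u8], variants: &[Variant], read: &[u8]) -> (usize, Vec<bool>) {
    let n = variants.len();
    let mut best = (usize::MAX, vec![false; n]);
    for mask in 0..(1u32 << n) {
        let selected: Vec<bool> = (0..n).map(|i| ((mask >> i) & 1) == 1).collect();
        let path_seq = apply_variants(backbone, variants, &selected);
        let ed = edit_distance(&path_seq, read);
        if ed < best.0 {
            best = (ed, selected);
        }
    }
    best
}
```

For the backbone `ACGTACGTAC` with a `G>T` substitution at position 2 and a `CGT>C` deletion at position 5, the read `ACGTACAC` aligns with edit distance 0 on the path that takes only the deletion, whereas aligning it to the bare backbone costs 2 edits.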

Performance notes:
* The underlying algorithm scales with WFA, so high error rates will increase compute time and memory consumption
* If variants (alternate paths) are missing from your graph, then the edit distance will be higher and the algorithm slower

Tools that use a form of this:
* [HiPhase](https://github.com/PacificBiosciences/HiPhase) - originally designed for HiPhase; this crate was extracted from the HiPhase code for independent usage
* [pb-StarPhase](https://github.com/PacificBiosciences/pb-StarPhase) - currently references the HiPhase implementation

## Full documentation
`waffle_graph` provides extensive in-line documentation.
A user-friendly HTML version can be generated via `cargo doc`.

## Methods
At a high level, this project implements a combination of partial-order alignment (POA) with the wavefront algorithm (WFA).
WFA runs best when the edit distance is low, and a graph structure helps reduce the number of edits by adding known variation to the graph.
This particular implementation does not allow loops (cycles) in the graph.

This implementation was designed with HiPhase variant identification as the core application.
In this problem, there is a defined backbone (the reference genome) and a set of known deviations from that backbone (variants).
Deviations can be small (single base change) or large "structural variants" (multi-base insertion/deletion).
Adding more variants to the graph tends to improve performance of the algorithm.

## Limitations
`waffle_graph` was designed specifically for variant identification in HiPhase using long, accurate HiFi reads.
The underlying algorithms rely on a basic edit-distance wavefront algorithm, which scales with the total edit distance between two sequences.
Thus, high error or high divergence sequences are more likely to lead to longer run times.
While adding true variants tends to reduce run-time, adding too many variants that are not a part of your sample may bloat the graph and lead to longer run-times.
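To make the scaling concrete, here is a minimal sketch of an edit-distance wavefront between two linear sequences. It is not the crate's graph implementation; the function name `wfa_edit_distance` and the returned cell counter are assumptions made for illustration. Each wavefront stores, per diagonal, the furthest offset reachable with `s` edits, so the amount of work grows with the edit distance rather than with the full product of the sequence lengths.

```rust
/// Edit distance via a basic wavefront approach (unit-cost mismatch, insertion, deletion).
/// Returns (edit distance, number of wavefront cells computed).
fn wfa_edit_distance(a: &[u8], b: &[u8]) -> (usize, usize) {
    let (n, m) = (a.len() as isize, b.len() as isize);
    // Diagonal k = j - i, where i indexes `a` and j indexes `b`.
    let k_end = m - n;

    // Slide along diagonal k from offset j for as long as the bases match.
    let extend = |k: isize, mut j: isize| -> isize {
        let mut i = j - k;
        while i < n && j < m && a[i as usize] == b[j as usize] {
            i += 1;
            j += 1;
        }
        j
    };

    // The wavefront for score s covers diagonals -s..=s, stored at index k + s.
    let mut wf: Vec<Option<isize>> = vec![Some(extend(0, 0))];
    let mut cells = 1;
    let mut s: isize = 0;
    loop {
        // Done once the final diagonal reaches the end of both sequences.
        if k_end.abs() <= s {
            if let Some(j) = wf[(k_end + s) as usize] {
                if j >= m {
                    return (s as usize, cells);
                }
            }
        }
        let get = |k: isize| -> Option<isize> {
            if k.abs() <= s { wf[(k + s) as usize] } else { None }
        };
        let mut next = Vec::with_capacity((2 * s + 3) as usize);
        for k in -(s + 1)..=(s + 1) {
            let ins = get(k - 1).map(|j| j + 1); // consume one base of `b`
            let sub = get(k).map(|j| j + 1); // consume one base of each
            let del = get(k + 1); // consume one base of `a`
            let best = [ins, sub, del]
                .iter()
                .flatten()
                .copied()
                .filter(|&j| j <= m && j - k >= 0 && j - k <= n)
                .max();
            next.push(best.map(|j| extend(k, j)));
            cells += 1;
        }
        wf = next;
        s += 1;
    }
}
```

Identical sequences finish after a single cell, while each additional edit adds a wider wavefront, which is why high-error or highly divergent inputs cost more time and memory.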

## Support information
The `waffle_graph` crate is pre-release software intended for research use **only** and not for use in diagnostic procedures.
While efforts were made to ensure that `waffle_graph` lives up to the quality that PacBio strives for, we make no warranty regarding this software.

As `waffle_graph` is **not** covered by any service level agreement or the like, please do not contact a PacBio Field Applications Scientist or PacBio Customer Service for assistance with any `waffle_graph` release.
Please report all issues through GitHub instead.
We make no warranty that any such issue will be addressed, to any extent or within any time frame.

### DISCLAIMER
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
3 changes: 3 additions & 0 deletions src/data_types/mod.rs
@@ -0,0 +1,3 @@

/// Contains Variant type as well as supporting definitions
pub mod variants;