Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates from gaf #86

Merged
merged 20 commits into from
Nov 14, 2024
Merged
Show file tree
Hide file tree
Changes from 14 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 7 additions & 6 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ sha2 = "0.10.8"
tempfile = "3.12.0"
tempdir = "0.3.7"
interavl = "0.2.0"
regex = "1.11.1"

[dev-dependencies]
cargo-llvm-cov = "0.6.13"
2 changes: 1 addition & 1 deletion docs/iterative_changes.md
Original file line number Diff line number Diff line change
Expand Up @@ -73,4 +73,4 @@ contained a heterozygous insertion, it would create an ambiguity in positions do
these changes are not permitted. However, this format is very amenable for simpler organisms such as e. coli.

For changes where positions are ambiguous, the following approaches may be taken to model changes:
* links here
* [Updating sequences using alignments](updates_with_gaf.md)
96 changes: 96 additions & 0 deletions docs/updates_with_gaf.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
# Updates with alignments

Suppose we are working with a sample where we are unable to provide an unambiguous coordinate into a chromosome. This
is a common scenario in non-haploid organisms with tracked heterozygous changes (for example, suppose after each round
of engineering, observed sequencing variants are incorporated into the graph). For these scenarios, updates need to
be made with respect to node ids in the graph. This is a cumbersome process to identify nodes and determine positions,
and is an area we are constantly trying to improve. One manner to circumvent this is to use alignments and the output
of graph aligners to localize changes. The following example walks through updating a sample where after each round
of engineering, background changes are recorded into the graph.

Our sample in this case is called `K562`, a cell line. We wish to insert a gene into chromsome 22. To do this, we will
take the region flanking the left and right side of the insertion point, and use a graph aligner to identify the nodes
in our graph corresponding to this region. Then, the provided insert will be placed in between the bounds of that region.

For this, we accept a csv file format which gen will autoformat into a fasta file for each update we wish to make.

example.csv
```
id,left,sequence,right
simple_insert,atgggagtataattttagatagtgaagatttctgtattcaaatgccacat,atcgatcgatcggagaga,acacagaaaaaggcaggcagagaaaataacaaggataaagacactgaagt
```

The id field is optional. If provided, an [accession](path_accessions.md) will be created with the id given.

NOTE: This is not implemented currently, as accessions may need to change to reference a block_group instead of a path.
Because the above mechanism doesn't actually have a path. It could be we do not have this behavior for this update style
or change accessions.

```console
gen transform --format-csv-for-gaf example.csv > example.fa
```

This creates a fasta file suitable for alignment with graph aligners such as Minigraph, GraphAligner, VG, etc. The goal is
to obtain a GAF file.

This file will appear as such:

example.fa
```
>simple_insert_left
atgggagtataattttagatagtgaagatttctgtattcaaatgccacat
>simple_insert_right
acacagaaaaaggcaggcagagaaaataacaaggataaagacactgaagt
```

This file can then be fed to an aligner. Here are several examples:

First, we need to export to a common graph format, GFA

```console
gen --db example.db export -n k562 -g k562.gfa
Chris7 marked this conversation as resolved.
Show resolved Hide resolved
```

Then we can use this for various programs:

VG

Note, vg will convert node ids and the gaf must be processed so node ids are restored to their original id listed
in translations.txt
```console
vg convert -g k562.gfa -p -T translations.txt > k562.vg
vg index -g k562.gcsa -x k562.xg k562.vg
vg map --gaf -d k562 -F example.fa > example.gaf
```

GraphAligner
```console
GraphAligner -g k562.gfa -f example.fa -a aln.gaf -x vg
```

Minigraph

For minigraph, we use -w 3 to capture our shorter sequences.
```console
minigraph -c -x sr k562.gfa example.fa -w 3
```

The outputs of these programs will be a gaf file, which for the above example will be similar to this:

example.gaf
```text
simple_insert_left 50 0 49 + >44.0 50 0 49 49 49 28 tp:A:P NM:i:0 cm:i:15 s1:i:49 s2:i:0 dv:f:0 cg:Z:49= ds:Z::49
simple_insert_right 50 1 50 + >46.0 50 1 50 49 49 28 tp:A:P NM:i:0 cm:i:14 s1:i:49 s2:i:0 dv:f:0 cg:Z:49= ds:Z::49
```

This indicates the left arm starts at node `44.0` in our graph and aligns for 50 bases. And the right arm begins at node `46.0`.

This GAF is then fed into gen via:

```console
gen --db example.db update-gaf -n k562 -f example.fa -g example.gaf
```

This will create a new entry containing our insertion sequence starting at the end of the left homology arm and before the
right arm. If there are misalignments in the homology arm, those regions will not be used. For example, suppose the CIGAR string
was 45=4X, the insertion would begin after the 45th base pair. If more fine tuning is desired, the CIGAR string can be manually modified.
4 changes: 4 additions & 0 deletions fixtures/chr22_het.fa
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
>test_left
atgggagtataattttagatagtgaagatttctgtattcaaatgccacat
>test_right
acacagaaaaaggcaggcagagaaaataacaaggataaagacactgaagt
7 changes: 7 additions & 0 deletions fixtures/chr22_het.gaf
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
test_left 50 0 50 + >138.0 50 0 50 50 50 31 tp:A:P NM:i:0 cm:i:30 s1:i:50 s2:i:0 dv:f:0 cg:Z:50= ds:Z::50
test_right 50 0 50 + >140.0 50 0 50 50 50 31 tp:A:P NM:i:0 cm:i:30 s1:i:50 s2:i:0 dv:f:0 cg:Z:50= ds:Z::50
2node_span_left 85 0 85 + >454.0>455.0 100 0 79 79 85 60 tp:A:P NM:i:6 cm:i:20 s1:i:79 s2:i:0 dv:f:0.0561 cg:Z:22=6I57= ds:Z::22+aaaaa[a]:57
2node_span_right 89 0 89 + >530.0>531.0 100 11 100 89 89 60 tp:A:P NM:i:0 cm:i:49 s1:i:90 s2:i:0 dv:f:0.0163 cg:Z:89= ds:Z::89
left_extreme_right 50 0 50 + >3.0 50 0 50 50 50 31 tp:A:P NM:i:0 cm:i:30 s1:i:50 s2:i:0 dv:f:0 cg:Z:50= ds:Z::50
right_extreme_left 50 0 50 + >1001.0 50 0 50 50 50 31 tp:A:P NM:i:0 cm:i:30 s1:i:50 s2:i:0 dv:f:0 cg:Z:50= ds:Z::50
nonsense_entry 50 1 50 + >140.0 50 1 50 49 49 28 tp:A:P NM:i:0 cm:i:14 s1:i:49 s2:i:0 dv:f:0 cg:Z:49= ds:Z::49
Loading