phylogenetic build fails because of missing nextalign #2

genehack · 2024-05-14T00:34:27Z

Current Behavior

# from repo root
nextstrain build ./ingest
# lots of output, things work

nextstrain build ./phylogenetic
# lots of output, things don't work; first error: 

/bin/bash: line 1: nextalign: command not found

Possible solution

Based on the archived repo, nextalign was moved into nextclade -- but the page linked for nextalign-cli 404s.

I'm guessing the right answer here is to update the Snakemake file to either replace the nextalign call with nextclade run plus some set of options, or (looking at the zika repo) convert things over to using augur for the alignment?

@kimandrews any insight you can provide would be appreciated!


kimandrews · 2024-05-14T03:47:48Z

I used augur align for whole genome alignment in the measles phylogenetic workflow, whereas I used nextclade run for aligning the shorter N450 region.

victorlin · 2024-05-14T16:50:13Z

The context here is that nextalign was bundled with Nextclade in v2 and removed in v3, which is probably the version you have. This pathogen repo seems to be written for Nextclade v2, though that dependency isn't stated anywhere.

I'm assuming nextalign was chosen for this repo intentionally, so potential fixes would be to (1) mention the Nextclade v2 dependency explicitly and set your environment up with that or (2) migrate to Nextclade v3 by using nextclade run as you've mentioned. (2) is probably the best move.
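To tell which of the two situations an environment is in, it can help to check which executables are actually on PATH. The following is a minimal sketch using only the Python standard library; the function name `pick_aligner` is hypothetical and not part of this repo.

```python
import shutil


def pick_aligner(which=shutil.which):
    """Return the first aligner executable name found on PATH, or None.

    `which` is injectable so the lookup can be exercised without a real PATH.
    """
    # Per the comment above: nextalign shipped with Nextclade v2 only,
    # so a v3 environment is expected to provide nextclade alone.
    for name in ("nextalign", "nextclade"):
        if which(name) is not None:
            return name
    return None
```

If this returns `"nextclade"` rather than `"nextalign"`, the environment most likely has Nextclade v3 and the `nextalign: command not found` error is expected.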

joverlee521 · 2024-05-14T17:04:14Z

Ah, the workflow was created before Nextclade v3 was released.

I think we'd want to migrate it to nextclade3 run following Nextclade's migration guide.

genehack · 2024-05-14T17:40:44Z

> I think we'd want to migrate it to nextclade3 run following Nextclade's migration guide.

Thanks! I'll check out that guide.

* `output.insertions` will be a TSV file now
* `--reference` is now spelled `--input-ref`
* `--genemap` is now spelled `--input-annotation`
* `--retry-reverse-complement` is no longer supported
* `--output-insertions` is now spelled `--output-tsv`

Note: dropping `--retry-reverse-complement` is the one that I am most unsure about, but this version completes this step.
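The flag renames listed above can be sketched as a small translation table. This is an illustration only: `RENAMES` and `translate_args` are hypothetical names, the file names in the usage note are made up, and the table covers just the flags mentioned in this thread, not the full migration guide.

```python
# Renames taken from the list above; None marks a flag that was dropped.
RENAMES = {
    "--reference": "--input-ref",
    "--genemap": "--input-annotation",
    "--output-insertions": "--output-tsv",
    "--retry-reverse-complement": None,
}


def translate_args(nextalign_args):
    """Respell a nextalign v2 argument list using the Nextclade v3 flag names."""
    translated = []
    for arg in nextalign_args:
        # Handle both "--flag value" and "--flag=value" spellings.
        flag, sep, value = arg.partition("=")
        if flag not in RENAMES:
            translated.append(arg)  # unknown flags and positional args pass through
            continue
        new_flag = RENAMES[flag]
        if new_flag is None:
            continue  # flag no longer supported: drop it
        translated.append(new_flag + sep + value)
    return translated
```

For example, `translate_args(["--reference", "ref.fasta", "--retry-reverse-complement"])` yields `["--input-ref", "ref.fasta"]`. The sketch only respells flags; it does not account for the insertions output now being a TSV file, so anything downstream that reads that file still needs checking.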

Initially, the workflow failed with the following error:

```
Error:
   0: When reading genome annotation
   1: When reading file: "config/hku1/genemap.gff"
   2: Attempted to parse the genome annotation as JSON and as GFF, but both attempts failed:
      JSON error: invalid type: string "NC_006577.2\tfeature\tsource\t1\t29926\t.\t+\t.\tgene=nuc NC_006577.2\tfeature\tgene\t206\t13600\t.\t+\t.\tgene=ORF1a NC_006577.2\tfeature\tgene\t13600\t21753\t.\t+\t.\tgene=ORF1b NC_006577.2\tfeature\tgene\t21773\t22933\t.\t+\t.\tgene=HE NC_006577.2\tfeature\tgene\t22942\t27012\t.\t+\t.\tgene=Spike NC_006577.2\tfeature\tgene\t22978\t25221\t.\t+\t.\tgene=S1 NC_006577.2\tfeature\tgene\t27051\t27380\t.\t+\t.\tgene=S2 NC_006577.2\tfeature\tgene\t27051\t27380\t.\t+\t.\tgene=ORF4 NC_006577.2\tfeature\tgene\t27373\t27621\t.\t+\t.\tgene=E NC_006577.2\tfeature\tgene\t27633\t28304\t.\t+\t.\tgene=M NC_006577.2\tfeature\tgene\t28320\t29645\t.\t+\t.\tgene=N NC_006577.2\tfeature\tgene\t28342\t28959\t.\t+\t.\tgene=N2", expected struct GeneMap at line 2 column 1
      GFF3 error: When processing gene, 'N': When processing feature group 'N' ('N') of type 'gene': genes must consist of exactly one feature: Expected exactly one element, but found: 2
   2: Location: /workdir/packages/nextclade/src/gene/gene_map.rs:56
```

While looking at the referenced file, and comparing it to the other `genemap.gff` files in the config, I noticed that all the others used `gene_name` for everything after the first `gene` line. I changed this file to match, and the workflow got past the point where it was previously erroring out. I have no idea why this worked; hopefully somebody will explain in the code review.
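The manual edit described above could be sketched roughly as follows. This is an assumption-laden illustration, not the change that was committed: `use_gene_name` is a hypothetical helper, and it reads "everything after the first `gene` line" as "every feature line after the first one". It also does not explain why Nextclade v3 accepts the result.

```python
def use_gene_name(gff_text):
    """Rename the `gene=` attribute to `gene_name=` on every feature line
    except the first, leaving comments and blank lines untouched."""
    out = []
    seen_first_feature = False
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            out.append(line)
            continue
        if not seen_first_feature:
            # Assumption: the first feature line keeps its `gene=` attribute.
            seen_first_feature = True
            out.append(line)
            continue
        cols = line.split("\t")
        if len(cols) == 9:  # column 9 holds the attributes in GFF
            attrs = cols[8].split(";")
            cols[8] = ";".join(
                "gene_name=" + a[len("gene="):] if a.startswith("gene=") else a
                for a in attrs
            )
        out.append("\t".join(cols))
    trailing = "\n" if gff_text.endswith("\n") else ""
    return "\n".join(out) + trailing
```

Applied to the first two records from the error message, the first line stays as `gene=nuc` and the second becomes `gene_name=ORF1a`.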

ivan-aksamentov · 2024-05-17T00:24:13Z

We should probably also document the Nextalign-like usage in the main Nextclade docs, i.e. using Nextclade v3 without a dataset and providing individual files using --input-* args instead. Invoking Nextclade v3 with individual args is mostly the same as, or very similar to, invoking Nextalign v2. And I believe that swapping the nextclade executable in place of nextalign should produce somewhat informative errors.

Documenting it better would allow for a smoother transition for v2 users and also highlight that Nextclade v3 can be used as an aligner even where there's no dataset for a particular organism.

Upd: I created an issue: nextstrain/nextclade#1456

genehack added the bug Something isn't working label May 14, 2024

genehack self-assigned this May 14, 2024

genehack added a commit that referenced this issue May 14, 2024

snakefmt applied to phylogenetic/rules/prepare_sequences.smk #2

2909a03

genehack mentioned this issue May 15, 2024

Update phylo workflow #3

Merged

genehack closed this as completed May 16, 2024

ivan-aksamentov mentioned this issue May 17, 2024

docs: document nextalign-like use-case nextstrain/nextclade#1456

Open
