Rewrite config syntax #104

jameshadfield · 2024-11-22T03:42:26Z

When I implemented the configs in #72 I considered using the config structure that this PR introduces, but felt it was too different from the previous format (and from other configs in the nextstrain ecosystem). After thinking it over in the intervening months I found the time to implement it here, and I think it massively simplifies things and makes the configs and snakefile much easier to reason about.

Background

This PR adds two large sections to the README which are good background to the changes here.

This Google Doc discusses a number of different config formats. Of those detailed in that doc, I think this one is the only one that scales to workflows such as avian-flu.

Seasonal flu uses a somewhat related syntax for certain builds where we join wildcards together into a single string, but I don't think there's much else similar about the approaches.

The main idea

Avian-flu builds use a three-tiered wildcard setup of subtype, segment and time (for instance subtype=h5n1, segment=ha, time=2y is responsible for nextstrain.org/avian-flu/h5n1/ha/2y). For each config parameter we represent the values as a dict where the keys are a slash-separated string of the wildcards, with "*" as a match-anything symbol. For a given wildcard combination we search the config parameters and pick the most specific match.

So for a parameter which never changes across any wildcard combination you could use "*" for each wildcard:

filter: 
  target_sequences_per_tree:
    "*/*/*": 3000

And if you wanted the h5n1 2y builds to use 5,000 sequences, the h5n1 all-time builds to use 10,000, and all other builds to use 3,000, you could write:

filter: 
  target_sequences_per_tree:
    "h5n1/*/all-time": 10000
    "h5n1/*/2y": 5000 # or 'h5n1/*/*'
    "*/*/*": 3000

This is very powerful when different parameters vary across different wildcards, as they frequently do within avian-flu. (I don't know of any other format which can handle this nicely.) For instance, the minimum length varies over segment but not subtype or time, so we can write:

filter:
  min_length:
    "*/pb2/*": 2100
    "*/pb1/*": 2100
    "*/pa/*": 2000
    ...

whereas the columns for DTA vary over subtype but not segment or time, and we can use the exact same syntax to convey this very different combination:

traits:
  columns:
    "h5nx/*/*": region
    "h5n1/*/*": region country
    "h7n9/*/*": country division
    "h9n2/*/*": region country

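The "pick the most specific match" rule described above can be sketched in a few lines of Python. This is only an illustration, not the PR's implementation: the function name `most_specific_match` is hypothetical, and it assumes that "most specific" means the matching key with the fewest `*` fields, with ties going to whichever key is listed first.

```python
def most_specific_match(patterns, subtype, segment, time):
    """Return the value whose key best matches the given wildcards.

    Assumption (not confirmed by the PR): specificity is the number of
    non-"*" fields in the key; ties go to the key listed first.
    """
    query = (subtype, segment, time)
    best_value, best_score = None, -1
    for key, value in patterns.items():
        fields = key.split("/")
        if len(fields) != len(query):
            raise ValueError(f"expected 3 slash-separated fields, got {key!r}")
        # A field matches if it is the catch-all "*" or equals the wildcard value
        if all(f in ("*", q) for f, q in zip(fields, query)):
            score = sum(f != "*" for f in fields)
            if score > best_score:
                best_value, best_score = value, score
    if best_score < 0:
        raise KeyError(f"no pattern matches {'/'.join(query)}")
    return best_value


target_sequences_per_tree = {
    "h5n1/*/all-time": 10000,
    "h5n1/*/2y": 5000,
    "*/*/*": 3000,
}

print(most_specific_match(target_sequences_per_tree, "h5n1", "ha", "2y"))         # 5000
print(most_specific_match(target_sequences_per_tree, "h5n1", "pb2", "all-time"))  # 10000
print(most_specific_match(target_sequences_per_tree, "h7n9", "ha", "all-time"))   # 3000
```

The same function would serve the `min_length` and `traits.columns` examples unchanged, which is the point of using one key format everywhere.
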
Simplifications

This format is flexible enough to express every "special case" in avian-flu. For instance, we no longer have to special-case genome-specific parameters for the cattle-outbreak build (both in the configs & the snakefiles) such as the columns to use for DTA:

# config YAML
- genome_columns: division
- columns: region country
+ columns:
+   "*/genome/*": division
+   "*/*/*": region country

# Snakemake
- columns = get_config(['traits', 'genome_columns'], wildcards) \
-        if wildcards.segment=='genome' \
-        else get_config(['traits', 'columns'], wildcards)
+ columns = resolve_config_value(['traits', 'columns'], wildcards)

Being able to use one syntax everywhere makes it much easier to explain and use config overlays (see #103, which this PR sits on top of).
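
To make the diff above concrete, here is a rough Python sketch of a single lookup that covers both the genome and non-genome cases. The helper `lookup` is a hypothetical stand-in and not the PR's actual `resolve_config_value`. The specificity rule (count of non-`*` fields) and the wildcard values used below are assumptions for illustration, and `SimpleNamespace` stands in for Snakemake's wildcards object.

```python
from types import SimpleNamespace


def lookup(config, path, wildcards):
    """Walk `path` into `config`, then pick the best-matching pattern.

    Hypothetical stand-in for the PR's resolve_config_value. Specificity is
    assumed to be the number of non-"*" fields in the pattern key.
    """
    node = config
    for key in path:
        node = node[key]
    query = (wildcards.subtype, wildcards.segment, wildcards.time)
    matches = [
        (sum(f != "*" for f in pattern.split("/")), value)
        for pattern, value in node.items()
        if all(f in ("*", q) for f, q in zip(pattern.split("/"), query))
    ]
    if not matches:
        raise KeyError(f"no match for {path} with {query}")
    # Highest score wins; max() keeps the first entry on ties
    return max(matches, key=lambda m: m[0])[1]


config = {
    "traits": {
        "columns": {
            "*/genome/*": "division",
            "*/*/*": "region country",
        }
    }
}

# Illustrative wildcard values (assumed, not taken from the PR)
genome = SimpleNamespace(subtype="h5n1-cattle-outbreak", segment="genome", time="default")
ha = SimpleNamespace(subtype="h5n1-cattle-outbreak", segment="ha", time="default")

print(lookup(config, ["traits", "columns"], genome))  # division
print(lookup(config, ["traits", "columns"], ha))      # region country
```

No `if wildcards.segment == 'genome'` branch is needed: the genome case is just a more specific key.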

Refactored config declaration of which builds to run

See these added docs for a summary of the new format. The new structure is both more flexible and more amenable to config overlays: it is a list, not a dict, and will therefore be overwritten by an overlay. This makes it trivial to run a subset of builds without touching the snakefile or main configs, which is a common need for me given that the GISAID workflow produces 48 trees by default.
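
The list-versus-dict point can be illustrated with a small Python sketch of an overlay merge. The function `apply_overlay` is hypothetical, and it assumes the overlay mechanism merges dicts recursively while replacing every other value (lists included) wholesale. The base config below is made up for the example.

```python
import copy


def apply_overlay(base, overlay):
    """Return a new config: dicts merge recursively, everything else is replaced.

    Assumed overlay semantics, for illustration only.
    """
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            # Lists and scalars from the overlay overwrite the base value
            merged[key] = copy.deepcopy(value)
    return merged


# Made-up base config with two build entries
base = {
    "builds": [
        {"subtype": "h5n1", "segment": ["ha", "na"], "time": "2y"},
        {"subtype": "h7n9", "segment": ["ha"], "time": "all-time"},
    ],
    "filter": {"target_sequences_per_tree": {"*/*/*": 3000}},
}
overlay = {
    "builds": [{"subtype": "h5n1", "segment": ["ha"], "time": "2y"}],
    "filter": {"target_sequences_per_tree": {"h5n1/*/*": 1000}},
}

merged = apply_overlay(base, overlay)
print(merged["builds"])  # only the overlay's single build entry remains
print(merged["filter"]["target_sequences_per_tree"])  # both pattern keys are present
```

Under these assumed semantics the overlay's `builds` list fully replaces the base list, while the pattern dict under `filter` gains a key rather than losing its default.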

Misc

I also ended up allowing the workflow to target only certain intermediate files (e.g. metadata & sequences) rather than run the full analysis. @lmoncla this was inspired by seeing your commented-out additions in 1895865, which is something I've commonly done as well!

jameshadfield · 2024-11-22T04:11:36Z

If you'd like to test this (please do!) here's a simple example which covers at a conceptual level most of the changes introduced by this PR and #103:

From within the avian-flu repo, check out this PR:

git pull && git checkout 'james/update-config-syntax'
AVIAN_FLU=$( pwd )

And then in a separate analysis directory (outside of the repo) create a YAML file config.yaml:

filter:
  target_sequences_per_tree: 1000 # Note: this is the same as "*/*/*": 1000

builds:
  - subtype: h5n1
    segment:
      - ha
      - na
    time: 2y

traits:
  confidence:
    "*/ha/*": true # infer confidence for HA builds
    "*/*/*": false # not for everything else (the current default)

Then, within this analysis directory, run

snakemake --cores 2 --snakefile ${AVIAN_FLU}/gisaid/Snakefile -pf

This toy example allows us to easily compare the effect of inferring confidence for HA (LHS) vs not for NA (RHS):

Changes the syntax for a rule parameter to be either a scalar or a dictionary whose keys represent the wildcards of the associated build, with '*' acting as a catch-all for a wildcard. This allows significant simplifications throughout the workflow and config YAMLs, largely because this format lets us express parameters for any wildcard combination without the need to maintain different structures for genome-specific configs, or different structures for parameters which vary per segment vs those which don't. The subsequent commits will make more involved changes to the snakemake pipeline facilitated by this change; this commit represents the minimal set of changes needed to keep functionality.
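
As a minimal sketch of the "scalar or dictionary" rule, a scalar can be treated as shorthand for a single catch-all key, so downstream code only ever deals with the dictionary form. The helper name `as_pattern_dict` is hypothetical and not taken from the workflow.

```python
def as_pattern_dict(value):
    """Normalise a config value to the dictionary form.

    A scalar such as 1000 is treated as {"*/*/*": 1000}; a dict is returned
    unchanged. Hypothetical helper, for illustration only.
    """
    if isinstance(value, dict):
        return value
    return {"*/*/*": value}


print(as_pattern_dict(1000))                 # {'*/*/*': 1000}
print(as_pattern_dict({"*/ha/*": True}))     # {'*/ha/*': True}
```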

The new structure is both more flexible and more amenable to config overlays. Specifying the segments per subtype (or set of subtypes) allows a pipeline to produce different sets of segments (e.g. ha/na for some subtypes, all 8 for others). Encoding as a list makes it much simpler to override via additional configs. For instance, here's a GISAID overlay which just does what you expect:

```yaml
builds:
  - subtype: h5n1
    segment:
      - ha
      - na
    time: 2y
```
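
One way such a `builds` list could be expanded into individual wildcard combinations is sketched below in Python. The function `expand_builds` is hypothetical, and it assumes that each of `subtype`, `segment` and `time` may be given as either a single value or a list.

```python
from itertools import product


def expand_builds(builds):
    """Expand build entries into (subtype, segment, time) tuples.

    Assumption: every field may be a scalar or a list of values.
    """
    def as_list(x):
        return list(x) if isinstance(x, (list, tuple)) else [x]

    combos = []
    for entry in builds:
        combos.extend(
            product(
                as_list(entry["subtype"]),
                as_list(entry["segment"]),
                as_list(entry["time"]),
            )
        )
    return combos


builds = [{"subtype": "h5n1", "segment": ["ha", "na"], "time": "2y"}]
print(expand_builds(builds))  # [('h5n1', 'ha', '2y'), ('h5n1', 'na', '2y')]
```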

A common use case is to want intermediate files (e.g. filtered metadata) and this is a pain when we have many build targets as you have to list these out on the command line manually. The other approach is to modify the snakefile targets within the Snakefile, which is my interpretation of the recent (commented out) additions in 1895865:

```
rule all:
    input:
        auspice_json = collect_builds()
        #sequences = expand("results/{subtype}/{segment}/sequences.fasta", segment=SEGMENTS, subtype=SUBTYPES),
        #metadata = expand("results/{subtype}/metadata.tsv", segment=SEGMENTS, subtype=SUBTYPES)
```

Allowing config overlays to set these is really nice. The above snakefile changes can now be replaced with the following config (even more important when we start thinking about analysis directories separate to the pathogen repo):

```yaml
extends: gisaid.yaml
target_patterns:
  - "results/{subtype}/{segment}/sequences.fasta"
  - "results/{subtype}/metadata.tsv"
```
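
For illustration, here is a rough Python sketch of how `target_patterns` might be turned into concrete file targets. The helper `expand_targets` is hypothetical; it uses plain `str.format` rather than Snakemake's own `expand` so that the example runs on its own, and it assumes the builds have already been reduced to (subtype, segment, time) tuples.

```python
def expand_targets(patterns, builds):
    """Fill each pattern with each build's wildcards.

    Duplicates (e.g. a pattern that ignores {segment}) are dropped while
    preserving first-seen order. Hypothetical helper, for illustration only.
    """
    targets = []
    for pattern in patterns:
        for subtype, segment, time in builds:
            path = pattern.format(subtype=subtype, segment=segment, time=time)
            if path not in targets:
                targets.append(path)
    return targets


patterns = [
    "results/{subtype}/{segment}/sequences.fasta",
    "results/{subtype}/metadata.tsv",
]
builds = [("h5n1", "ha", "2y"), ("h5n1", "na", "2y")]

for target in expand_targets(patterns, builds):
    print(target)
# results/h5n1/ha/sequences.fasta
# results/h5n1/na/sequences.fasta
# results/h5n1/metadata.tsv
```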

It makes more sense to specify this as a filtering parameter. We could continue using a value which can't be changed according to wildcards (e.g. `target_sequences_per_tree: 3000`) however by using the "*/*/*: 3000" syntax we make it clearer that it's possible to make this specific to certain builds. The new syntax makes this trivial to implement using a

jameshadfield mentioned this pull request Nov 25, 2024

wip! workflows as programs prototyping nextstrain/measles#55

jameshadfield force-pushed the james/run-outside-of-repo branch from e8eff3d to ad0f15c on November 27, 2024 23:45

jameshadfield added 4 commits November 28, 2024 16:09

Improve config.export.title structure

3b582b6

Removes the need for any special-casing for genome builds as the new config pattern is flexible enough to handle them

Improve config.traits structure

5aba493

The new syntax allows this to be simpler and avoid the genome-specific top-level keys. This allows a number of downstream simplifications to the snakefiles, as well as increased robustness.

Improve config.refine structure

695bb5f

The new syntax allows this to be simpler and avoid the genome-specific "root_genome" key, as well as simplifying the snakemake pipeline

jameshadfield force-pushed the james/update-config-syntax branch from 4078613 to efc2fe8 on November 28, 2024 03:12

jameshadfield force-pushed the james/run-outside-of-repo branch from ad0f15c to 90b994c on November 28, 2024 03:12

jameshadfield force-pushed the james/update-config-syntax branch from efc2fe8 to 2a373e9 on November 28, 2024 03:35

jameshadfield added 7 commits December 2, 2024 14:16

Improve config.ancestral structure

a4be469

The new syntax allows this to be simpler and avoid the special-case genome-specific keys in config['ancestral']. This also removes all the associated genome-specific code in the Snakefile.

Improve configs for genome reference

35dd531

Removing hardcoded cattle-outbreak specific logic from the snakefiles and the configs.

Shift subtype lookups to config

2a19000

Shifting hardcoded parameters into configs is generally simpler to reason with and opens the door (ever so slightly) to adding novel (wildcard) subtypes to the pipeline via config-only modifications.

WIP add docs to README

fb99903

WIP as
- I removed the quickstart section. Do we want to maintain this?
- I need to check the "other build customisations" section and ensure they still work with the current setup
- this is a prototype, and so the invocation syntax is all subject to change!

jameshadfield force-pushed the james/update-config-syntax branch from 2a373e9 to fb99903 on December 2, 2024 01:18
