Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite config syntax #104

Draft
wants to merge 11 commits into
base: james/run-outside-of-repo
Choose a base branch
from

Conversation

jameshadfield
Copy link
Member

When I implemented the configs in #72 I considered using the config structure that this PR introduces but felt it was too different from the previous format (and other configs in the nextstrain ecosystem). After thinking it over in the intervening months I found the time to implement it here and think it massively simplifies things and makes the configs and snakefile much easier to reason with.

Background

This PR adds two large sections to the README which are good background to the changes here.

This google doc discusses a number of different config formats. Of those detailed in that doc I think this one is the only one that scales to workflows such as avian-flu.

Seasonal flu uses a somewhat related syntax for certain builds where we join wildcards together into a single string, but I don't think there's much else similar about the approaches.

The main idea

Avian-flu builds use a three-tiered wildcard setup of subtype, segment and time (for instance subtype=h5n1, segment=ha, time=2y is responsible for nextstrain.org/avian-flu/h5n1/ha/2y). For each config parameter we represent the values as a dict where the keys are a slash-separated string of the wildcards, with "*" as a match-anything symbol. For a given wildcard combination we search the config parameters and pick the most specific match.

So for a parameter which never changed across any wildcard combination you could use "*" for each wildcard:

filter: 
  target_sequences_per_tree:
    "*/*/*": 3000

And if you wanted to have the h5n1 2y builds to use 5,000 sequences, the h5n1 all-time builds to use 10,000 and all the other builds to use 3000 you could write:

filter: 
  target_sequences_per_tree:
    "h5n1/*/all-time": 10000
    "h5n1/*/2y": 5000 # or 'h5n1/*/*'
    "*/*/*": 3000

This is very powerful when different parameters vary across different wildcards, as they do frequently within avian-flu. (I don't know of any other format which can handle this nicely.) For instance the minimum length varies over segment but not subtype or time, so we can write

filter:
  min_length:
    "*/pb2/*": 2100
    "*/pb1/*": 2100
    "*/pa/*": 2000
    ...

whereas the columns for DTA vary over subtype but not segment or time, and we can use the exact same syntax to convey this very different combination:

traits:
  columns:
    "h5nx/*/*": region
    "h5n1/*/*": region country
    "h7n9/*/*": country division
    "h9n2/*/*": region country

Simplifications

This format is flexible enough to express every "special case" in avian-flu. For instance, we no longer have to special-case genome-specific parameters for the cattle-outbreak build (both in the configs & the snakefiles) such as the columns to use for DTA:

# config YAML
- genome_columns: division
- columns: region country
+ columns:
+   "*/genome/*": division
+   "*/*/*": region country

# Snakemake
- columns = get_config(['traits', 'genome_columns'], wildcards) \
-        if wildcards.segment=='genome' \
-        else get_config(['traits', 'columns'], wildcards)
+ columns = resolve_config_value(['traits', 'columns'], wildcards)

Being able to use one syntax everywhere makes it much easier to explain and use config overlays (see #103, which this PR sits on top of).

Refactored config declaration of which builds to run

See these added docs for a summary of the new format. The new structure is both more flexible and more amenable to config overlays as it is a list not a dict and will therefore be overwritten by an overlay. This makes it trivial to run a build for a subset of builds without touching the snakefile or main configs, which is a common occurrence for me given that the GISAID workflow produces 48 trees by default.

Misc

I also ended up allowing the workflow to only target certain intermediate files (e.g. metadata & sequences) not run the full analysis. @lmoncla this was inspired by seeing your commented out additions in 1895865 which is something I've done commonly as well!

@jameshadfield
Copy link
Member Author

jameshadfield commented Nov 22, 2024

If you'd like to test this (please do!) here's a simple example which covers at a conceptual level most of the changes introduced by this PR and #103:

From within the avian-flu repo checkout this PR:

git pull && git checkout 'james/update-config-syntax'
AVIAN_FLU=$( pwd )

And then in a separate analysis directory (outside of the repo) create a YAML file config.yaml:

filter:
  target_sequences_per_tree: 1000 # Note: this is the same as "*/*/*": 1000

builds:
  - subtype: h5n1
    segment:
      - ha
      - na
    time: 2y

traits:
  confidence:
    "*/ha/*": true # infer confidence for HA builds
    "*/*/*": false # not for everything else (the current default)

Then, within this analysis directory, run

snakemake --cores 2 --snakefile ${AVIAN_FLU}/gisaid/Snakefile -pf

This toy example allows us to easily compare the effect of inferring confidence for HA (LHS) vs not for NA (RHS):

image

Changes the syntax for a rule parameter for to either be a scalar or a
dictionary where they keys represent the wildcards of the associated
build, allowing '*' to represent a catch-all for a wildcard. This allows
significant simplifications throughout the workflow and config YAMLs,
largely because this format lets us express parameters for any wildcard
combination without the need to maintain different structures for
genome-specific configs, or different structures for parameters which
vary per segment vs those which don't.

The subsequent commits will make more involved changes to the snakemake
pipeline facilitated by this change, this commit represents the minimal
set of changes to keep functionality.
Removes the need for any special-casing for genome builds as the new
config pattern is flexible enough to handle them
The new syntax allows this to be simpler and avoid the genome-specific
top-level keys. This allows a number of downstream simplifications
to the snakefiles, as well as increased robustness.
The new syntax allows this to be simpler and avoid the genome-specific
"root_genome" key, as well as simplifying the snakemake pipeline
Snakefile Outdated Show resolved Hide resolved
The new syntax allows this to be simpler and avoid the special-case
genome-specific keys in config['ancestral']. This also removes all
the associated genome-specific code in the Snakefile.
Removing hardcoded cattle-outbreak specific logic from the snakefiles
and the configs.
The new structure is both more flexible and more amenable to config
overlays. Specifying the segments per subtype (or set of subtypes)
allows a pipeline to produce different sets of segments (e.g. ha/na
for some subtypes, all 8 for others). Encoding as a list makes it much
simpler to override via additional configs. For instance, here's a
GISAID overlay which just does what you expect:

```yaml
builds:
  - subtype: h5n1
    segment:
      - ha
      - na
    time: 2y
```
A common use case is to want intermediate files (e.g. filtered metadata)
and this is a pain when we have many build targets as you have to
list these out on the command line manually. The other approach is
to modify the snakefile targets within the Snakefile, which is my
interpretation of the recent (commented out) additions in
1895865:

```
rule all:
    input:
        auspice_json = collect_builds()
        #sequences = expand("results/{subtype}/{segment}/sequences.fasta", segment=SEGMENTS, subtype=SUBTYPES),
        #metadata = expand("results/{subtype}/metadata.tsv", segment=SEGMENTS, subtype=SUBTYPES)
```

Allowing config overlays to set these is really nice. The above
snakefile changes can now be replaced with the following config (even
more important when we start thinking about analysis directories
separate to the pathogen repo):

```yaml
extends: gisaid.yaml
target_patterns:
  - "results/{subtype}/{segment}/sequences.fasta"
  - "results/{subtype}/metadata.tsv"
```
Shifting hardcoded parameters into configs is generally simpler to
reason with and opens the door (ever so slightly) to adding novel
(wildcard) subtypes to the pipeline via config-only modifications.
It makes more sense to specify this as a filtering parameter. We could
continue using a value which can't be changed according to wildcards
(e.g. `target_sequences_per_tree: 3000`) however by using the
"*/*/*: 3000" syntax we make it clearer that it's possible to make
this specific to certain builds.

The new syntax makes this trivial to implement using a
WIP as
- I removed the quickstart section. Do we want to maintain this?
- I need to check the "other build customisations" section and ensure
  they still work with the current setup
- this is a prototype, and so the invocation syntax is all subject to
  change!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant