Rewrite config syntax #104
base: james/run-outside-of-repo
If you'd like to test this (please do!), here's a simple example which covers, at a conceptual level, most of the changes introduced by this PR and #103.

From within the avian-flu repo, check out this PR:

    git pull && git checkout 'james/update-config-syntax'
    AVIAN_FLU=$( pwd )

And then in a separate analysis directory (outside of the repo) create a YAML file:

    filter:
      target_sequences_per_tree: 1000 # Note: this is the same as "*/*/*": 1000
    builds:
      - subtype: h5n1
        segment:
          - ha
          - na
        time: 2y
    traits:
      confidence:
        "*/ha/*": true # infer confidence for HA builds
        "*/*/*": false # not for everything else (the current default)

Then, within this analysis directory, run

    snakemake --cores 2 --snakefile ${AVIAN_FLU}/gisaid/Snakefile -pf

This toy example allows us to easily compare the effect of inferring confidence for HA (LHS) vs not for NA (RHS).
Changes the syntax for a rule parameter to be either a scalar or a dictionary where the keys represent the wildcards of the associated build, allowing '*' to represent a catch-all for a wildcard. This allows significant simplifications throughout the workflow and config YAMLs, largely because this format lets us express parameters for any wildcard combination without the need to maintain different structures for genome-specific configs, or different structures for parameters which vary per segment vs those which don't. The subsequent commits will make more involved changes to the snakemake pipeline facilitated by this change; this commit represents the minimal set of changes needed to keep functionality.
Removes the need for any special-casing for genome builds as the new config pattern is flexible enough to handle them
The new syntax allows this to be simpler and avoid the genome-specific top-level keys. This allows a number of downstream simplifications to the snakefiles, as well as increased robustness.
The new syntax allows this to be simpler and avoid the genome-specific "root_genome" key, as well as simplifying the snakemake pipeline
The new syntax allows this to be simpler and avoid the special-case genome-specific keys in config['ancestral']. This also removes all the associated genome-specific code in the Snakefile.
Removing hardcoded cattle-outbreak specific logic from the snakefiles and the configs.
The new structure is both more flexible and more amenable to config overlays. Specifying the segments per subtype (or set of subtypes) allows a pipeline to produce different sets of segments (e.g. ha/na for some subtypes, all 8 for others). Encoding as a list makes it much simpler to override via additional configs. For instance, here's a GISAID overlay which just does what you expect:

```yaml
builds:
  - subtype: h5n1
    segment:
      - ha
      - na
    time: 2y
```
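As a rough illustration of how a `builds` list like the one above might be turned into individual build targets, here is a minimal Python sketch. The helper names `as_list` and `expand_builds` are hypothetical, and the sketch assumes that each of `subtype`, `segment` and `time` may be given as either a single value or a list; the actual workflow may differ.

```python
from itertools import product


def as_list(value):
    """Wrap a scalar in a list so scalars and lists are handled uniformly."""
    return value if isinstance(value, list) else [value]


def expand_builds(builds):
    """Expand each entry into (subtype, segment, time) tuples.

    ASSUMPTION: every field accepts a scalar or a list; duplicates produced
    by overlapping entries are dropped while preserving order.
    """
    combos = []
    for entry in builds:
        fields = (entry["subtype"], entry["segment"], entry["time"])
        for combo in product(*(as_list(f) for f in fields)):
            if combo not in combos:
                combos.append(combo)
    return combos


# The overlay from the commit message, as it would look once the YAML is loaded.
builds = [{"subtype": "h5n1", "segment": ["ha", "na"], "time": "2y"}]
print(expand_builds(builds))
```

Because each entry carries its own segment list, a second entry could request all eight segments for a different subtype without affecting the first.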
A common use case is to want intermediate files (e.g. filtered metadata) and this is a pain when we have many build targets as you have to list these out on the command line manually. The other approach is to modify the snakefile targets within the Snakefile, which is my interpretation of the recent (commented out) additions in 1895865:

```
rule all:
    input:
        auspice_json = collect_builds()
        #sequences = expand("results/{subtype}/{segment}/sequences.fasta", segment=SEGMENTS, subtype=SUBTYPES),
        #metadata = expand("results/{subtype}/metadata.tsv", segment=SEGMENTS, subtype=SUBTYPES)
```

Allowing config overlays to set these is really nice. The above snakefile changes can now be replaced with the following config (even more important when we start thinking about analysis directories separate to the pathogen repo):

```yaml
extends: gisaid.yaml
target_patterns:
  - "results/{subtype}/{segment}/sequences.fasta"
  - "results/{subtype}/metadata.tsv"
```
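To make the idea concrete, here is a small sketch of how `target_patterns` could be expanded into file paths for a set of builds. The function `expand_targets` is hypothetical and uses plain `str.format` rather than Snakemake's own `expand`, purely to keep the example self-contained; it is not taken from the workflow.

```python
def expand_targets(patterns, builds):
    """Fill each pattern with the wildcards of every build.

    `builds` is a list of (subtype, segment, time) tuples. Patterns which
    ignore a wildcard (e.g. no {segment}) yield the same path for several
    builds, so duplicates are dropped while preserving order.
    """
    targets = []
    for pattern in patterns:
        for subtype, segment, time in builds:
            path = pattern.format(subtype=subtype, segment=segment, time=time)
            if path not in targets:
                targets.append(path)
    return targets


patterns = [
    "results/{subtype}/{segment}/sequences.fasta",
    "results/{subtype}/metadata.tsv",
]
builds = [("h5n1", "ha", "2y"), ("h5n1", "na", "2y")]
for target in expand_targets(patterns, builds):
    print(target)
```

For the two builds above this produces one sequences file per segment but only a single metadata file, since the metadata pattern does not depend on `segment`.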
Shifting hardcoded parameters into configs is generally simpler to reason about and opens the door (ever so slightly) to adding novel (wildcard) subtypes to the pipeline via config-only modifications.
It makes more sense to specify this as a filtering parameter. We could continue using a value which can't be changed according to wildcards (e.g. `target_sequences_per_tree: 3000`); however, by using the `"*/*/*": 3000` syntax we make it clearer that it's possible to make this specific to certain builds. The new syntax makes this trivial to implement using a
WIP as:
- I removed the quickstart section. Do we want to maintain this?
- I need to check the "other build customisations" section and ensure they still work with the current setup
- this is a prototype, and so the invocation syntax is all subject to change!
When I implemented the configs in #72 I considered using the config structure that this PR introduces but felt it was too different from the previous format (and other configs in the nextstrain ecosystem). After thinking it over in the intervening months I found the time to implement it here, and I think it massively simplifies things and makes the configs and snakefile much easier to reason about.
Background
This PR adds two large sections to the README which are good background to the changes here.
This google doc discusses a number of different config formats. Of those detailed in that doc I think this one is the only one that scales to workflows such as avian-flu.
Seasonal flu uses a somewhat related syntax for certain builds where we join wildcards together into a single string, but I don't think there's much else similar about the approaches.
The main idea
Avian-flu builds use a three-tiered wildcard setup of `subtype`, `segment` and `time` (for instance `subtype=h5n1`, `segment=ha`, `time=2y` is responsible for nextstrain.org/avian-flu/h5n1/ha/2y). For each config parameter we represent the values as a dict where the keys are a slash-separated string of the wildcards, with "*" as a match-anything symbol. For a given wildcard combination we search the config parameters and pick the most specific match.

So for a parameter which never changed across any wildcard combination you could use "*" for each wildcard:
And if you wanted the h5n1 2y builds to use 5,000 sequences, the h5n1 all-time builds to use 10,000 and all the other builds to use 3,000 you could write:
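The matching described above can be sketched in a few lines of Python. This is an illustration rather than the workflow's actual code: the function name `resolve` is hypothetical, "most specific" is assumed to mean the matching key with the fewest "*" fields, the key `h5n1/*/all-time` assumes the time wildcard is spelled `all-time`, and `h5nx` is used only as an example of some other subtype.

```python
def resolve(param, subtype, segment, time):
    """Return the value of a config parameter for one build.

    `param` is either a scalar (applies to every build) or a dict whose keys
    are "subtype/segment/time" patterns, with "*" matching anything.
    ASSUMPTION: the most specific match is the one with the fewest "*"
    fields; ties go to whichever key appears first.
    """
    if not isinstance(param, dict):
        return param
    wildcards = (subtype, segment, time)
    best_key, best_score = None, -1
    for key in param:
        fields = key.split("/")
        if len(fields) != len(wildcards):
            raise ValueError(f"expected 3 slash-separated fields, got {key!r}")
        if all(f in ("*", w) for f, w in zip(fields, wildcards)):
            score = sum(f != "*" for f in fields)  # count of exact fields
            if score > best_score:
                best_key, best_score = key, score
    if best_key is None:
        raise KeyError(f"no pattern matches {'/'.join(wildcards)}")
    return param[best_key]


# A parameter which never changes across builds.
same_everywhere = {"*/*/*": 3000}

# A parameter which depends on subtype and time, with a catch-all default.
varies = {"h5n1/*/2y": 5000, "h5n1/*/all-time": 10000, "*/*/*": 3000}

print(resolve(same_everywhere, "h5n1", "ha", "2y"))
print(resolve(varies, "h5n1", "ha", "2y"))
print(resolve(varies, "h5n1", "na", "all-time"))
print(resolve(varies, "h5nx", "ha", "2y"))
```

A design question the sketch leaves open is what to do when two equally specific patterns both match (e.g. `h5n1/*/*` and `*/ha/*`); here the first one listed wins, but an implementation could just as reasonably raise an error.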
This is very powerful when different parameters vary across different wildcards, as they do frequently within avian-flu. (I don't know of any other format which can handle this nicely.) For instance the minimum length varies over segment but not subtype or time, so we can write
whereas the columns for DTA vary over subtype but not segment or time, and we can use the exact same syntax to convey this very different combination:
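The two cases above might look like the following once loaded into Python. Everything concrete here is a placeholder: the parameter names `min_length` and `dta_columns`, the lengths, and the column names are invented for illustration, and the compact `resolve` helper again assumes that the key with the fewest "*" fields wins.

```python
def resolve(param, subtype, segment, time):
    """Pick the value whose key matches the build with the fewest "*" fields."""
    if not isinstance(param, dict):
        return param
    build = (subtype, segment, time)
    matches = [
        key for key in param
        if all(f in ("*", w) for f, w in zip(key.split("/"), build))
    ]
    # ASSUMPTION: specificity is the number of non-"*" fields in the key.
    best = max(matches, key=lambda k: sum(f != "*" for f in k.split("/")))
    return param[best]


config = {
    # Varies over segment only (placeholder lengths).
    "min_length": {"*/ha/*": 1600, "*/na/*": 1270, "*/*/*": 1000},
    # Varies over subtype only (placeholder column names).
    "dta_columns": {"h5n1/*/*": ["region", "country"], "*/*/*": ["region"]},
}

print(resolve(config["min_length"], "h5n1", "ha", "2y"))
print(resolve(config["dta_columns"], "h5n1", "pb2", "all-time"))
```

Both parameters go through the same lookup even though one varies by segment and the other by subtype, which is the point of using a single key format.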
Simplifications
This format is flexible enough to express every "special case" in avian-flu. For instance, we no longer have to special-case genome-specific parameters for the cattle-outbreak build (both in the configs & the snakefiles) such as the columns to use for DTA:
Being able to use one syntax everywhere makes it much easier to explain and use config overlays (see #103, which this PR sits on top of).
Refactored config declaration of which builds to run
See these added docs for a summary of the new format. The new structure is both more flexible and more amenable to config overlays: because it is a list, not a dict, an overlay overwrites it rather than merging into it. This makes it trivial to run the workflow for a subset of builds without touching the snakefile or main configs, which is a common occurrence for me given that the GISAID workflow produces 48 trees by default.
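The difference between lists and dicts under an overlay can be illustrated with a small merge function. This is a sketch under the assumption that overlays merge dicts key by key while replacing any other value (including lists) wholesale; `merge_overlay` is a hypothetical name, the base config contents are invented, and the real overlay logic from #103 may differ.

```python
def merge_overlay(base, overlay):
    """Return a new config where `overlay` is applied on top of `base`.

    ASSUMPTION: dicts are merged recursively; lists and scalars in the
    overlay replace the base value entirely.
    """
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overlay(merged[key], value)
        else:
            merged[key] = value
    return merged


# Invented base config, standing in for the main GISAID config.
base = {
    "builds": [
        {"subtype": ["h5nx", "h5n1"], "segment": ["pb2", "ha", "na"], "time": ["2y", "all-time"]},
    ],
    "filter": {"target_sequences_per_tree": {"*/*/*": 3000}},
}
overlay = {
    "builds": [{"subtype": "h5n1", "segment": ["ha", "na"], "time": "2y"}],
    "filter": {"target_sequences_per_tree": {"h5n1/*/2y": 1000}},
}

merged = merge_overlay(base, overlay)
print(merged["builds"])
print(merged["filter"]["target_sequences_per_tree"])
```

After merging, `builds` contains only the overlay's single entry, whereas the filter parameter keeps its catch-all default and gains the more specific `h5n1/*/2y` key.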
Misc
I also ended up allowing the workflow to target only certain intermediate files (e.g. metadata & sequences) rather than run the full analysis. @lmoncla this was inspired by seeing your commented out additions in 1895865, which is something I've commonly done as well!