From fb999036fa3307d2cf7e38114bfbac973d583e46 Mon Sep 17 00:00:00 2001
From: james hadfield <hadfield.james@gmail.com>
Date: Fri, 22 Nov 2024 15:36:46 +1300
Subject: [PATCH] WIP add docs to README

WIP as
- I removed the quickstart section. Do we want to maintain this?
- I need to check the "other build customisations" section and ensure
  they still work with the current setup
- this is a prototype, and so the invocation syntax is all subject to
  change!
---
 README.md | 136 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 126 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 4ec2f54..be7b1d2 100755
--- a/README.md
+++ b/README.md
@@ -6,10 +6,20 @@ Please see [nextstrain.org/docs](https://nextstrain.org/docs) for details about
 
 The Snakemake pipeline is parameterised by two config files, one for the A/H5N1, A/H5NX, A/H7N9, and A/H9N2 builds and one for the 2024 A/H5N1 cattle-flu outbreak.
 
+### Table of Contents
 
-## Segment-level GISAID builds
+* [Running segment-level GISAID builds](#running-segment-level-gisaid-builds)
+* [Running H5N1 Cattle Outbreak (2024) builds](#running-h5n1-cattle-outbreak-2024-builds)
+* [Creating a custom build via config overlays](#creating-a-custom-build-via-config-overlays)
+* [Running builds in a separate working directory](#running-builds-in-a-separate-working-directory)
+* [Cleavage Site Annotations](#cleavage-side-annotations)
+* [H5 Clade Labeling](#h5-clade-labeling)
+* [Other build customisations](#other-build-customisations)
 
-The `config/gisaid.yaml` config builds 32 Auspice datasets (8 segments x 4 subtypes (A/H5N1, A/H5NX, A/H7N9, A/H9N2)) using GISAID data and can be run via
+
+## Running segment-level GISAID builds
+
+The `config/gisaid.yaml` config builds 48 Auspice datasets (8 segments x 4 subtypes (A/H5N1, A/H5NX, A/H7N9, A/H9N2) x 1-2 time resolutions (all-time, 2y)) using GISAID data and can be run via
 
 ```bash
 snakemake --cores 1 -pf --configfile config/gisaid.yaml
@@ -34,7 +44,7 @@ to nextstrain.org by running:
 nextstrain build . --configfile config/gisaid.yaml -f deploy_all
 ```
 
-## H5N1 Cattle Outbreak (2024)
+## Running H5N1 Cattle Outbreak (2024) builds
 
 We produce per-segment and whole-genome (concatenated segments) builds for the ongoing H5N1 cattle-flu outbreak.
 These use NCBI data including consensus genomes and SRA data assembled via the Andersen lab's [avian-influenza repo](https://github.com/andersen-lab/avian-influenza).
@@ -64,15 +74,119 @@ This should allow any reassortments to be highlighted and will also include outb
 > Note that generating any segment-level build here will necessarily build the genome tree, as it's needed to identify the clade of interest in each segment.
 
 
-## Creating a custom build
-The easiest way to generate your own, custom avian-flu build is to use the quickstart-build as a starting template. Simply clone the quickstart-build, run with the example data, and edit the Snakefile to customize. This build includes example data and a simplified, heavily annotated Snakefile that goes over the structure of Snakefiles and annotates rules and inputs/outputs that can be modified. This build, with it's own readme, is available [here](https://github.com/nextstrain/avian-flu/tree/master/quickstart-build).
+## Creating a custom build via config overlays
+
+The two config files introduced above are designed to be extended (customised) by overlaying your own YAML configurations.
+The aim is to allow easy customisation of the workflow outputs (e.g. which subtypes / segments / time-frames to run) and the stopping points (Auspice JSONs or particular intermediate files) via such overlays.
+Additionally, modification of parameters (e.g. clock rates, minimum lengths, subsampling criteria, DTA metadata) is possible through overlays without needing to modify the underlying base config or Snakemake pipeline itself.
+
+Config overlays allow you to essentially maintain one or more modifications of the workflow for your own purposes.
+For instance you may want a way to easily run only H5N1 HA+NA 2y builds, and a config overlay can achieve this.
+Using an overlay keeps this change separate from the config used for Nextstrain automation and also avoids `git` conflicts emerging over time.
+When combined with running in a separate working directory (explained below) this becomes even more powerful.
+
+As an example we'll create a custom config `config_local.yaml` which you can then add to the build command described above, e.g. `nextstrain build . --configfile config/gisaid.yaml config_local.yaml`.
+We'll use the GISAID config as the example we're extending, however the concepts are the similar for the H5N1 cattle outbreak config.
+
+
+### Restrict which builds we want to produce
+
+By default we produce 48 Auspice JSONs (4 subtypes * 8 segments * 1-2 time resolutions). We can restrict these by redefining the builds in our config overlay. For instance the following will produce only 2 datasets, `h5n1/ha/2y` and `h5n1/na/2y`:
+```yaml
+builds:
+  - subtype: h5n1
+    segment:
+      - ha
+      - na
+    time: 2y
+```
+Here `builds` is an array of sub-configs, each of which define a combination of subtype, segment and time parameters. Each subtype, segment and time can be a single string (as subtype and time are, above) or an array (as segment is, above).
+
+### Only produce certain intermediate files, not Auspice datasets
+
+By default the "target" for each build is the Auspice JSON (`auspice/avian-flu_h5n1_ha_2y.json` etc) however we can change this if we just want certain intermediate files. Adding the following to the config overlay will stop once we've filtered to the metadata & sequences we would use for each tree
+
+```yaml
+target_patterns:
+  - "results/{subtype}/{segment}/{time}/metadata.tsv"
+  - "results/{subtype}/{segment}/{time}/sequences.tsv"
+```
+
+(This works in combination with the custom `builds` definition, above).
+
+### Change the target number of sequences per build
+
+In the `config/gisaid.yaml` this parameter is defined as
+
+```yaml
+filter: 
+  target_sequences_per_tree:
+    "*/*/*": 3000
+```
+
+Where the `"*/*/*"` syntax is slash-separated matchers for subtype, segment and time-resolution, and the `*` character means "match everything".
+So here we're saying "for every subtype, for every segment, for every time-resolution target 3000 sequences".
+
+If we want to change our h5n1 builds to instead have 5000 sequences (whilst keeping the rest at 3000) we could add the following to our config overlay:
+```yaml
+filter: 
+  target_sequences_per_tree:
+    "h5n1/*/*": 5000
+```
+And since `"h5n1/*/*"` is more specific than `"*/*/*"` it'll take precedence when `subtype="h5n1"`.
+(Internally, Snakemake merges these configs together resulting in `'target_sequences_per_tree': {'h5n1/*/*': 5000, '*/*/*': 3000}`. For each combination of subtype/segment/time values within the workflow we consult these dictionaries and pick the most specific match.)
 
-## Features unique to avian flu builds
 
-#### cleavage site annotations
+This syntax is concise but powerful, for instance we can parameterise the builds like so:
+```yaml
+filter: 
+  target_sequences_per_tree:
+    "*/*/all-time": 5000 # target 5k sequences for the all-time builds
+    "h5n1/*/all-time": 10000 # but for h5n1 all-time builds target 10k sequences (this is more specific as it only includes one '*' character)
+    "*/*/*": 1000 # target 1k sequences for any other builds (i.e. the 2y builds)
+```
+
+
+
+### Change other parameters
+
+The pattern introduced in the preceeding section generally applies for all parameters in the workflow.
+By reading through the base config YAML (e.g. `config/gisaid.yaml`) you should be able to learn the config structure and add then modify that within your overlay config as needed.
+
+If the parameter is not exposed via the config YAML and you find yourself modifying the underlying Snakefile consider exposing it via the config so that this customisation then becomes available to config overlays.
+
+
+## Running builds in a separate working directory
+
+> Note: This section doesn't yet work using AWS / docker runtimes but support is forthcoming.
+
+We can run the worfklow from an analysis directory separate to this repo. Imagine the following directory structure:
+
+```console
+├── avian-flu
+│   ├── README.md
+│   ├── Snakefile
+│   └── ... (lots more files here)
+└── analysis_dir
+    └── config.yaml
+```
+
+Where `config.yaml` is the your config overlay, introduced in the previous section as `config_local.yaml`.
+
+> TODO XXX following uses a different invocation - this is all in flux
+
+From within the `analysis_dir` we can run `snakemake --snakefile ../avian-flu/gisaid/Snakefile --cores 1` and we'll automatically run the workflow using your `config.yaml` config overlay.
+This keeps the workflow outputs separate and isolated from the workflow itself.
+Depending on how you run builds this can be very liberating; for instance if you are switching between versions of the workflow you can maintain different analysis directories for each version, or if you are running the workflow at different times you can separate the analyses for easy before/after comparisons.
+
+> You don't have to name it `config.yaml`, but if you use a different filename you'll have to specify it via `--configfile <filename>`.
+
+
+## Cleavage Site Annotations
+
 Influenza virus HA is translated as a single peptide (HA0) that is cleaved to form the mature, functional form (HA1 and HA2). In all human strains and many avian strains, the cleavage site is composed of a single, basic amino acid residue. However, some avian influenza subtypes, particularly H5s, have acquired additional basic residues immediately preceding the HA cleavage site. In some cases, this results in addition of a furin cleavage motif, allowing HA to be cleaved by furin, which is ubiquitously expressed, and allows for viral replication across a range of tissues. The addition of this "polybasic cleavage site" is one of the prime determinants of avian influenza virulence. In these builds, we have annotated whether strains contain a furin cleavage motif, defined here as the sequence `R-X-K/R-R` immediately preceding the start of HA2, where `X` can be any amino acid. We have also added a color by for the cleavage site sequence, which we define here as the 4 bases preceding HA2.
 
-#### clade labeling
+## H5 Clade Labeling
 H5 viruses are classified into clades, which are currently viewable as a color by on [nextstrain.org](https://nextstrain.org/avian-flu/h5n1/ha?c=h5_label_clade). Because clade annotations are not available in all public databases, we annotate sequences with their most likely clade using a tool developed by Samuel S. Shepard at CDC called [LABEL](https://wonder.cdc.gov/amd/flu/label/). The assigned clade for each H5N1 or H5Nx sequence are available as public tsvs [here](https://github.com/nextstrain/avian-flu/tree/master/clade-labeling).
 
 To update the `clades.tsv` file with clade annotations for new sequences, run:
@@ -87,7 +201,9 @@ To string these together and update the `clades.tsv` file for new sequences and
 
 `snakemake -s Snakefile.clades -p --cores 1 && snakemake -p --cores 1`
 
-#### Using the same strains for all segments
+## Other build customisations
+
+### Using the same strains for all segments
 
 By default we subsample data for each segment independently.
 Alternatively, you can ask the pipeline to use the same strains for each segment.
@@ -106,7 +222,7 @@ If you are using `nextstrain build` then add that to the end of the command (i.e
 
 Note that you may need to remove any existing data in `results/` in order for snakemake to correctly regenerate the intermediate files.
 
-#### Using locally ingested data (instead of downloading from S3)
+### Using locally ingested data (instead of downloading from S3)
 
 Run the pipeline with `--config 'local_ingest=True'` to use the locally available files produced by the ingest pipeline (see `./ingest/README.md` for details on how to run).
 Specifically, the files needed are `ingest/results/metadata.tsv` and `ingest/results/sequences_{SEGMENT}.fasta`.