Development branch for the next release (#29)

This is a PR for the next version of FIRE that will be more robust to denovo assemblies. Some feats: Save the Fiber-seq data in cram format Replace ucsc utilities with bigtools Replace some bedtools opts with gia Switch to polars for coverage calculation Rename config options to uppercase variables fix #28 incorrect counts in a plot fix #26 so that fire appears to be deterministic now; however, this is not yet guaranteed.
fiberseq · Jul 31, 2024 · 99d40f1 · 99d40f1
1 parent 0810fc8
commit 99d40f1
Show file tree

Hide file tree

Showing 22 changed files with 425 additions and 230 deletions.
diff --git a/INSTALL.md b/INSTALL.md
@@ -0,0 +1,18 @@
+# Install
+
+You will need **snakemake** which you can install using conda/mamba, e.g.:
+```
+mamba create -c conda-forge -c bioconda -n snakemake 'snakemake>=8.4'
+```
+
+Finally, if you wish to distribute jobs across a cluster you will need to install the appropriate [snakemake executor plugin](https://snakemake.github.io/snakemake-plugin-catalog/). For example, to use SLURM you can install the `snakemake-executor-slurm` plugin using pip:
+```  
+pip install snakemake-executor-plugin-slurm
+```
+
+We recommend adding a snakemake conda prefix to your `bashrc`, e.g. in the Stergachis lab add:
+```bash
+export SNAKEMAKE_CONDA_PREFIX=/mmfs1/gscratch/stergachislab/snakemake-conda-envs
+export APPTAINER_CACHEDIR=/mmfs1/gscratch/stergachislab/snakemake-conda-envs/apptainer-cache
+```
+Then snakemake installs all the additional requirements as conda envs in that directory.
diff --git a/README.md b/README.md
@@ -6,25 +6,7 @@ A Snakemake workflow for calling Fiber-seq Inferred Regulatory Elements (FIREs)
 
 ## Install
 
-You will need **snakemake** and all the **UCSC Kent utilities** and the **latest version** of them (v455).
-
-You can install snakemake using conda/mamba, e.g.:
-```
-mamba create -c conda-forge -c bioconda -n snakemake 'snakemake>=8.4'
-```
-
-You can find the UCSC kent utilities at [this url](http://hgdownload.soe.ucsc.edu/admin/exe/). You will need to add the directory containing the utilities to your `PATH` environment variable.
-
-Finally, if you wish to distribute jobs across a cluster you will need to install the appropriate [snakemake executor plugin](https://snakemake.github.io/snakemake-plugin-catalog/). For example, to use SLURM you can install the `snakemake-executor-slurm` plugin using pip:
-```  
-pip install snakemake-executor-plugin-slurm
-```
-
-We recommend adding a snakemake conda prefix to your `bashrc`, e.g. in the Stergachis lab add:
-```bash
-export SNAKEMAKE_CONDA_PREFIX=/mmfs1/gscratch/stergachislab/snakemake-conda-envs
-```
-Then snakemake installs all the additional requirements as conda envs in that directory.
+Please install `snakemake` and all the UCSC Kent utilities. For detailed instructions see the [installation README](/INSTALL.md).
 
 ## Configuring
 

diff --git a/_config.yml b/_config.yml
@@ -0,0 +1,20 @@
+#theme: jekyll-theme-minimal
+#theme: minima
+remote_theme: jekyll/minima
+#logo: /assets/img/fiber_tools_grey.png
+
+minima:
+  skin: dark
+
+defaults:
+  - scope:
+      path: "README.md"
+    values:
+      permalink: "index.html"
+
+exclude:
+  - target
+  - models
+  - dists
+  - tmp.*
+  - temp
diff --git a/config/README.md b/config/README.md
@@ -10,7 +10,7 @@ Reference `fasta` file:
 ```
 ref: /path/to/hg38.fa
 ```
-Manifest of input sample(s), must have two white-space separated columns: sample name (`sample`) and input bam file path (`bam`). See `config.tbl` for an example.
+Manifest of input sample(s), must have two white-space separated columns: sample name (`sample`) and input bam file path (`bam`). See `config.tbl` for an example. The `bam` file must be indexed and aligned to the reference genome in the `ref` option.
 ```
 manifest: config/config.tbl
 ```

diff --git a/docs/README.md b/docs/README.md
@@ -102,9 +102,7 @@ The peak results are reported in the file `FDR-peaks/FDR-FIRE-peaks.bed.gz` with
 |               | {sample}.{median, maximum,minimum}.coverage.txt | Allowed coverage range for analysis                                     |
 | fiber-calls   |                                                 |                                                                         |
 |               | FIRE.bed.gz                                     | Every FIRE element in every fiber                                       |
-|               | model.results.bed.gz                            | Every FIRE, linker, and nucleosome in every fiber                       |
 |               | fire-fibers.bed.gz                              | bed12 start and end of every fiber                                      |
-|               | fire-fiber-decorators.bed.gz                    | decorator file that adds annotations of FIRE elements to the bed12 file |
 | FDR-peaks     |                                                 |                                                                         |
 |               | FDR-FIRE-peaks.bed.gz                           | Fiber-seq peak calls                                                    |
 |               | FDR-wide-peaks.bed.gz                           | Fiber-seq wide peak calls                                               |

diff --git a/fire b/fire
@@ -30,7 +30,7 @@ if [[ "${has_config}" == false ]]; then
 fi
 
 # check for executables
-for x in snakemake bedToBigBed bedGraphToBigWig; do
+for x in snakemake; do
     if [[ ! $(type -P "${x}") ]]; then
         echo "Error: ${x} not found in PATH, but it is required for FIRE."
         exit 1

diff --git a/workflow/Snakefile b/workflow/Snakefile
@@ -4,60 +4,64 @@ import sys
 import os
 from snakemake.utils import min_version
 
-min_version("8.4.0")
+min_version("8.12.0")
+
+# forces pandas pysam etc to be available in the environment
+conda: "envs/runner.yaml"
 
 # Force users to use the same underlying OS via singularity.
 # container: "docker://condaforge/mambaforge:23.3.1-1"
 # container: "docker://continuumio/miniconda3"
 
 
-ref = config["ref"]
-ref_name = config["ref_name"]
+# setup shared functions
+include: "rules/common.smk"
 
-dhs = config.get("dhs", workflow.source_path("annotations/GM12878_DHS.bed.gz"))
-excludes = config.get("excludes", [])
-if ref_name == "hg38" or ref_name == "GRCh38":
-    files = [
-        "annotations/hg38.blacklist.ENCFF356LFX.bed.gz",
-        "annotations/hg38.gap.bed.gz",
-        "annotations/SDs.merged.hg38.bed.gz",
-    ]
-    excludes += [workflow.source_path(file) for file in files]
 
-max_t = config.get("max_t", 4)
-keep_chrs = config.get("keep_chromosomes", ".*")
+# reference genome and reference regions
+REF = get_ref()
+FAI = get_fai()
+REF_NAME = config["ref_name"]
+EXCLUDES = get_excludes()
+KEEP_CHRS = config.get("keep_chromosomes", ".*")
 
 # coverage requirements
-min_coverage = config.get("min_coverage", 4)
-coverage_within_n_sd = config.get("coverage_within_n_sd", 5)
+MIN_COVERAGE = config.get("min_coverage", 4)
+COVERAGE_WITHIN_N_SD = config.get("coverage_within_n_sd", 5)
 
 # sample, haplotype, and chromosome wildcard building
-fai = pd.read_csv(f"{ref}.fai", sep="\t", names=["chr", "length", "x", "y", "z"])
-default_env = config.get("env", "../envs/env.yaml")
-data = pd.read_csv(config["manifest"], sep=r"\s+", comment="#").set_index("sample")
-haps = ["all", "hap1", "hap2", "unk"]
-not_all = ["hap1", "hap2", "unk"]
-not_unk = ["all", "hap1", "hap2"]
-min_fire_fdr = config.get("min_fire_fdr", 0.10)
+FAI_DF = get_fai_df()
+DEFAULT_ENV = config.get("env", "../envs/env.yaml")
+MANIFEST = get_manifest()
+MIN_FIRE_FDR = config.get("min_fire_fdr", 0.10)
 
 # FDR / peak calling thresholds
-max_peak_fdr = config.get("max_peak_fdr", 0.05)
-min_per_acc_peak = config.get("min_per_acc_peak", None)
-if min_per_acc_peak is None:
-    min_per_acc_peak = 0.0
+MAX_PEAK_FDR = config.get("max_peak_fdr", 0.05)
+MIN_PER_ACC_PEAK = config.get("min_per_acc_peak", None)
+if MIN_PER_ACC_PEAK is None:
+    MIN_PER_ACC_PEAK = 0.0
 else:
-    max_peak_fdr = 1.0
-
+    MAX_PEAK_FDR = 1.0
+MIN_FRAC_ACCESSIBLE = config.get("min_frac_accessible", 0)
 
+# Misc sets of wildcards
+haps = ["all", "hap1", "hap2", "unk"]
+not_unk = ["all", "hap1", "hap2"]
 types = ["fdr", "acc", "link", "nuc"]
 types_to_col = {"fdr": 4, "acc": 5, "link": 6, "nuc": 7}
-
 bw_types = ["log_FDR"]  # "score", "FDR",
 bw_types = bw_types + [f"{t}_H1" for t in bw_types] + [f"{t}_H2" for t in bw_types]
 el_types = ["fire", "linker", "nucleosome"]
 
+# developer options
+FT_EXE = config.get("ft_exe", "ft")
+ONT = config.get("ont", False)
+USE_ONT = ""
+if ONT:
+    USE_ONT = " --ont --ml 225 "
+MIN_UNRELIABLE_COVERAGE_LEN = config.get("min_unreliable_coverage_len", 50)
+
 
-include: "rules/common.smk"
 include: "rules/apply-model.smk"
 include: "rules/coverages.smk"
 include: "rules/FDR-peaks.smk"
@@ -66,13 +70,10 @@ include: "rules/decorated-reads.smk"
 include: "rules/track-hub.smk"
 
 
-chroms = get_chroms()
-
-
 wildcard_constraints:
-    chrom="|".join(chroms),
+    chrom="|".join(get_chroms()),
     call="|".join(["msp", "m6a"]),
-    sm="|".join(data.index),
+    sm="|".join(MANIFEST.index),
     types="|".join(types),
     fdr=r"\d+",
     hp="|".join(haps),
@@ -83,39 +84,41 @@ wildcard_constraints:
 localrules:
     trackhub,
 
+
 rule all:
     input:
         # coverage information
-        expand(rules.genome_bedgraph.output, sm=data.index),
-        expand(rules.coverage.output, sm=data.index),
-        expand(rules.exclude_from_shuffle.output, sm=data.index),
-        expand(rules.unreliable_coverage_regions.output, sm=data.index),
+        expand(rules.genome_bedgraph.output, sm=MANIFEST.index),
+        expand(rules.coverage.output, sm=MANIFEST.index),
+        expand(rules.exclude_from_shuffle.output, sm=MANIFEST.index),
+        expand(rules.unreliable_coverage_regions.output, sm=MANIFEST.index),
         # model results
-        expand(rules.fire_sites.output, sm=data.index),
+        expand(rules.fire_sites.output, sm=MANIFEST.index),
         # fiber locations
-        expand(rules.fiber_locations.output, sm=data.index),
-        expand(rules.filtered_and_shuffled_fiber_locations.output, sm=data.index),
+        expand(rules.fiber_locations.output, sm=MANIFEST.index),
+        expand(rules.filtered_and_shuffled_fiber_locations.output, sm=MANIFEST.index),
         # coverage of elements
         expand(
             rules.element_coverages.output,
-            sm=data.index,
+            sm=MANIFEST.index,
             hp=not_unk,
             el_type=el_types,
         ),
         # FDR results
-        expand(rules.fdr_track.output, sm=data.index),
-        expand(rules.fdr_peaks_by_fire_elements.output, sm=data.index),
-        expand(rules.wide_fdr_peaks.output, sm=data.index),
-        expand(rules.peaks_vs_percent.output, sm=data.index),
-        expand(rules.one_percent_fdr_peaks.output, sm=data.index),
+        expand(rules.fdr_track.output, sm=MANIFEST.index),
+        expand(rules.fdr_peaks_by_fire_elements.output, sm=MANIFEST.index),
+        expand(rules.wide_fdr_peaks.output, sm=MANIFEST.index),
+        expand(rules.peaks_vs_percent.output, sm=MANIFEST.index),
+        expand(rules.one_percent_fdr_peaks.output, sm=MANIFEST.index),
+        expand(rules.fires_in_peaks.output.txt, sm=MANIFEST.index),
         # trackhub
-        expand(rules.fdr_track_to_bw.output.bw, sm=data.index, col=bw_types),
-        expand(rules.fdr_peaks_by_fire_elements_to_bb.output.bb, sm=data.index),
-        expand(rules.percent_accessible.output.bw, hp=not_unk, sm=data.index),
+        expand(rules.fdr_track_to_bw.output.bw, sm=MANIFEST.index, col=bw_types),
+        expand(rules.fdr_peaks_by_fire_elements_to_bb.output.bb, sm=MANIFEST.index),
+        expand(rules.percent_accessible.output.bw, hp=not_unk, sm=MANIFEST.index),
         expand(
             rules.element_coverages_bw.output.bw,
-            sm=data.index,
+            sm=MANIFEST.index,
             hp=not_unk,
             el_type=el_types,
         ),
-        expand(rules.trackhub.output, sm=data.index),
+        expand(rules.trackhub.output, sm=MANIFEST.index),
diff --git a/workflow/envs/env.yaml b/workflow/envs/env.yaml
@@ -7,7 +7,8 @@ dependencies:
   - samtools==1.19.1
   - htslib==1.19.1
   - bedtools==2.31
-  - fibertools-rs>=0.3.9
+  - bioconda::fibertools-rs==0.5.3
+  - bioconda::gia
   - seqtk
   - hck>=0.9.2
   - bioawk
@@ -17,5 +18,4 @@ dependencies:
   - parallel
   - mosdepth==0.3.7
   - bedops
-  #- ucsc-bedgraphtobigwig
-  #- ucsc-bedtobigbed
+  - bioconda::bigtools==0.5.1
diff --git a/workflow/envs/python.yaml b/workflow/envs/python.yaml
@@ -8,7 +8,7 @@ dependencies:
   - tqdm
   - defopt
   - numpy==1.24 # use this version to avoid cxx errors
-  - numba::numba>=0.56
+  - numba::numba==0.60
   - pandas==1.4 # pandas versions later than this have cxx errors
   - pysam==0.21
   - htslib==1.19.1

diff --git a/workflow/envs/runner.yaml b/workflow/envs/runner.yaml
@@ -0,0 +1,13 @@
+name: runner
+channels:
+  - numba
+  - conda-forge
+  - bioconda
+  - defaults
+dependencies:
+  - tqdm
+  - numpy
+  - pandas
+  - pip
+  - pip:
+      - pysam