Skip to content

Commit

Permalink
Development branch for the next release (#29)
Browse files Browse the repository at this point in the history
This is a PR for the next version of FIRE that will be more robust to denovo assemblies.

Some feats:

Save the Fiber-seq data in cram format
Replace ucsc utilities with bigtools
Replace some bedtools opts with gia
Switch to polars for coverage calculation
Rename config options to uppercase variables
fix #28 incorrect counts in a plot
fix #26 so that fire appears to be deterministic now; however, this is not yet guaranteed.
  • Loading branch information
mrvollger authored Jul 31, 2024
1 parent 0810fc8 commit 99d40f1
Show file tree
Hide file tree
Showing 22 changed files with 425 additions and 230 deletions.
18 changes: 18 additions & 0 deletions INSTALL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Install

You will need **snakemake** which you can install using conda/mamba, e.g.:
```
mamba create -c conda-forge -c bioconda -n snakemake 'snakemake>=8.4'
```

Finally, if you wish to distribute jobs across a cluster you will need to install the appropriate [snakemake executor plugin](https://snakemake.github.io/snakemake-plugin-catalog/). For example, to use SLURM you can install the `snakemake-executor-slurm` plugin using pip:
```
pip install snakemake-executor-plugin-slurm
```

We recommend adding a snakemake conda prefix to your `bashrc`, e.g. in the Stergachis lab add:
```bash
export SNAKEMAKE_CONDA_PREFIX=/mmfs1/gscratch/stergachislab/snakemake-conda-envs
export APPTAINER_CACHEDIR=/mmfs1/gscratch/stergachislab/snakemake-conda-envs/apptainer-cache
```
Then snakemake installs all the additional requirements as conda envs in that directory.
20 changes: 1 addition & 19 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,25 +6,7 @@ A Snakemake workflow for calling Fiber-seq Inferred Regulatory Elements (FIREs)

## Install

You will need **snakemake** and all the **UCSC Kent utilities** and the **latest version** of them (v455).

You can install snakemake using conda/mamba, e.g.:
```
mamba create -c conda-forge -c bioconda -n snakemake 'snakemake>=8.4'
```

You can find the UCSC kent utilities at [this url](http://hgdownload.soe.ucsc.edu/admin/exe/). You will need to add the directory containing the utilities to your `PATH` environment variable.

Finally, if you wish to distribute jobs across a cluster you will need to install the appropriate [snakemake executor plugin](https://snakemake.github.io/snakemake-plugin-catalog/). For example, to use SLURM you can install the `snakemake-executor-slurm` plugin using pip:
```
pip install snakemake-executor-plugin-slurm
```

We recommend adding a snakemake conda prefix to your `bashrc`, e.g. in the Stergachis lab add:
```bash
export SNAKEMAKE_CONDA_PREFIX=/mmfs1/gscratch/stergachislab/snakemake-conda-envs
```
Then snakemake installs all the additional requirements as conda envs in that directory.
Please install `snakemake` and all the UCSC Kent utilities. For detailed instructions see the [installation README](/INSTALL.md).

## Configuring

Expand Down
20 changes: 20 additions & 0 deletions _config.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
#theme: jekyll-theme-minimal
#theme: minima
remote_theme: jekyll/minima
#logo: /assets/img/fiber_tools_grey.png

minima:
skin: dark

defaults:
- scope:
path: "README.md"
values:
permalink: "index.html"

exclude:
- target
- models
- dists
- tmp.*
- temp
2 changes: 1 addition & 1 deletion config/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ Reference `fasta` file:
```
ref: /path/to/hg38.fa
```
Manifest of input sample(s), must have two white-space separated columns: sample name (`sample`) and input bam file path (`bam`). See `config.tbl` for an example.
Manifest of input sample(s), must have two white-space separated columns: sample name (`sample`) and input bam file path (`bam`). See `config.tbl` for an example. The `bam` file must be indexed and aligned to the reference genome in the `ref` option.
```
manifest: config/config.tbl
```
Expand Down
2 changes: 0 additions & 2 deletions docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -102,9 +102,7 @@ The peak results are reported in the file `FDR-peaks/FDR-FIRE-peaks.bed.gz` with
| | {sample}.{median, maximum,minimum}.coverage.txt | Allowed coverage range for analysis |
| fiber-calls | | |
| | FIRE.bed.gz | Every FIRE element in every fiber |
| | model.results.bed.gz | Every FIRE, linker, and nucleosome in every fiber |
| | fire-fibers.bed.gz | bed12 start and end of every fiber |
| | fire-fiber-decorators.bed.gz | decorator file that adds annotations of FIRE elements to the bed12 file |
| FDR-peaks | | |
| | FDR-FIRE-peaks.bed.gz | Fiber-seq peak calls |
| | FDR-wide-peaks.bed.gz | Fiber-seq wide peak calls |
Expand Down
2 changes: 1 addition & 1 deletion fire
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ if [[ "${has_config}" == false ]]; then
fi

# check for executables
for x in snakemake bedToBigBed bedGraphToBigWig; do
for x in snakemake; do
if [[ ! $(type -P "${x}") ]]; then
echo "Error: ${x} not found in PATH, but it is required for FIRE."
exit 1
Expand Down
111 changes: 57 additions & 54 deletions workflow/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -4,60 +4,64 @@ import sys
import os
from snakemake.utils import min_version

min_version("8.4.0")
min_version("8.12.0")

# forces pandas pysam etc to be available in the environment
conda: "envs/runner.yaml"

# Force users to use the same underlying OS via singularity.
# container: "docker://condaforge/mambaforge:23.3.1-1"
# container: "docker://continuumio/miniconda3"


ref = config["ref"]
ref_name = config["ref_name"]
# setup shared functions
include: "rules/common.smk"

dhs = config.get("dhs", workflow.source_path("annotations/GM12878_DHS.bed.gz"))
excludes = config.get("excludes", [])
if ref_name == "hg38" or ref_name == "GRCh38":
files = [
"annotations/hg38.blacklist.ENCFF356LFX.bed.gz",
"annotations/hg38.gap.bed.gz",
"annotations/SDs.merged.hg38.bed.gz",
]
excludes += [workflow.source_path(file) for file in files]

max_t = config.get("max_t", 4)
keep_chrs = config.get("keep_chromosomes", ".*")
# reference genome and reference regions
REF = get_ref()
FAI = get_fai()
REF_NAME = config["ref_name"]
EXCLUDES = get_excludes()
KEEP_CHRS = config.get("keep_chromosomes", ".*")

# coverage requirements
min_coverage = config.get("min_coverage", 4)
coverage_within_n_sd = config.get("coverage_within_n_sd", 5)
MIN_COVERAGE = config.get("min_coverage", 4)
COVERAGE_WITHIN_N_SD = config.get("coverage_within_n_sd", 5)

# sample, haplotype, and chromosome wildcard building
fai = pd.read_csv(f"{ref}.fai", sep="\t", names=["chr", "length", "x", "y", "z"])
default_env = config.get("env", "../envs/env.yaml")
data = pd.read_csv(config["manifest"], sep=r"\s+", comment="#").set_index("sample")
haps = ["all", "hap1", "hap2", "unk"]
not_all = ["hap1", "hap2", "unk"]
not_unk = ["all", "hap1", "hap2"]
min_fire_fdr = config.get("min_fire_fdr", 0.10)
FAI_DF = get_fai_df()
DEFAULT_ENV = config.get("env", "../envs/env.yaml")
MANIFEST = get_manifest()
MIN_FIRE_FDR = config.get("min_fire_fdr", 0.10)

# FDR / peak calling thresholds
max_peak_fdr = config.get("max_peak_fdr", 0.05)
min_per_acc_peak = config.get("min_per_acc_peak", None)
if min_per_acc_peak is None:
min_per_acc_peak = 0.0
MAX_PEAK_FDR = config.get("max_peak_fdr", 0.05)
MIN_PER_ACC_PEAK = config.get("min_per_acc_peak", None)
if MIN_PER_ACC_PEAK is None:
MIN_PER_ACC_PEAK = 0.0
else:
max_peak_fdr = 1.0

MAX_PEAK_FDR = 1.0
MIN_FRAC_ACCESSIBLE = config.get("min_frac_accessible", 0)

# Misc sets of wildcards
haps = ["all", "hap1", "hap2", "unk"]
not_unk = ["all", "hap1", "hap2"]
types = ["fdr", "acc", "link", "nuc"]
types_to_col = {"fdr": 4, "acc": 5, "link": 6, "nuc": 7}

bw_types = ["log_FDR"] # "score", "FDR",
bw_types = bw_types + [f"{t}_H1" for t in bw_types] + [f"{t}_H2" for t in bw_types]
el_types = ["fire", "linker", "nucleosome"]

# developer options
FT_EXE = config.get("ft_exe", "ft")
ONT = config.get("ont", False)
USE_ONT = ""
if ONT:
USE_ONT = " --ont --ml 225 "
MIN_UNRELIABLE_COVERAGE_LEN = config.get("min_unreliable_coverage_len", 50)


include: "rules/common.smk"
include: "rules/apply-model.smk"
include: "rules/coverages.smk"
include: "rules/FDR-peaks.smk"
Expand All @@ -66,13 +70,10 @@ include: "rules/decorated-reads.smk"
include: "rules/track-hub.smk"


chroms = get_chroms()


wildcard_constraints:
chrom="|".join(chroms),
chrom="|".join(get_chroms()),
call="|".join(["msp", "m6a"]),
sm="|".join(data.index),
sm="|".join(MANIFEST.index),
types="|".join(types),
fdr=r"\d+",
hp="|".join(haps),
Expand All @@ -83,39 +84,41 @@ wildcard_constraints:
localrules:
trackhub,


rule all:
input:
# coverage information
expand(rules.genome_bedgraph.output, sm=data.index),
expand(rules.coverage.output, sm=data.index),
expand(rules.exclude_from_shuffle.output, sm=data.index),
expand(rules.unreliable_coverage_regions.output, sm=data.index),
expand(rules.genome_bedgraph.output, sm=MANIFEST.index),
expand(rules.coverage.output, sm=MANIFEST.index),
expand(rules.exclude_from_shuffle.output, sm=MANIFEST.index),
expand(rules.unreliable_coverage_regions.output, sm=MANIFEST.index),
# model results
expand(rules.fire_sites.output, sm=data.index),
expand(rules.fire_sites.output, sm=MANIFEST.index),
# fiber locations
expand(rules.fiber_locations.output, sm=data.index),
expand(rules.filtered_and_shuffled_fiber_locations.output, sm=data.index),
expand(rules.fiber_locations.output, sm=MANIFEST.index),
expand(rules.filtered_and_shuffled_fiber_locations.output, sm=MANIFEST.index),
# coverage of elements
expand(
rules.element_coverages.output,
sm=data.index,
sm=MANIFEST.index,
hp=not_unk,
el_type=el_types,
),
# FDR results
expand(rules.fdr_track.output, sm=data.index),
expand(rules.fdr_peaks_by_fire_elements.output, sm=data.index),
expand(rules.wide_fdr_peaks.output, sm=data.index),
expand(rules.peaks_vs_percent.output, sm=data.index),
expand(rules.one_percent_fdr_peaks.output, sm=data.index),
expand(rules.fdr_track.output, sm=MANIFEST.index),
expand(rules.fdr_peaks_by_fire_elements.output, sm=MANIFEST.index),
expand(rules.wide_fdr_peaks.output, sm=MANIFEST.index),
expand(rules.peaks_vs_percent.output, sm=MANIFEST.index),
expand(rules.one_percent_fdr_peaks.output, sm=MANIFEST.index),
expand(rules.fires_in_peaks.output.txt, sm=MANIFEST.index),
# trackhub
expand(rules.fdr_track_to_bw.output.bw, sm=data.index, col=bw_types),
expand(rules.fdr_peaks_by_fire_elements_to_bb.output.bb, sm=data.index),
expand(rules.percent_accessible.output.bw, hp=not_unk, sm=data.index),
expand(rules.fdr_track_to_bw.output.bw, sm=MANIFEST.index, col=bw_types),
expand(rules.fdr_peaks_by_fire_elements_to_bb.output.bb, sm=MANIFEST.index),
expand(rules.percent_accessible.output.bw, hp=not_unk, sm=MANIFEST.index),
expand(
rules.element_coverages_bw.output.bw,
sm=data.index,
sm=MANIFEST.index,
hp=not_unk,
el_type=el_types,
),
expand(rules.trackhub.output, sm=data.index),
expand(rules.trackhub.output, sm=MANIFEST.index),
6 changes: 3 additions & 3 deletions workflow/envs/env.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,8 @@ dependencies:
- samtools==1.19.1
- htslib==1.19.1
- bedtools==2.31
- fibertools-rs>=0.3.9
- bioconda::fibertools-rs==0.5.3
- bioconda::gia
- seqtk
- hck>=0.9.2
- bioawk
Expand All @@ -17,5 +18,4 @@ dependencies:
- parallel
- mosdepth==0.3.7
- bedops
#- ucsc-bedgraphtobigwig
#- ucsc-bedtobigbed
- bioconda::bigtools==0.5.1
2 changes: 1 addition & 1 deletion workflow/envs/python.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ dependencies:
- tqdm
- defopt
- numpy==1.24 # use this version to avoid cxx errors
- numba::numba>=0.56
- numba::numba==0.60
- pandas==1.4 # pandas versions later than this have cxx errors
- pysam==0.21
- htslib==1.19.1
Expand Down
13 changes: 13 additions & 0 deletions workflow/envs/runner.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
name: runner
channels:
- numba
- conda-forge
- bioconda
- defaults
dependencies:
- tqdm
- numpy
- pandas
- pip
- pip:
- pysam
Loading

0 comments on commit 99d40f1

Please sign in to comment.