Enable background subtraction / file unzipping #118

Open
wants to merge 35 commits into
base: main

Commits on Oct 13, 2023

  1. fix: glob all files without exclusion of bg. fix bg getting

    changelog:
    - All files are streamed to input files, rather than just files without associated background files.
    - Background filenames (stripped of extensions) are no longer part of the input stream for `rules.score`, preventing odd errors
    - Updated associated files to pull all input files as desired.
    christinehc committed Oct 13, 2023
    58591e6

Commits on Nov 3, 2023

  1. [!broken!] fix: enable unzip, redo file glob, add background

    changelog:
    - NOTE THAT THE WORKFLOW IS CURRENTLY BROKEN DUE TO SNAKEMAKE I/O REASONS AND I AM COMMITTING INTERIM CHANGES
    - fix: redo file glob -- file globbing now proceeds through `glob_wildcards` to more cleanly grab input files
    - fix: enable unzip -- unzipping has been overhauled (these are forward changes adapted from snekmer 2.0.0 / the biotite-kmers branch).
    - fix: add background -- changes have been made to collate background files and use their kmer distribution to subtract a background from protein family kmer models. These fixes work piece-by-piece locally but have not been fully tested and may not work ideally yet.
    christinehc committed Nov 3, 2023
    8c0f312

Commits on Nov 7, 2023

  1. [!broken!] build: tweak workflow to attempt snakemake debug

    - Note: changes did NOT work, hence the "broken" tag.
    christinehc committed Nov 7, 2023
    be53b22

Commits on Nov 8, 2023

  1. [!broken!] fix: pipe background i/o, update filenames

    changelog:
    - snakemake now correctly builds DAG for background workflow, including file unzipping
    - some files have been renamed for simplicity
    - some instances of `skm.io.load_npz` have been replaced with `np.load` due to KeyError (perhaps due to numpy or pickle version?)
    - `rules.combine_background` now uses kmer basis set for each family to reshape each background vector. should make files smaller and workflow more compact
    - NOTE: WORKFLOW IS BROKEN AT `rules.score_with_background` due to file load / array shape issues that will be fixed in the next commit.
    - addresses #37
    christinehc committed Nov 8, 2023
    7809d10
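
The basis reshaping described for `rules.combine_background` in 7809d10 can be pictured as re-indexing a background count vector onto a family's kmer basis. The sketch below is a plain-Python illustration, not the code from the rule itself; the helper name `reindex_to_basis` and the toy kmers are hypothetical, and it assumes that kmers absent from the background count as zero.

```python
def reindex_to_basis(bg_kmers, bg_counts, family_basis):
    """Return background counts ordered like family_basis (hypothetical helper)."""
    # Map each background kmer to its count for constant-time lookup.
    lookup = dict(zip(bg_kmers, bg_counts))
    # Keep only kmers in the family basis; assume 0 for kmers the background lacks.
    return [lookup.get(kmer, 0) for kmer in family_basis]


bg_kmers = ["AAA", "AAC", "ACG", "CGT"]
bg_counts = [5, 2, 7, 1]
family_basis = ["ACG", "TTT", "AAA"]

print(reindex_to_basis(bg_kmers, bg_counts, family_basis))  # [7, 0, 5]
```

Because the output has one entry per family kmer rather than one per background kmer, the stored vector is only as long as the family basis, which is consistent with the smaller files the changelog anticipates.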

Commits on Nov 28, 2023

  1. feat: update kmer probability scoring for background subtract

    changelog:
    - kmer probability scoring using background subtraction is now the default scoring method
    - `snekmer.score.feature_class_probabilities` now performs either background subtraction based scoring, family label based scoring, or a combination thereof depending on user input
    - TODO: integration with `snekmer.score.KmerScorer` object
    christinehc committed Nov 28, 2023
    3e7b10d

Commits on Dec 12, 2023

  1. chore: update config, tick version, and do file cleanup

    changelog:
    - new config parameter config['score']['method'] added for compatibility with additional new(!) scoring methods
    - uptick version from v1.1.1 -> v1.4.0
      - upticked +3 minor versions in anticipation of two pending PRs
    - remove no longer needed files
    christinehc committed Dec 12, 2023
    f73f17e
  2. feat: enable kmer scoring via background subtraction (fixes #37)

    changelog:
    - kmers can now be scored with a probability score that subtracts the kmers observed in a supplied background set, in the family set, or in both combined
      - note: some column headers have changed, which may affect downstream analysis (e.g. integration with #115, #116)
    - to handle user-supplied background files, new rules have been created to count background kmers and combine background kmer counts into a background matrix. The appropriate files for the new workflow have been created.
    - extensive changes have been made to `snekmer.score` to accommodate the new changes, including:
      - `snekmer.score.score` now has 3 distinct formulae to compute probability scores according to the desired scoring method
      - `snekmer.score.feature_class_probabilities` now also integrates the scoring method
    - the main scoring rule itself has been significantly altered as follows:
      - all references to the old and not-working "background subtraction" (e.g. separating sequences by "sample" or "background" labels) have been removed
      - extraneous kmer probability scores for every family are no longer calculated; only the family in question's kmer profile is scored
      - scoring method now integrated
    christinehc committed Dec 12, 2023
    f394fee
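
The changelogs for f73f17e and f394fee mention three scoring formulae selected through `config['score']['method']` but do not spell them out. The sketch below shows how such a dispatch might look. Everything in it is an assumption for illustration: the function name `kmer_score`, the method name `"family"`, and above all the formulae (simple differences of per-kmer probabilities) are hypothetical and may not match `snekmer.score.score`.

```python
def kmer_score(p_family, p_background=0.0, p_other_families=0.0, method="background"):
    """Toy probability score for one kmer (assumed formulae, not Snekmer's)."""
    if method == "background":
        # Subtract the probability of seeing the kmer in the background set.
        return p_family - p_background
    if method == "family":
        # Subtract the probability of seeing the kmer in the other families.
        return p_family - p_other_families
    if method == "combined":
        # Subtract both the background and the other-family probabilities.
        return p_family - p_background - p_other_families
    raise ValueError(f"unknown scoring method: {method!r}")


config = {"score": {"method": "combined"}}
score = kmer_score(
    0.75,
    p_background=0.25,
    p_other_families=0.125,
    method=config["score"]["method"],
)
print(score)  # 0.375
```

Raising on an unrecognized method keeps a typo in the config from silently falling back to a different formula.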

Commits on Dec 21, 2023

  1. 4ee5d4a

Commits on Dec 22, 2023

  1. fix: rework default filename parsing

    changelog:
    - fix: `snekmer.utils.get_family` now accepts `regex=None` by default so as not to erroneously truncate filenames.
    - fix: small change to `snekmer.utils.get_family` to correctly identify directories.
    - refactor: overhaul `snekmer.utils.split_file_ext` to split at the point of an .faa, .fa, .fna, or .fasta extension instead of assuming at most 2 potential extensions
    christinehc committed Dec 22, 2023
    06e21d9
  2. 1344d93
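
The `split_file_ext` refactor in 06e21d9 splits a filename at its FASTA extension rather than assuming a fixed number of extensions. A minimal sketch of that idea follows; the function name `split_fasta_ext`, the return format (extension without a leading dot), and the behavior when no FASTA extension is found are assumptions, not the signature of `snekmer.utils.split_file_ext`.

```python
import re

# Match the first .faa/.fa/.fna/.fasta extension. The lookahead requires the
# extension to end at a dot or at the end of the name, so a name such as
# "my.family.faa" is not split at ".fa" inside ".family".
_FASTA_EXT = re.compile(r"\.(fasta|faa|fna|fa)(?=\.|$)")


def split_fasta_ext(filename):
    """Split filename into (stem, extension) at the FASTA extension."""
    match = _FASTA_EXT.search(filename)
    if match is None:
        return filename, ""  # assumption: no FASTA extension means no split
    return filename[: match.start()], filename[match.start() + 1 :]


print(split_fasta_ext("family.v2.faa.gz"))  # ('family.v2', 'faa.gz')
print(split_fasta_ext("my.family.faa"))     # ('my.family', 'faa')
```

Splitting on the extension itself means dots earlier in the name stay in the stem, which is the truncation problem the commit describes.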

Commits on Jan 5, 2024

  1. 35e0b49
  2. refactor: deprecate process.smk for file unzipping

    changelog:
    - file unzipping is now handled by top-level unzip code in each snakefile; thus, `process.smk` is outdated and has been deleted as it is no longer needed.
    christinehc committed Jan 5, 2024
    2625da7
  3. refactor: apply file wildcard globbing changes to cluster,search

    changelog:
    - file wildcard globbing previously proceeded through `glob.glob`, but had been updated in the model workflow to use snakemake's `glob_wildcards` utility. This method has the added benefit of preventing recursion errors with wildcard retrieval from gzipped files. The changes have now been applied to cluster and search workflows.
    christinehc committed Jan 5, 2024
    3309880
  4. 85a009f
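
Commit 2625da7 moves unzipping into top-level code in each snakefile. The snippet below is a standard-library sketch of what such a step could look like, not the code from the PR; the function name `unzip_inputs` and the choice to leave the `.gz` archive in place and skip files that are already decompressed are assumptions.

```python
import gzip
import shutil
from pathlib import Path


def unzip_inputs(input_dir):
    """Decompress each *.gz file in input_dir; return the newly written paths."""
    written = []
    for gz_path in sorted(Path(input_dir).glob("*.gz")):
        target = gz_path.with_suffix("")  # "fam.faa.gz" -> "fam.faa"
        if target.exists():
            continue  # assumption: never overwrite an existing unzipped file
        # Stream the decompressed bytes so large files are not held in memory.
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(target)
    return written
```

Running this before any rule is defined means the decompressed files already exist when the workflow collects its inputs.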

Commits on Jan 8, 2024

  1. fix,refactor: repair cluster mode unzip and file globbing

    changelog:
    - refactor: move `cluster_cluster.py` -> `cluster.py`
    - refactor: move cluster report generation to separate script directive
    - fix: change cluster mode file globbing to mirror model mode changes, i.e. uses snakemake `glob_wildcards` instead of python `glob.glob`. This should also fix unzipping issues and recursion errors related to unzipping.
    christinehc committed Jan 8, 2024
    4c9a71d

Commits on Jan 9, 2024

  1. fix,style: update search file glob. apply snakefmt

    changelog:
    - fix: search file globbing updated to use snakemake's `glob_wildcards` rather than python's `glob.glob` in search mode. Should also resolve issues with file detection for files requiring unzipping and avoid recursion errors. Tested locally with a small subset of small families.
    - style: applied snakefmt to `cluster.smk` and `search.smk`
    christinehc committed Jan 9, 2024
    60fb8d2
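
Several commits above (3309880, 4c9a71d, 60fb8d2) switch from `glob.glob` to Snakemake's `glob_wildcards`. The difference is that a wildcard pattern returns the matched values directly instead of full paths that must be parsed afterwards. The sketch below imitates that idea with the standard library only so it runs without Snakemake; `match_wildcards` is a hypothetical stand-in, not Snakemake's implementation, and it assumes each wildcard name appears once in the pattern.

```python
import re


def match_wildcards(pattern, filenames):
    """Collect wildcard values from filenames matching a '{name}'-style pattern."""
    # Splitting on "{name}" yields literal text at even indices and
    # wildcard names at odd indices.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    compiled = re.compile(regex)
    names = parts[1::2]
    found = {name: [] for name in names}
    for fname in filenames:
        match = compiled.fullmatch(fname)
        if match:
            for name in names:
                found[name].append(match.group(name))
    return found


files = ["fam1.faa", "fam2.faa", "notes.txt"]
print(match_wildcards("{nb}.faa", files))  # {'nb': ['fam1', 'fam2']}
```

In a real snakefile the built-in `glob_wildcards` should be used; this version only shows why the pattern-based approach avoids a separate filename-parsing step.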

Commits on Jan 30, 2024

  1. 6a8c042

Commits on Jan 31, 2024

  1. abaac01
  2. 15421ad

Commits on Feb 1, 2024

  1. 5ed97ad
  2. 270f1d9
  3. 43a8ccf

Commits on Feb 7, 2024

  1. feat,refactor: add --resources flag to CLI and streamline

    changelog:
    - feat: Snakemake `--resources` flag has been added to Snekmer CLI for all modes and tested locally.
    - refactor: Wrapped all snakemake command line arguments into a dictionary, which is now passed to all snekmer subcommands. This removes the redundancy of specifying the same command line arguments every time a subcommand is called.
    christinehc committed Feb 7, 2024
    2d72548
  2. build: uptick version

    christinehc committed Feb 7, 2024
    5e39a7f
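
One possible reading of the refactor in 2d72548 is that shared options are declared once in a dictionary and attached to every subcommand. The sketch below shows that pattern with `argparse`. It is an illustration only: `SHARED_ARGS` and `build_parser` are hypothetical names, and apart from `--resources` the flags and their defaults are assumed rather than taken from the Snekmer CLI.

```python
import argparse

# Shared options declared once; flag set and defaults are illustrative.
SHARED_ARGS = {
    "--cores": {"type": int, "default": 1, "help": "number of cores"},
    "--resources": {"nargs": "*", "default": [], "help": "NAME=INT resource limits"},
    "--dry-run": {"action": "store_true", "help": "plan jobs without running them"},
}


def build_parser():
    parser = argparse.ArgumentParser(prog="snekmer")
    subparsers = parser.add_subparsers(dest="mode", required=True)
    for mode in ("cluster", "model", "search"):
        sub = subparsers.add_parser(mode)
        # Every subcommand receives the same options from the one dictionary.
        for flag, kwargs in SHARED_ARGS.items():
            sub.add_argument(flag, **kwargs)
    return parser


args = build_parser().parse_args(
    ["model", "--resources", "mem_mb=4000", "--cores", "4"]
)
print(args.mode, args.cores, args.resources)  # model 4 ['mem_mb=4000']
```

Adding a new shared flag then means editing one dictionary entry instead of each subcommand definition.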

Commits on Feb 20, 2024

  1. fix,refactor: resolve array shapes. streamline code

    changelog:
    - fix: resolve error with array shapes due to matrix dimensions (transpose matrix required)
    - refactor: renamed variables to streamline code
    christinehc committed Feb 20, 2024
    20023fd

Commits on Mar 13, 2024

  1. ba6df73

Commits on Apr 16, 2024

  1. 710d2ba
  2. 679af3d

Commits on May 7, 2024

  1. fix: resolve array shape mismatches

    changelog:
    - basis harmonization now accounts for either 1D or 2D array cases
    - 1D arrays are explicitly handled to match expected shape parameters set by the assumption that input arrays are 2D
    - `utils.check_n_seqs` now uses boolean input arg to handle gz files rather than inferring from filename
    christinehc committed May 7, 2024
    31d3707
  2. c40e09f
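
Commit 31d3707 changes `utils.check_n_seqs` to take an explicit boolean for gzipped input instead of inferring it from the filename. The sketch below shows the general shape of that design with a hypothetical function `count_sequences`; the parameter name `gzipped` and the counting logic (one record per `>` header line) are assumptions, since the real signature is not shown in the changelog.

```python
import gzip


def count_sequences(path, gzipped=False):
    """Count FASTA records by counting header lines that start with '>'."""
    # The caller states whether the file is gzipped; the filename is not inspected.
    opener = gzip.open if gzipped else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

An explicit flag keeps the function correct for files whose names do not end in `.gz`, for example archives that were renamed upstream.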

Commits on May 8, 2024

  1. fix: verify bg file presence for all modes. bypass unicode error

    changelog:
    - Workflow now accounts for cases where no background files are included when either the "combined" or the "background" mode is selected. (TODO: raise warning in this case)
    - Bypass UnicodeDecodeError for `utils.check_n_seqs`
    christinehc committed May 8, 2024
    Configuration menu
    Copy the full SHA
    916806e View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    5d3ffac View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    22dd1c5 View commit details
    Browse the repository at this point in the history

Commits on May 10, 2024

  1. 163239c

Commits on May 28, 2024

  1. 0de98d7