Use setuptools-scm for Vamb versioning
This tool will automatically set the correct Vamb version based on Git info.
This is useful for several reasons:
* It makes it harder for us to mess up the versions on release
* When testing Vamb, the log file will tell us the exact commit used
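
setuptools-scm derives the version from `git describe`-style information. As a rough, hypothetical sketch (not setuptools-scm itself, which also handles dirty trees, branches, and other schemes), its default "guess-next-dev" + "node-and-date" behavior maps a tag plus commit distance to a PEP 440 version like this:

```python
import re

def scm_style_version(describe: str) -> str:
    """Roughly mimic setuptools-scm's default "guess-next-dev" scheme:
    turn `git describe --tags` output like "v4.1.3-12-ga5d947d" into a
    PEP 440 version like "4.1.4.dev12+ga5d947d". On an exact tag,
    return the tag's version unchanged. Simplified illustration only."""
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-(\d+)-g([0-9a-f]+))?", describe)
    if m is None:
        raise ValueError(f"unrecognized describe output: {describe!r}")
    major, minor, patch = int(m[1]), int(m[2]), int(m[3])
    distance, commit = m[4], m[5]
    if distance is None:
        # Exactly on a tag: a clean release version.
        return f"{major}.{minor}.{patch}"
    # Past the tag: bump the patch number and mark as a dev version,
    # embedding the commit hash so a log file identifies the exact commit.
    return f"{major}.{minor}.{patch + 1}.dev{distance}+g{commit}"

print(scm_style_version("v4.1.3"))             # → 4.1.3
print(scm_style_version("v4.1.3-12-ga5d947d")) # → 4.1.4.dev12+ga5d947d
```

This is why a test run's log can pin down the exact commit: the local version segment (`+ga5d947d`) carries the commit hash whenever the checkout is not exactly on a release tag.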
jakobnissen committed Nov 27, 2023
1 parent b10b2cc commit a5d947d
Showing 8 changed files with 41 additions and 138 deletions.
23 changes: 14 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ For more information about the implementation, methodological considerations, an
The Vamb package contains several programs, including three binners:
* __Vamb__: The original binner based on variational autoencoders. [Article](https://doi.org/10.1038/s41587-020-00777-4)
* __Avamb__: An ensemble model based on Vamb and adversarial autoencoders. [Article](https://doi.org/10.1038/s42003-023-05452-3).
Avamb produces better bins than Vamb, but is a more complex and computationally demanding pipeline.
Avamb produces somewhat better bins than Vamb, but is a more complex and computationally demanding pipeline.
See the [Avamb README page](https://github.com/RasmussenLab/avamb/tree/avamb_new/workflow_avamb) for more information.
* __TaxVamb__: A semi-supervised binner that uses taxonomy information from e.g. `mmseqs taxonomy`. [Article still in the works].
TaxVamb produces superior bins, but requires you to have run a taxonomic annotation workflow.
@@ -30,7 +30,7 @@ pip install vamb
:bangbang: An active Conda environment can hijack your system's linker, causing an error during installation. Either deactivate `conda`, or delete the `~/miniconda/compiler_compats` directory before installing with pip.

Alternatively, it can be installed as a [Bioconda package](https://anaconda.org/bioconda/vamb) (thanks to a contribution from Antônio Pedro Camargo).
The BioConda package does not include GPU support.
Currently, the Conda version is severely outdated, so we recommend installing with pip. Also, the BioConda package does not include GPU support.

```
conda install -c pytorch pytorch torchvision cudatoolkit=10.2
@@ -55,10 +55,10 @@ If you can't/don't want to use pip/Conda, you can do it the hard way: Install th

# Running Vamb
First, figure out what program you want to run:
* If you want a decent and simple binner, run `vamb bin default`
* If you want to bin and don't mind a more complex but performant workflow, run the [Avamb Snakemake workflow](https://github.com/RasmussenLab/avamb/tree/avamb_new/workflow_avamb)
* If you want to bin and are able to get taxonomic information, run `vamb bin taxvamb`
* If you want to refine existing taxonomic classification, run `vamb taxometer`
* If you want to bin and are able to get taxonomic information, run `vamb bin taxvamb`
* If you want to bin and don't mind a more complex but performant workflow, run the [Avamb Snakemake workflow](https://github.com/RasmussenLab/avamb/tree/avamb_new/workflow_avamb)
* If you want a decent and simple binner, run `vamb bin default`

For more command-line options, see the command-line help menu:
```
@@ -100,11 +100,16 @@ minimap2 -t 8 -N 5 -ax sr catalogue.mmi --split-prefix mmsplit /path/to/reads/sa
4. Run Vamb:

```
vamb bin default --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C
vamb bin basic --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C
```

5. Apply any desired postprocessing to Vamb's output.

## How to run: Using the Vamb Snakemake workflow
To make it even easier to run Vamb, we have created a [Snakemake](https://snakemake.readthedocs.io/en/stable/#) workflow.
This workflow runs steps 2-5 above, using `minimap2` to align and [CheckM](https://ecogenomics.github.io/CheckM/) to estimate completeness and contamination of the resulting bins.
The workflow can run on a local machine, a workstation, or an HPC system using `qsub`. It can be found in the `workflow` folder; see the file `workflow/README.md` for details.

# Detailed user instructions
See the tutorial in `doc/tutorial.md` for even more detailed instructions.

@@ -182,7 +187,7 @@ __5) Run Vamb__
By default, Vamb does not output any FASTA files of the bins. In the examples below, the option `--minfasta 200000` is set, meaning that all bins with a size of 200 kbp or more will be output as FASTA files.
Run Vamb with:

`vamb bin default -o SEP --outdir OUT --fasta FASTA --bamfiles BAM1 BAM2 [...] --minfasta 200000`,
`vamb bin basic -o SEP --outdir OUT --fasta FASTA --bamfiles BAM1 BAM2 [...] --minfasta 200000`,

where `SEP` is the separator chosen in step 3 (e.g. `C` in that example), `OUT` is the name of the output directory to create, `FASTA` is the path to the FASTA file, and `BAM1` is the path to the first BAM file. You can also use shell globbing to input multiple BAM files: `my_bamdir/*bam`.

@@ -197,8 +202,8 @@ Vamb will bin every input contig. Contigs that cannot be binned with other conti
The default hyperparameters of Vamb will provide good performance on any dataset. However, since running Vamb is fast (especially on GPUs), it is possible to run Vamb with different hyperparameters to see if better performance can be achieved (here we measure performance as the number of near-complete bins assessed by CheckM). We recommend trying to both increase and decrease the size of the neural network: we have used Vamb on datasets where a larger network resulted in more near-complete bins, and on other datasets where a smaller network did. To do this, you can run Vamb as follows (the default for multiple samples is `-l 32 -n 512 512`):

```
vamb bin default -l 24 -n 384 384 --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000
vamb bin default -l 40 -n 768 768 --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000
vamb bin basic -l 24 -n 384 384 --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000
vamb bin basic -l 40 -n 768 768 --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000
```

It is possible to try any combination of latent and hidden neurons, as well as other layer sizes. The number of near-complete bins can be assessed using CheckM and compared between runs. See the `workflow` folder for an automated way to run Vamb with multiple parameter settings.
10 changes: 7 additions & 3 deletions pyproject.toml
@@ -26,12 +26,14 @@ authors = [
url = "https://github.com/RasmussenLab/vamb"
description = "Variational and Adversarial autoencoders for Metagenomic Binning"
license = "MIT"
[tool.setuptools.dynamic]
version = {attr = "vamb.__version__"}
readme = {file = "README.md"}

[build-system]
requires = ["setuptools ~= 63.0", "Cython ~= 0.29.5"]
requires = [
"setuptools ~= 64.0",
"setuptools-scm >= 8.0",
"Cython ~= 0.29.5"
]
build-backend = "setuptools.build_meta"

[tool.ruff]
@@ -43,3 +45,5 @@ filterwarnings = [
"error",
"ignore::UserWarning",
]

[tool.setuptools_scm]
102 changes: 0 additions & 102 deletions test/ci.py

This file was deleted.

10 changes: 0 additions & 10 deletions test/test_parsecontigs.py
@@ -31,16 +31,6 @@ def setUp(self):
self.io.seek(0)
self.large_io.seek(0)

def test_only_ns(self):
file = io.BytesIO()
file.write(b">abc\n")
file.write(b"N" * 2500)
file.write(b"\n")
file.seek(0)

with self.assertRaises(ValueError):
Composition.from_file(file)

def test_unique_names(self):
with self.assertRaises(ValueError):
CompositionMetaData(
4 changes: 2 additions & 2 deletions vamb/__init__.py
@@ -2,8 +2,6 @@
Documentation: https://github.com/RasmussenLab/vamb/
"""

__version__ = (4, 1, 3)

from . import vambtools
from . import parsebam
from . import parsecontigs
@@ -15,8 +13,10 @@
from . import taxvamb_encode
from . import reclustering

from importlib.metadata import version as get_version
from loguru import logger

__version_str__ = get_version("vamb")
logger.remove()

__all__ = [
6 changes: 3 additions & 3 deletions vamb/__main__.py
@@ -1413,7 +1413,7 @@ def run(self):
)
logger.add(sys.stderr, format=format_log)
begintime = time.time()
logger.info("Starting Vamb version " + ".".join(map(str, vamb.__version__)))
logger.info("Starting Vamb version " + vamb.__version_str__)
logger.info("Random seed is " + str(self.vamb_options.seed))
self.run_inner()
logger.info(f"Completed Vamb in {round(time.time() - begintime, 2)} seconds.")
@@ -2071,7 +2071,7 @@ def add_reclustering_arguments(subparser):

def main():
doc = f"""
Version: {'.'.join([str(i) for i in vamb.__version__])}
Version: {vamb.__version_str__}
Default use, good for most datasets:
vamb bin default --outdir out --fasta my_contigs.fna --bamfiles *.bam -o C
@@ -2091,7 +2091,7 @@ def main():
helpos.add_argument(
"--version",
action="version",
version=f'Vamb {".".join(map(str, vamb.__version__))}',
version=f"Vamb {vamb.__version_str__}",
)

if len(sys.argv) == 1:
11 changes: 11 additions & 0 deletions vamb/encode.py
@@ -99,6 +99,17 @@ def make_dataloader(
"One or more samples have zero depth in all sequences, so cannot be depth normalized"
)
rpkm *= 1_000_000 / sample_depths_sum

zero_tnf = tnf.sum(axis=1) == 0
smallest_index = _np.argmax(zero_tnf)
if zero_tnf[smallest_index]:
raise ValueError(
f"TNF row at index {smallest_index} is all zeros. "
+ "This implies that the sequence contained no 4-mers of A, C, G, T or U, "
+ "making this sequence uninformative. This is probably a mistake. "
+ "Verify that the sequence contains usable information (e.g. is not all N's)"
)

total_abundance = rpkm.sum(axis=1)

# Normalize rpkm to sum to 1
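
The check added to `encode.py` uses a handy NumPy idiom: `np.argmax` on a boolean mask returns the index of the *first* `True` value, so one `argmax` plus one lookup finds the first all-zero row without a Python-level loop. Note the lookup is required because `argmax` returns 0 when the mask is all `False`. A standalone sketch of the same trick (the function name is illustrative, not from the Vamb codebase):

```python
import numpy as np

def first_zero_row(matrix: np.ndarray):
    """Return the index of the first all-zero row, or None if there is none.

    np.argmax on a boolean array returns the index of the first True;
    if the array is all False it returns 0, so the candidate index must
    be checked against the mask before being trusted."""
    zero_rows = matrix.sum(axis=1) == 0
    candidate = int(np.argmax(zero_rows))
    return candidate if zero_rows[candidate] else None

tnf = np.array([[1, 2], [3, 0], [0, 0], [5, 6]])
print(first_zero_row(tnf))              # → 2
print(first_zero_row(np.ones((3, 2))))  # → None
```

In the diff above, finding such a row is an error: an all-zero TNF row means the sequence contained no A/C/G/T/U 4-mers and is uninformative, so Vamb raises a `ValueError` naming the offending index.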
13 changes: 4 additions & 9 deletions vamb/parsecontigs.py
@@ -178,25 +178,20 @@ def from_file(
lengths = _vambtools.PushArray(_np.int32)
mask = bytearray() # we convert to Numpy at end
contignames: list[str] = list()
minimum_seen_length = 2_000_000_000

entries = _vambtools.byte_iterfasta(filehandle)

for entry in entries:
length = len(entry)
minimum_seen_length = min(minimum_seen_length, length)
skip = length < minlength
mask.append(not skip)

if skip:
continue

counts = entry.kmercounts(4)
if counts.sum() == 0:
raise ValueError(
f'TNF value of contig "{entry.header}" is all zeros. '
+ "This implies that the sequence contained no 4-mers of A, C, G, T or U, "
+ "making this sequence uninformative. This is probably a mistake. "
+ "Verify that the sequence contains usable information (e.g. is not all N's)"
)
raw.extend(counts)
raw.extend(entry.kmercounts(4))

if len(raw) > 256000:
Composition._convert(raw, projected)
