Use setuptools-scm for Vamb versioning
This tool will automatically set the correct Vamb version based on Git info.
This is useful for several reasons:
* It makes it harder for us to mess up the versions on release
* When testing Vamb, the log file will tell us the exact commit used
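
setuptools-scm derives the version from `git describe`-style information. As a rough, hypothetical sketch (not setuptools-scm itself, which also handles dirty trees, branches, and other schemes), its default "guess-next-dev" + "node-and-date" behavior maps a tag plus commit distance to a PEP 440 version like this:

```python
import re

def scm_style_version(describe: str) -> str:
    """Roughly mimic setuptools-scm's default "guess-next-dev" scheme:
    turn `git describe --tags` output like "v4.1.3-12-ga5d947d" into a
    PEP 440 version like "4.1.4.dev12+ga5d947d". On an exact tag,
    return the tag's version unchanged. Simplified illustration only."""
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-(\d+)-g([0-9a-f]+))?", describe)
    if m is None:
        raise ValueError(f"unrecognized describe output: {describe!r}")
    major, minor, patch = int(m[1]), int(m[2]), int(m[3])
    distance, commit = m[4], m[5]
    if distance is None:
        # Exactly on a tag: a clean release version.
        return f"{major}.{minor}.{patch}"
    # Past the tag: bump the patch number and mark as a dev version,
    # embedding the commit hash so a log file identifies the exact commit.
    return f"{major}.{minor}.{patch + 1}.dev{distance}+g{commit}"

print(scm_style_version("v4.1.3"))             # → 4.1.3
print(scm_style_version("v4.1.3-12-ga5d947d")) # → 4.1.4.dev12+ga5d947d
```

This is why a test run's log can pin down the exact commit: the local version segment (`+ga5d947d`) carries the commit hash whenever the checkout is not exactly on a release tag.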
jakobnissen committed Nov 27, 2023
1 parent b10b2cc commit a5d947d
Showing 8 changed files with 41 additions and 138 deletions.
23 changes: 14 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ For more information about the implementation, methodological considerations, an
The Vamb package contains several programs, including three binners:
* __Vamb__: The original binner based on variational autoencoders. [Article](https://doi.org/10.1038/s41587-020-00777-4)
* __Avamb__: An ensemble model based on Vamb and adversarial autoencoders. [Article](https://doi.org/10.1038/s42003-023-05452-3).
Avamb produces better bins than Vamb, but is a more complex and computationally demanding pipeline.
Avamb produces somewhat better bins than Vamb, but is a more complex and computationally demanding pipeline.
See the [Avamb README page](https://github.com/RasmussenLab/avamb/tree/avamb_new/workflow_avamb) for more information.
* __TaxVamb__: A semi-supervised binner that uses taxonomy information from e.g. `mmseqs taxonomy`. [Article still in the works].
TaxVamb produces superior bins, but requires you to have run a taxonomic annotation workflow.
@@ -30,7 +30,7 @@ pip install vamb
:bangbang: An active Conda environment can hijack your system's linker, causing an error during installation. Either deactivate `conda`, or delete the `~/miniconda/compiler_compats` directory before installing with pip.

Alternatively, it can be installed as a [Bioconda package](https://anaconda.org/bioconda/vamb) (thanks to a contribution from Antônio Pedro Camargo).
The BioConda package does not include GPU support.
Currently, the Conda version is severely outdated, so we recommend installing with pip. Also, the BioConda package does not include GPU support.

```
conda install -c pytorch pytorch torchvision cudatoolkit=10.2
@@ -55,10 +55,10 @@ If you can't/don't want to use pip/Conda, you can do it the hard way: Install th

# Running Vamb
First, figure out what program you want to run:
* If you want a decent and simple binner, run `vamb bin default`
* If you want to bin and don't mind a more complex but performant workflow, run the [Avamb Snakemake workflow](https://github.com/RasmussenLab/avamb/tree/avamb_new/workflow_avamb)
* If you want to bin and are able to get taxonomic information, run `vamb bin taxvamb`
* If you want to refine existing taxonomic classification, run `vamb taxometer`
* If you want to bin and are able to get taxonomic information, run `vamb bin taxvamb`
* If you want to bin and don't mind a more complex but performant workflow, run the [Avamb Snakemake workflow](https://github.com/RasmussenLab/avamb/tree/avamb_new/workflow_avamb)
* If you want a decent and simple binner, run `vamb bin default`

For more command-line options, see the command-line help menu:
```
@@ -100,11 +100,16 @@ minimap2 -t 8 -N 5 -ax sr catalogue.mmi --split-prefix mmsplit /path/to/reads/sa
4. Run Vamb:

```
vamb bin default --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C
vamb bin basic --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C
```

5. Apply any desired postprocessing to Vamb's output.

## How to run: Using the Vamb Snakemake workflow
To make it even easier to run Vamb, we have created a [Snakemake](https://snakemake.readthedocs.io/en/stable/#) workflow.
This workflow runs steps 2-5 above, using `minimap2` to align and [CheckM](https://ecogenomics.github.io/CheckM/) to estimate completeness and contamination of the resulting bins.
The workflow can run on a local machine, a workstation, or an HPC system using `qsub`. It can be found in the `workflow` folder; see the file `workflow/README.md` for details.

# Detailed user instructions
See the tutorial in `doc/tutorial.md` for even more detailed instructions.

@@ -182,7 +187,7 @@ __5) Run Vamb__
By default, Vamb does not output any FASTA files of the bins. In the examples below, the option `--minfasta 200000` is set, meaning that all bins with a size of 200 kbp or more will be output as FASTA files.
Run Vamb with:

`vamb bin default -o SEP --outdir OUT --fasta FASTA --bamfiles BAM1 BAM2 [...] --minfasta 200000`,
`vamb bin basic -o SEP --outdir OUT --fasta FASTA --bamfiles BAM1 BAM2 [...] --minfasta 200000`,

where `SEP` is the separator chosen in step 3 (e.g. `C` in that example), `OUT` is the name of the output directory to create, `FASTA` is the path to the FASTA file, and `BAM1` is the path to the first BAM file. You can also use shell globbing to input multiple BAM files: `my_bamdir/*bam`.

@@ -197,8 +202,8 @@ Vamb will bin every input contig. Contigs that cannot be binned with other conti
The default hyperparameters of Vamb will provide good performance on any dataset. However, since running Vamb is fast (especially on GPUs), it is possible to run Vamb with different hyperparameters to see if better performance can be achieved (here we measure performance as the number of near-complete bins assessed by CheckM). We recommend trying to both increase and decrease the size of the neural network: we have used Vamb on datasets where a larger network resulted in more near-complete bins, and on other datasets where a smaller network did. To do this, you can run Vamb as follows (the default for multiple samples is `-l 32 -n 512 512`):

```
vamb bin default -l 24 -n 384 384 --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000
vamb bin default -l 40 -n 768 768 --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000
vamb bin basic -l 24 -n 384 384 --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000
vamb bin basic -l 40 -n 768 768 --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000
```

It is possible to try any combination of latent and hidden neurons, as well as other layer sizes. The number of near-complete bins can be assessed using CheckM and compared between runs. See the `workflow` folder for an automated way to run Vamb with multiple parameter settings.
10 changes: 7 additions & 3 deletions pyproject.toml
@@ -26,12 +26,14 @@ authors = [
url = "https://github.com/RasmussenLab/vamb"
description = "Variational and Adversarial autoencoders for Metagenomic Binning"
license = "MIT"
[tool.setuptools.dynamic]
version = {attr = "vamb.__version__"}
readme = {file = "README.md"}

[build-system]
requires = ["setuptools ~= 63.0", "Cython ~= 0.29.5"]
requires = [
"setuptools ~= 64.0",
"setuptools-scm >= 8.0",
"Cython ~= 0.29.5"
]
build-backend = "setuptools.build_meta"

[tool.ruff]
@@ -43,3 +45,5 @@ filterwarnings = [
"error",
"ignore::UserWarning",
]

[tool.setuptools_scm]
102 changes: 0 additions & 102 deletions test/ci.py

This file was deleted.

10 changes: 0 additions & 10 deletions test/test_parsecontigs.py
@@ -31,16 +31,6 @@ def setUp(self):
self.io.seek(0)
self.large_io.seek(0)

def test_only_ns(self):
file = io.BytesIO()
file.write(b">abc\n")
file.write(b"N" * 2500)
file.write(b"\n")
file.seek(0)

with self.assertRaises(ValueError):
Composition.from_file(file)

def test_unique_names(self):
with self.assertRaises(ValueError):
CompositionMetaData(
4 changes: 2 additions & 2 deletions vamb/__init__.py
@@ -2,8 +2,6 @@
Documentation: https://github.com/RasmussenLab/vamb/
"""

__version__ = (4, 1, 3)

from . import vambtools
from . import parsebam
from . import parsecontigs
@@ -15,8 +13,10 @@
from . import taxvamb_encode
from . import reclustering

from importlib.metadata import version as get_version
from loguru import logger

__version_str__ = get_version("vamb")
logger.remove()

__all__ = [
6 changes: 3 additions & 3 deletions vamb/__main__.py
@@ -1413,7 +1413,7 @@ def run(self):
)
logger.add(sys.stderr, format=format_log)
begintime = time.time()
logger.info("Starting Vamb version " + ".".join(map(str, vamb.__version__)))
logger.info("Starting Vamb version " + vamb.__version_str__)
logger.info("Random seed is " + str(self.vamb_options.seed))
self.run_inner()
logger.info(f"Completed Vamb in {round(time.time() - begintime, 2)} seconds.")
@@ -2071,7 +2071,7 @@ def add_reclustering_arguments(subparser):

def main():
doc = f"""
Version: {'.'.join([str(i) for i in vamb.__version__])}
Version: {vamb.__version_str__}
Default use, good for most datasets:
vamb bin default --outdir out --fasta my_contigs.fna --bamfiles *.bam -o C
@@ -2091,7 +2091,7 @@ def main():
helpos.add_argument(
"--version",
action="version",
version=f'Vamb {".".join(map(str, vamb.__version__))}',
version=f"Vamb {vamb.__version_str__}",
)

if len(sys.argv) == 1:
11 changes: 11 additions & 0 deletions vamb/encode.py
@@ -99,6 +99,17 @@ def make_dataloader(
"One or more samples have zero depth in all sequences, so cannot be depth normalized"
)
rpkm *= 1_000_000 / sample_depths_sum

zero_tnf = tnf.sum(axis=1) == 0
smallest_index = _np.argmax(zero_tnf)
if zero_tnf[smallest_index]:
raise ValueError(
f"TNF row at index {smallest_index} is all zeros. "
+ "This implies that the sequence contained no 4-mers of A, C, G, T or U, "
+ "making this sequence uninformative. This is probably a mistake. "
+ "Verify that the sequence contains usable information (e.g. is not all N's)"
)

total_abundance = rpkm.sum(axis=1)

# Normalize rpkm to sum to 1
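
The check added to `encode.py` uses a handy NumPy idiom: `np.argmax` on a boolean mask returns the index of the *first* `True` value, so one `argmax` plus one lookup finds the first all-zero row without a Python-level loop. Note the lookup is required because `argmax` returns 0 when the mask is all `False`. A standalone sketch of the same trick (the function name is illustrative, not from the Vamb codebase):

```python
import numpy as np

def first_zero_row(matrix: np.ndarray):
    """Return the index of the first all-zero row, or None if there is none.

    np.argmax on a boolean array returns the index of the first True;
    if the array is all False it returns 0, so the candidate index must
    be checked against the mask before being trusted."""
    zero_rows = matrix.sum(axis=1) == 0
    candidate = int(np.argmax(zero_rows))
    return candidate if zero_rows[candidate] else None

tnf = np.array([[1, 2], [3, 0], [0, 0], [5, 6]])
print(first_zero_row(tnf))              # → 2
print(first_zero_row(np.ones((3, 2))))  # → None
```

In the diff above, finding such a row is an error: an all-zero TNF row means the sequence contained no A/C/G/T/U 4-mers and is uninformative, so Vamb raises a `ValueError` naming the offending index.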
13 changes: 4 additions & 9 deletions vamb/parsecontigs.py
@@ -178,25 +178,20 @@ def from_file(
lengths = _vambtools.PushArray(_np.int32)
mask = bytearray() # we convert to Numpy at end
contignames: list[str] = list()
minimum_seen_length = 2_000_000_000

entries = _vambtools.byte_iterfasta(filehandle)

for entry in entries:
length = len(entry)
minimum_seen_length = min(minimum_seen_length, length)
skip = length < minlength
mask.append(not skip)

if skip:
continue

counts = entry.kmercounts(4)
if counts.sum() == 0:
raise ValueError(
f'TNF value of contig "{entry.header}" is all zeros. '
+ "This implies that the sequence contained no 4-mers of A, C, G, T or U, "
+ "making this sequence uninformative. This is probably a mistake. "
+ "Verify that the sequence contains usable information (e.g. is not all N's)"
)
raw.extend(counts)
raw.extend(entry.kmercounts(4))

if len(raw) > 256000:
Composition._convert(raw, projected)
