The pipeline implements a specialised Variational Autoencoder (VAE) for genomic sequence analysis, building upon the architecture described in Weber (2024) [1] for contamination detection in long-read sequencing data. While maintaining the core VAE principles from Weber's work, this implementation includes several modifications for improved scalability and performance, including dynamic layer sizing, memory-efficient processing, and enhanced visualisation capabilities through Uniform Manifold Approximation and Projection (UMAP) integration.
- Dynamic architecture that adapts to input data dimensions
- Latent-space clustering that tends to reflect taxonomic relationships, aiding downstream analysis
- Memory-efficient processing for large genomic datasets
- Integration with UMAP for improved visualisation
- Comprehensive quality monitoring and validation
- Accepts normalised k-mer frequency vectors (typically 16,384 dimensions for 7-mers)
- Implements Term Frequency-Inverse Document Frequency (TF-IDF) transformation to emphasise distinctive k-mer patterns
- Includes batch normalisation [3] for stable training
- Uses length normalisation to account for varying sequence sizes
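As a concrete illustration of the input preparation above, the following Python sketch builds length-normalised 7-mer frequency vectors and applies a TF-IDF transform with scikit-learn. It is a minimal stand-in, not the pipeline's actual preprocessing code; the helper name `kmer_frequencies` is illustrative.

```python
from itertools import product

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

K = 7
# All 4**7 = 16,384 canonical 7-mers, mapped to vector positions.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def kmer_frequencies(seq: str, k: int = K) -> np.ndarray:
    """Length-normalised k-mer frequency vector for one sequence."""
    counts = np.zeros(4 ** k)
    for i in range(len(seq) - k + 1):
        idx = KMER_INDEX.get(seq[i:i + k].upper())
        if idx is not None:  # skip windows containing ambiguous bases (e.g. N)
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts  # length normalisation

sequences = ["ACGTACGTACGTACGT", "GGGGCCCCAAAATTTT"]  # toy examples
freq_matrix = np.vstack([kmer_frequencies(s) for s in sequences])

# TF-IDF down-weights k-mers shared by all sequences and emphasises
# distinctive composition patterns.
tfidf_matrix = TfidfTransformer().fit_transform(freq_matrix)
```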
- Dynamic sizing: intermediate layers automatically scale based on input dimensions
- Three main dense layers with decreasing dimensions (e.g., 2048 → 512 → 128)
- Residual connections to preserve important sequence features
- Regularisation through dropout [4] (10%) and L1 regularisation to prevent overfitting
- Batch normalisation after each major layer
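A minimal Keras sketch of the encoder pattern described above, assuming a TensorFlow backend. The layer widths, regularisation strength, and the dense projection used for the residual connections are illustrative choices rather than the pipeline's exact configuration; `build_encoder` is a hypothetical helper.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_encoder(input_dim: int, latent_dim: int = 32) -> tf.keras.Model:
    # Intermediate widths scale with the input dimensionality
    # (e.g. 16,384 -> 2048 -> 512 -> 128).
    dims = [input_dim // 8, input_dim // 32, input_dim // 128]
    inputs = layers.Input(shape=(input_dim,))
    x = inputs
    for d in dims:
        h = layers.Dense(d, activation="relu",
                         kernel_regularizer=regularizers.l1(1e-5))(x)
        h = layers.BatchNormalization()(h)
        h = layers.Dropout(0.1)(h)  # 10% dropout [4]
        shortcut = layers.Dense(d, use_bias=False)(x)  # project for the residual add
        x = layers.Add()([h, shortcut])
    z_mean = layers.Dense(latent_dim, name="z_mean")(x)
    z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
    return tf.keras.Model(inputs, [z_mean, z_log_var], name="encoder")
```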
- Default 32 dimensions (adjustable via the `n_components` parameter)
- Uses the reparameterisation trick [5] with learned mean and variance
- Implements Kullback-Leibler (KL) divergence warm-up to prevent posterior collapse
- The learned latent space is likely to reflect taxonomic relationships between sequences, aiding downstream analysis
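Both mechanisms above fit in a few lines, again assuming TensorFlow. The function names are illustrative, and the linear ramp in `kl_weight` is one common warm-up scheme rather than the pipeline's exact schedule.

```python
import tensorflow as tf

def sample_z(z_mean: tf.Tensor, z_log_var: tf.Tensor) -> tf.Tensor:
    """Reparameterisation trick [5]: z = mu + sigma * eps keeps sampling differentiable."""
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

def kl_weight(epoch: int, warmup_epochs: int = 10) -> float:
    """Linear KL warm-up: beta ramps from 0 to 1 to discourage posterior collapse."""
    return min(1.0, epoch / warmup_epochs)
```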
- Mirror of encoder architecture with transposed dimensions
- Additional regularisation to ensure non-negative outputs
- Residual connections matching encoder structure
- Final sigmoid activation to reconstruct normalised k-mer frequencies in [0, 1]
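A matching decoder sketch, mirroring the encoder widths in reverse and ending in a sigmoid so reconstructed frequencies stay in [0, 1]. As before, this is illustrative Keras code, not the pipeline's implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_decoder(latent_dim: int, output_dim: int,
                  dims: list[int]) -> tf.keras.Model:
    inputs = layers.Input(shape=(latent_dim,))
    x = inputs
    for d in reversed(dims):  # transposed mirror of the encoder widths
        h = layers.Dense(d, activation="relu")(x)
        h = layers.BatchNormalization()(h)
        shortcut = layers.Dense(d, use_bias=False)(x)  # residual, as in the encoder
        x = layers.Add()([h, shortcut])
    # Sigmoid keeps reconstructed k-mer frequencies non-negative and in [0, 1].
    outputs = layers.Dense(output_dim, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs, name="decoder")
```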
The autoencoder supports multiple activation functions for its dense layers:
- ReLU (Rectified Linear Unit): A widely used default activation, effective in preventing vanishing gradients
- SeLU (Scaled Exponential Linear Unit): Self-normalising, suitable for deeper networks
- Tanh: Maps inputs to [-1,1], useful for symmetric distributions
- Sigmoid: Compresses outputs between [0,1], potentially limiting gradient updates
- Linear: No transformation, mainly used for debugging
Empirical testing has shown that ReLU is the recommended default choice, with SeLU also performing well. Running all activation modes is generally unnecessary; ReLU is sufficient for most use cases.
- Non-negativity constraints suited for k-mer counts
- Dynamic architecture sizing based on input k-mer dimensionality
- Standard mean squared error reconstruction loss with KL divergence weighted by adaptive β scheduling (sketched after this list)
- Batch processing to handle large genome datasets
- Memory requirements scale with dataset size (default 4GB, user-adjustable)
- Modified KL divergence scheduling for stable training
- Custom early stopping based on validation loss
- Integration with taxonomic analysis workflows
- Preservation of important genomic signals
- Robust handling of varying sequence lengths
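For reference, a hedged sketch of the weighted objective mentioned in the list above: mean squared error reconstruction plus a β-weighted closed-form KL term, with β supplied by a warm-up schedule such as the one sketched earlier. `vae_loss` is an illustrative helper.

```python
import tensorflow as tf

def vae_loss(x, x_recon, z_mean, z_log_var, beta: float) -> tf.Tensor:
    # Mean squared error over the reconstructed k-mer frequency vector.
    recon = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_recon), axis=-1))
    # Closed-form KL divergence between N(mu, sigma^2) and the N(0, I) prior.
    kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
    return recon + beta * kl  # beta follows the adaptive warm-up schedule
```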
- Uses an adaptive learning rate strategy with the Adam optimiser [6] (see the training sketch after this list)
- Implements early stopping to prevent overfitting
- Monitors reconstruction quality and latent space structure
- Automatically adjusts KL divergence weight
- Latent space dimensions adjustable based on dataset complexity
- Batch size and epoch count adapt to dataset size
- Automatic prevention of common VAE issues like posterior collapse
- Comprehensive monitoring tools for performance tracking
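These behaviours map naturally onto standard Keras callbacks. The sketch below is illustrative: `vae` stands for the assembled model (assumed to attach its loss terms internally via `add_loss`, a common Keras VAE pattern), `X_train` for the TF-IDF-transformed k-mer matrix, and the patience, factor, and learning-rate values are plausible defaults rather than the pipeline's actual settings.

```python
import tensorflow as tf

callbacks = [
    # Stop when validation loss stalls, keeping the best weights.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    # Adaptive learning rate: halve it when validation loss plateaus.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=5, min_lr=1e-6),
]
vae.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))  # Adam [6]
vae.fit(X_train, epochs=100, batch_size=256,
        validation_split=0.1, callbacks=callbacks)
```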
Four key training metrics:
- Total Loss: Combined reconstruction and KL divergence loss
- Reconstruction Loss: Measures accuracy of sequence reconstruction
- KL Divergence Loss: Measures how far the latent distribution deviates from the standard normal prior
- Learning Rate: Shows adaptive learning rate behaviour
Generated for both raw VAE embeddings and UMAP transformations:
- Main scatter plot of the 2D embedding
- Density view showing point concentrations
- Distribution plots for first two components
Detailed per-epoch metrics including:
- Total loss
- Reconstruction loss
- KL divergence loss
- Validation metrics
- Learning rate
Files containing embedding quality measures:
- Trustworthiness: Measures how well local neighbourhoods are preserved in the embedding
  - Values range from 0 to 1
  - Values >0.9: Excellent preservation of local structure
  - Values 0.7-0.9: Good preservation, suitable for most analyses
  - Values <0.7: Poor preservation; consider adjusting parameters or using a different method
- Continuity: Measures whether points that are close in the original space remain close in the embedding
  - Values range from 0 to 1
  - Values >0.9: Strong preservation of original relationships
  - Values 0.7-0.9: Acceptable preservation of relationships
  - Values <0.7: Significant distortion; may not be reliable for analysis
- Both metrics should ideally be >0.7 for reliable embeddings
- If either metric is consistently low, consider:
  - Adjusting the number of components
  - Using a different activation function
  - Trying a different dimensionality reduction method
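Trustworthiness is available directly in scikit-learn, and continuity can be approximated by swapping the roles of the two spaces. The random arrays below are stand-ins for real data.

```python
import numpy as np
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.random((200, 64))      # stand-in for the high-dimensional k-mer vectors
X_emb = rng.random((200, 2))   # stand-in for the low-dimensional embedding

tw = trustworthiness(X, X_emb, n_neighbors=5)
# Continuity: same measure with original and embedded spaces swapped.
cont = trustworthiness(X_emb, X, n_neighbors=5)
if min(tw, cont) < 0.7:
    print("Low embedding quality: consider adjusting parameters")
```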
- `autoencoder_[activation mode]_raw_coordinates.csv`: Raw embeddings from the autoencoder; the file name reflects the activation mode used
- `kmers_dim_reduction_embeddings.csv`: Final embeddings combining the VAE latent space with a UMAP projection, optimised for species separation. UMAP optimisation runs by default, unless `--skip_n_neighbors_optimisation` is set.
- Processes large genomic datasets through batch processing
- Memory usage increases with dataset size and k-mer length
- Default memory limit of 4GB (adjustable via configuration)
- Automatically scales intermediate layer dimensions
- Uses data normalisation strategies optimised for k-mer frequency data
- Maintains numerical stability through appropriate regularisation
- For datasets with <100 sequences, the number of autoencoder components is automatically reduced to `sample_count - 1` (see the sketch after this list)
- Warning: Large datasets may require significant memory resources
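One plausible reading of the small-dataset rule above, as a sketch; `effective_components` is a hypothetical helper, not a function in the pipeline.

```python
def effective_components(requested: int, n_samples: int) -> int:
    """Cap the latent dimensionality so it never exceeds n_samples - 1."""
    return min(requested, max(1, n_samples - 1))

print(effective_components(32, 20))  # -> 19 for a 20-sequence dataset
```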
- Combines VAE's dimensionality reduction with UMAP's visualisation
- Preserves both local and global structure
- Enables interactive exploration of sequence relationships
- Provides options for different visualisation schemes
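Projecting the latent space to two dimensions with the `umap-learn` package looks roughly like this; the parameter values shown are common umap-learn defaults, not necessarily those chosen by the pipeline's optimisation step.

```python
import numpy as np
import umap  # umap-learn package

latent = np.random.rand(500, 32)  # stand-in for the 32-D VAE latent means
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
embedding_2d = reducer.fit_transform(latent)  # 2-D coordinates for plotting
```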
While traditional GC-coverage plots provide a basic view of sequence composition, k-mer-based embeddings often reveal more nuanced taxonomic relationships. The first principal component (PC1) frequently correlates with GC content, but additional components and non-linear transformations can expose more subtle patterns in sequence composition that aid in taxonomic classification.
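The PC1/GC relationship is easy to check empirically. The sketch below reuses `kmer_frequencies` and the toy `sequences` from the earlier preprocessing sketch; with real assemblies the correlation between PC1 and GC content is typically strong.

```python
import numpy as np
from sklearn.decomposition import PCA

def gc_content(seq: str) -> float:
    s = seq.upper()
    return (s.count("G") + s.count("C")) / max(len(s), 1)

# freq_matrix and sequences come from the k-mer sketch earlier in this section.
pc1 = PCA(n_components=1).fit_transform(freq_matrix)[:, 0]
gc = np.array([gc_content(s) for s in sequences])
print(np.corrcoef(pc1, gc)[0, 1])  # |r| is often high: PC1 tracks GC content
```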
- Separating sequences from different organisms in mixed samples
- Detecting contamination in sequencing data
- Exploring relationships between sequences
- Identifying unusual or outlier sequences
- Very similar sequences might not be completely separated
- Processing time increases with dataset size and k-mer length
- Memory requirements scale with input dimensions
- Training time can be significant for large datasets
1. Weber CC. "Disentangling cobionts and contamination in long-read genomic data using sequence composition." G3 Genes|Genomes|Genetics, 2024.
2. Hinton GE, Salakhutdinov RR. "Reducing the Dimensionality of Data with Neural Networks." Science, 2006.
3. Ioffe S, Szegedy C. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." arXiv:1502.03167, 2015.
4. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research, 2014.
5. Kingma DP, Welling M. "Auto-Encoding Variational Bayes." arXiv:1312.6114, 2013.
6. Kingma DP, Ba J. "Adam: A Method for Stochastic Optimization." arXiv:1412.6980, 2014.