The pipeline implements a specialised Variational Autoencoder (VAE) for genomic sequence analysis, building upon the architecture described in Weber (2024) [1] for contamination detection in long-read sequencing data. While maintaining the core VAE principles from Weber's work, this implementation includes several modifications for improved scalability and performance, including dynamic layer sizing, memory-efficient processing, and enhanced visualisation capabilities through Uniform Manifold Approximation and Projection (UMAP) integration.
- Dynamic architecture that adapts to input data dimensions
- Latent-space clustering that tends to reflect taxonomic relationships, aiding downstream analysis
- Memory-efficient processing for large genomic datasets
- Integration with UMAP for improved visualisation
- Comprehensive quality monitoring and validation
- Accepts normalised k-mer frequency vectors (typically 16,384 dimensions for 7-mers)
- Implements Term Frequency-Inverse Document Frequency (TF-IDF) transformation to emphasise distinctive k-mer patterns
- Includes batch normalisation [3] for stable training
- Uses length normalisation to account for varying sequence sizes
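As a concrete illustration of the input preparation above, the following Python sketch builds length-normalised 7-mer frequency vectors and applies a TF-IDF transform with scikit-learn. It is a minimal stand-in, not the pipeline's actual preprocessing code; the helper name `kmer_frequencies` is illustrative.

```python
from itertools import product

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

K = 7
# All 4**7 = 16,384 canonical 7-mers, mapped to vector positions.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def kmer_frequencies(seq: str, k: int = K) -> np.ndarray:
    """Length-normalised k-mer frequency vector for one sequence."""
    counts = np.zeros(4 ** k)
    for i in range(len(seq) - k + 1):
        idx = KMER_INDEX.get(seq[i:i + k].upper())
        if idx is not None:  # skip windows containing ambiguous bases (e.g. N)
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts  # length normalisation

sequences = ["ACGTACGTACGTACGT", "GGGGCCCCAAAATTTT"]  # toy examples
freq_matrix = np.vstack([kmer_frequencies(s) for s in sequences])

# TF-IDF down-weights k-mers shared by all sequences and emphasises
# distinctive composition patterns.
tfidf_matrix = TfidfTransformer().fit_transform(freq_matrix)
```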
- Dynamic sizing: intermediate layers automatically scale based on input dimensions
- Three main dense layers with decreasing dimensions (e.g., 2048 → 512 → 128)
- Residual connections to preserve important sequence features
- Regularisation through dropout [4] (10%) and L1 regularisation to prevent overfitting
- Batch normalisation after each major layer
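A minimal Keras sketch of the encoder pattern described above, assuming a TensorFlow backend. The layer widths, regularisation strength, and the dense projection used for the residual connections are illustrative choices rather than the pipeline's exact configuration; `build_encoder` is a hypothetical helper.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_encoder(input_dim: int, latent_dim: int = 32) -> tf.keras.Model:
    # Intermediate widths scale with the input dimensionality
    # (e.g. 16,384 -> 2048 -> 512 -> 128).
    dims = [input_dim // 8, input_dim // 32, input_dim // 128]
    inputs = layers.Input(shape=(input_dim,))
    x = inputs
    for d in dims:
        h = layers.Dense(d, activation="relu",
                         kernel_regularizer=regularizers.l1(1e-5))(x)
        h = layers.BatchNormalization()(h)
        h = layers.Dropout(0.1)(h)  # 10% dropout [4]
        shortcut = layers.Dense(d, use_bias=False)(x)  # project for the residual add
        x = layers.Add()([h, shortcut])
    z_mean = layers.Dense(latent_dim, name="z_mean")(x)
    z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
    return tf.keras.Model(inputs, [z_mean, z_log_var], name="encoder")
```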
- Default 32 dimensions (adjustable via the `n_components` parameter)
- Uses the reparameterisation trick [5] with learned mean and variance
- Implements Kullback-Leibler (KL) divergence warm-up to prevent posterior collapse
- The learned latent space is likely to reflect taxonomic relationships between sequences, aiding downstream analysis
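Both mechanisms above fit in a few lines, again assuming TensorFlow. The function names are illustrative, and the linear ramp in `kl_weight` is one common warm-up scheme rather than the pipeline's exact schedule.

```python
import tensorflow as tf

def sample_z(z_mean: tf.Tensor, z_log_var: tf.Tensor) -> tf.Tensor:
    """Reparameterisation trick [5]: z = mu + sigma * eps keeps sampling differentiable."""
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

def kl_weight(epoch: int, warmup_epochs: int = 10) -> float:
    """Linear KL warm-up: beta ramps from 0 to 1 to discourage posterior collapse."""
    return min(1.0, epoch / warmup_epochs)
```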
- Mirror of encoder architecture with transposed dimensions
- Additional regularisation to ensure non-negative outputs
- Residual connections matching encoder structure
- Final sigmoid activation to reconstruct normalised k-mer frequencies in [0, 1]
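A matching decoder sketch, mirroring the encoder widths in reverse and ending in a sigmoid so reconstructed frequencies stay in [0, 1]. As before, this is illustrative Keras code, not the pipeline's implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_decoder(latent_dim: int, output_dim: int,
                  dims: list[int]) -> tf.keras.Model:
    inputs = layers.Input(shape=(latent_dim,))
    x = inputs
    for d in reversed(dims):  # transposed mirror of the encoder widths
        h = layers.Dense(d, activation="relu")(x)
        h = layers.BatchNormalization()(h)
        shortcut = layers.Dense(d, use_bias=False)(x)  # residual, as in the encoder
        x = layers.Add()([h, shortcut])
    # Sigmoid keeps reconstructed k-mer frequencies non-negative and in [0, 1].
    outputs = layers.Dense(output_dim, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs, name="decoder")
```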
The autoencoder supports multiple activation functions for its dense layers:
- ReLU (Rectified Linear Unit): A widely used default activation, effective in preventing vanishing gradients
- SeLU (Scaled Exponential Linear Unit): Self-normalising, suitable for deeper networks
- Tanh: Maps inputs to [-1,1], useful for symmetric distributions
- Sigmoid: Compresses outputs between [0,1], potentially limiting gradient updates
- Linear: No transformation, mainly used for debugging
Empirical testing has shown that ReLU is the recommended default choice, with SeLU also performing well. Running all activation modes is generally unnecessary; ReLU is sufficient for most use cases.
- Non-negativity constraints suited for k-mer counts
- Dynamic architecture sizing based on input k-mer dimensionality
- Standard mean squared error reconstruction loss with KL divergence weighted by adaptive β scheduling (sketched after this list)
- Batch processing to handle large genome datasets
- Memory requirements scale with dataset size (default 4GB, user-adjustable)
- Modified KL divergence scheduling for stable training
- Custom early stopping based on validation loss
- Integration with taxonomic analysis workflows
- Preservation of important genomic signals
- Robust handling of varying sequence lengths
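For reference, a hedged sketch of the weighted objective mentioned in the list above: mean squared error reconstruction plus a β-weighted closed-form KL term, with β supplied by a warm-up schedule such as the one sketched earlier. `vae_loss` is an illustrative helper.

```python
import tensorflow as tf

def vae_loss(x, x_recon, z_mean, z_log_var, beta: float) -> tf.Tensor:
    # Mean squared error over the reconstructed k-mer frequency vector.
    recon = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_recon), axis=-1))
    # Closed-form KL divergence between N(mu, sigma^2) and the N(0, I) prior.
    kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
    return recon + beta * kl  # beta follows the adaptive warm-up schedule
```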
- Uses an adaptive learning rate strategy with the Adam optimiser [6] (see the training sketch after this list)
- Implements early stopping to prevent overfitting
- Monitors reconstruction quality and latent space structure
- Automatically adjusts KL divergence weight
- Latent space dimensions adjustable based on dataset complexity
- Batch size and epoch count adapt to dataset size
- Automatic prevention of common VAE issues like posterior collapse
- Comprehensive monitoring tools for performance tracking
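These behaviours map naturally onto standard Keras callbacks. The sketch below is illustrative: `vae` stands for the assembled model (assumed to attach its loss terms internally via `add_loss`, a common Keras VAE pattern), `X_train` for the TF-IDF-transformed k-mer matrix, and the patience, factor, and learning-rate values are plausible defaults rather than the pipeline's actual settings.

```python
import tensorflow as tf

callbacks = [
    # Stop when validation loss stalls, keeping the best weights.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    # Adaptive learning rate: halve it when validation loss plateaus.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=5, min_lr=1e-6),
]
vae.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))  # Adam [6]
vae.fit(X_train, epochs=100, batch_size=256,
        validation_split=0.1, callbacks=callbacks)
```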
Four key training metrics:
- Total Loss: Combined reconstruction and KL divergence loss
- Reconstruction Loss: Measures accuracy of sequence reconstruction
- KL Divergence Loss: Measures how far the latent distribution deviates from the standard normal prior
- Learning Rate: Shows adaptive learning rate behaviour
Generated for both raw VAE embeddings and UMAP transformations:
- Main scatter plot of the 2D embedding
- Density view showing point concentrations
- Distribution plots for first two components
Detailed per-epoch metrics including:
- Total loss
- Reconstruction loss
- KL divergence loss
- Validation metrics
- Learning rate
Files containing embedding quality measures:
- Trustworthiness: Measures how well local neighbourhoods are preserved in the embedding
  - Values range from 0 to 1
  - Values >0.9: Excellent preservation of local structure
  - Values 0.7-0.9: Good preservation, suitable for most analyses
  - Values <0.7: Poor preservation; consider adjusting parameters or using a different method
- Continuity: Measures whether points that are close in the original space remain close in the embedding
  - Values range from 0 to 1
  - Values >0.9: Strong preservation of original relationships
  - Values 0.7-0.9: Acceptable preservation of relationships
  - Values <0.7: Significant distortion; may not be reliable for analysis
- Both metrics should ideally be >0.7 for reliable embeddings
- If either metric is consistently low, consider:
  - Adjusting the number of components
  - Using a different activation function
  - Trying a different dimensionality reduction method
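Trustworthiness is available directly in scikit-learn, and continuity can be approximated by swapping the roles of the two spaces. The random arrays below are stand-ins for real data.

```python
import numpy as np
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.random((200, 64))      # stand-in for the high-dimensional k-mer vectors
X_emb = rng.random((200, 2))   # stand-in for the low-dimensional embedding

tw = trustworthiness(X, X_emb, n_neighbors=5)
# Continuity: same measure with original and embedded spaces swapped.
cont = trustworthiness(X_emb, X, n_neighbors=5)
if min(tw, cont) < 0.7:
    print("Low embedding quality: consider adjusting parameters")
```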
- `autoencoder_[activation mode]_raw_coordinates.csv`: Raw embeddings from the autoencoder; the file name reflects the activation mode used
- `kmers_dim_reduction_embeddings.csv`: Final embeddings combining the VAE latent space with a UMAP projection, optimised for species separation. UMAP optimisation runs by default, unless `--skip_n_neighbors_optimisation` is set.
- Processes large genomic datasets through batch processing
- Memory usage increases with dataset size and k-mer length
- Default memory limit of 4GB (adjustable via configuration)
- Automatically scales intermediate layer dimensions
- Uses data normalisation strategies optimised for k-mer frequency data
- Maintains numerical stability through appropriate regularisation
- For datasets with <100 sequences, the number of autoencoder components is automatically reduced to `sample_count - 1` (see the sketch after this list)
- Warning: Large datasets may require significant memory resources
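One plausible reading of the small-dataset rule above, as a sketch; `effective_components` is a hypothetical helper, not a function in the pipeline.

```python
def effective_components(requested: int, n_samples: int) -> int:
    """Cap the latent dimensionality so it never exceeds n_samples - 1."""
    return min(requested, max(1, n_samples - 1))

print(effective_components(32, 20))  # -> 19 for a 20-sequence dataset
```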
- Combines VAE's dimensionality reduction with UMAP's visualisation
- Preserves both local and global structure
- Enables interactive exploration of sequence relationships
- Provides options for different visualisation schemes
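Projecting the latent space to two dimensions with the `umap-learn` package looks roughly like this; the parameter values shown are common umap-learn defaults, not necessarily those chosen by the pipeline's optimisation step.

```python
import numpy as np
import umap  # umap-learn package

latent = np.random.rand(500, 32)  # stand-in for the 32-D VAE latent means
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
embedding_2d = reducer.fit_transform(latent)  # 2-D coordinates for plotting
```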
While traditional GC-coverage plots provide a basic view of sequence composition, k-mer-based embeddings often reveal more nuanced taxonomic relationships. The first principal component (PC1) frequently correlates with GC content, but additional components and non-linear transformations can expose more subtle patterns in sequence composition that aid in taxonomic classification.
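The PC1/GC relationship is easy to check empirically. The sketch below reuses `kmer_frequencies` and the toy `sequences` from the earlier preprocessing sketch; with real assemblies the correlation between PC1 and GC content is typically strong.

```python
import numpy as np
from sklearn.decomposition import PCA

def gc_content(seq: str) -> float:
    s = seq.upper()
    return (s.count("G") + s.count("C")) / max(len(s), 1)

# freq_matrix and sequences come from the k-mer sketch earlier in this section.
pc1 = PCA(n_components=1).fit_transform(freq_matrix)[:, 0]
gc = np.array([gc_content(s) for s in sequences])
print(np.corrcoef(pc1, gc)[0, 1])  # |r| is often high: PC1 tracks GC content
```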
- Separating sequences from different organisms in mixed samples
- Detecting contamination in sequencing data
- Exploring relationships between sequences
- Identifying unusual or outlier sequences
- Very similar sequences might not be completely separated
- Processing time increases with dataset size and k-mer length
- Memory requirements scale with input dimensions
- Training time can be significant for large datasets
1. Weber CC. "Disentangling cobionts and contamination in long-read genomic data using sequence composition." G3 Genes|Genomes|Genetics, 2024.
2. Hinton GE, Salakhutdinov RR. "Reducing the Dimensionality of Data with Neural Networks." Science, 2006.
3. Ioffe S, Szegedy C. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." arXiv:1502.03167, 2015.
4. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research, 2014.
5. Kingma DP, Welling M. "Auto-Encoding Variational Bayes." arXiv:1312.6114, 2013.
6. Kingma DP, Ba J. "Adam: A Method for Stochastic Optimization." arXiv:1412.6980, 2014.