Encoder with group loss for global structure preservation
David Novak (1,2), Sofie Van Gassen (1,2), Yvan Saeys (1,2)
(1) Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Belgium
(2) Data Mining and Modeling for Biomedicine, Center for Inflammation Research, VIB-UGent, Belgium
This work has been accepted for presentation at BNAIC/BeNeLearn 2023.
Recent advances in dimensionality reduction have achieved more accurate lower-dimensional embeddings of high-dimensional data. In addition to visualisation purposes, these embeddings can be used for downstream processing, including batch effect normalisation, clustering, community detection or trajectory inference. We use the notion of structure preservation at both local and global levels to create a deep learning model, based on a variational autoencoder (VAE) and the stochastic quartet loss from the SQuadMDS algorithm. Our encoder model, called GroupEnc, uses a ‘group loss’ function to create embeddings with less global structure distortion than VAEs do, while keeping the model parametric and the architecture flexible. We validate our approach using publicly available biological single-cell transcriptomic datasets, employing RNX curves for evaluation.
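To make the idea of a group loss more concrete, here is a minimal NumPy sketch of a quartet-style loss. It assumes that distances within each randomly drawn group are normalised by their sum and then compared between the high-dimensional input and the embedding. The function names, the normalisation and the squared-error comparison are illustrative assumptions, not the GroupEnc implementation.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix for a small set of points (rows)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def group_loss(hd, ld, group_size=4, rng=None, eps=1e-12):
    """Illustrative group loss (hypothetical helper, not the package API).

    hd: (n, d_high) array of input points.
    ld: (n, d_low) array with the embedding of the same points.
    Points are shuffled and split into groups of `group_size`; within each
    group, pairwise distances are divided by their sum, so only relative
    distances matter, and the squared differences are accumulated.
    """
    rng = np.random.default_rng(rng)
    n = hd.shape[0]
    n_groups = n // group_size
    if n_groups == 0:
        raise ValueError("need at least `group_size` points")
    perm = rng.permutation(n)[: n_groups * group_size]
    groups = perm.reshape(n_groups, group_size)
    upper = np.triu_indices(group_size, k=1)  # each pair counted once

    total = 0.0
    for idx in groups:
        d_hd = pairwise_distances(hd[idx])[upper]
        d_ld = pairwise_distances(ld[idx])[upper]
        rel_hd = d_hd / (d_hd.sum() + eps)
        rel_ld = d_ld / (d_ld.sum() + eps)
        total += ((rel_hd - rel_ld) ** 2).sum()
    return total / n_groups
```

Because only relative distances enter the loss in this sketch, uniformly rescaling the embedding leaves it essentially unchanged, and groups of random points tend to span the whole dataset, which is why a loss of this kind targets global rather than local structure.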
GroupEnc is a Python package built on top of TensorFlow. We recommend creating a new Anaconda environment for GroupEnc.
On Linux or macOS, use the command line for installation. On Windows, use Anaconda Prompt.
conda create --name GroupEnc python=3.9 \
numpy pandas
Next, activate the new environment and install TensorFlow.
TensorFlow installation is platform-specific.
GPU acceleration, when available, is highly recommended.
conda activate GroupEnc
pip install tensorflow
pip install tensorflow-metal
pip install --upgrade git+https://github.com/saeyslab/GroupEnc.git
Consult this tutorial in case of problems.
conda activate GroupEnc
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
pip install "tensorflow<2.11"
pip install --upgrade git+https://github.com/saeyslab/GroupEnc.git
Consult this tutorial in case of problems.
conda activate GroupEnc
conda install -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
pip install --upgrade git+https://github.com/saeyslab/GroupEnc.git
Consult this tutorial in case of problems.
conda activate GroupEnc
pip install tensorflow
pip install --upgrade git+https://github.com/saeyslab/GroupEnc.git
Consult this tutorial in case of problems.
To verify correct installation of TensorFlow, activate the environment and run the following line:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
If a GPU is available and configured correctly, this should print a non-empty list.
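For a slightly more informative check, the same test can be wrapped in a small script. This is a sketch only: the function name gpu_report is made up for illustration, and it relies on nothing beyond tf.__version__ and tf.config.list_physical_devices.

```python
def gpu_report():
    """Return a one-line summary of the TensorFlow installation.

    Hypothetical helper for illustration; not part of GroupEnc.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return "TensorFlow is not installed in this environment."

    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        names = ", ".join(gpu.name for gpu in gpus)
        return f"TensorFlow {tf.__version__}: {len(gpus)} GPU(s) visible ({names})."
    return f"TensorFlow {tf.__version__}: no GPU visible, computation will run on CPU."


if __name__ == "__main__":
    print(gpu_report())
```

An empty GPU list does not necessarily mean that TensorFlow is broken; it only means that no GPU is visible to it.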
Unsupervised evaluation of dimensionality reduction can be done using RNX curves. This may require a large amount of RAM and may be intractable on large datasets or less powerful machines.
To compute RNX curves, install the nxcurve package:
pip install nxcurve
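To illustrate what an RNX curve measures and why memory use grows quickly, below is a minimal NumPy sketch of the standard definition: Q_NX(K) is the average overlap between K-nearest-neighbour sets in the input and in the embedding, and R_NX(K) = ((N - 1) Q_NX(K) - K) / (N - 1 - K) rescales it so that a random embedding scores about 0 and a perfect one scores 1. The function names are illustrative and this is not the nxcurve API. Note the N-by-N rank matrices, which are the source of the RAM requirement.

```python
import numpy as np


def rank_matrix(points):
    """ranks[i, j] = position of point j among the neighbours of point i.

    Each point has rank 0 with respect to itself. Uses O(N^2) memory.
    """
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, -np.inf)  # force self to be the nearest "neighbour"
    order = np.argsort(d2, axis=1, kind="stable")
    n = points.shape[0]
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return ranks


def rnx_curve(hd, ld):
    """Return (ks, r_nx) for neighbourhood sizes K = 1, ..., N - 2."""
    n = hd.shape[0]
    # Point j is in both K-neighbourhoods of i iff both of its ranks are <= K.
    worst = np.maximum(rank_matrix(hd), rank_matrix(ld))
    off_diag = worst[~np.eye(n, dtype=bool)]
    counts = np.bincount(off_diag, minlength=n)
    ks = np.arange(1, n - 1)
    overlap = np.cumsum(counts)[ks]  # number of pairs shared at each K
    q_nx = overlap / (ks * n)
    r_nx = ((n - 1) * q_nx - ks) / (n - 1 - ks)
    return ks, r_nx


def rnx_auc(ks, r_nx):
    """Area under the R_NX curve with K on a log scale (weights 1/K)."""
    weights = 1.0 / ks
    return float((r_nx * weights).sum() / weights.sum())
```

For example, calling rnx_curve(x, x) on any dataset x without duplicate points yields a curve that is 1 everywhere, since every neighbourhood is preserved exactly.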
To create an embedding using a VAE or GroupEnc model, use the ./scripts/embed.py script with specified arguments.
To evaluate an existing embedding, use the ./scripts/score.py script with specified arguments.
To run a benchmark on an HPC, use ./scripts/benchmark_embed.py and ./scripts/benchmark_score.py as starting points.
All scripts are documented; use the -h or --help flag to see usage.