This repository contains the data and scripts required to reproduce the results presented in our paper on benchmarking dimensionality reduction techniques applied to chemical datasets. The datasets used for dimensionality reduction, as well as the optimization results, are available on Zenodo.
- `src`: Contains the essential code for data preprocessing, dimensionality reduction, optimization, analysis, and visualization.
- `datasets`: Contains several datasets (ChEMBL subsets) used in the original study.
- `notebooks`: Includes Jupyter notebooks used for data analysis and visualization.
- `results`: Stores calculated metrics from the paper and some obtained embeddings for demonstration purposes (all embeddings are available on Zenodo).
- `scripts`: Includes master scripts for data preparation, running benchmarks, and analyzing results.
The `datasets` directory houses the chemical datasets used throughout the study.
The `results` directory includes the optimized low-dimensional embeddings and all associated metrics.
The `notebooks` directory contains Jupyter notebooks for data analysis, visualization, and further exploration of the study's findings.
The `src/cdr_bench` directory contains the components for dimensionality reduction benchmarking:

- `dr_methods/` – Wrapper classes for the different dimensionality reduction methods.
- `features/` – Code for feature extraction and processing.
- `io_utils/` – Utility code for input/output operations.
- `method_configs/` – Configuration files for the different dimensionality reduction methods.
- `optimization/` – Code for optimization routines.
- `scoring/` – Code for scoring and evaluating methods.
- `visualization/` – Code for visualizing benchmarking results.
The `scripts` directory contains the master scripts for data preparation, running benchmarks, and analyzing results:

- `run_optimization.py` – Main script for running optimization processes.
- `analyze_results.py` – Script for automated result analysis.
- `prepare_lolo.py` – Script for splitting datasets in leave-one-library-out (LOLO) mode.
PDM was used for dependency management. The required packages are listed in the `pdm.lock` file.
The Hierarchical Data Format (HDF5, `.h5`) is used to store the descriptor data and optimization results. Examples of how to read and write the hierarchical data structures can be found in `/notebooks/IO.ipynb`.
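Below is a minimal sketch of reading and writing such a hierarchical file with `h5py`. The group and dataset names (`descriptors/morgan`, `results/embedding`) are illustrative only; see `/notebooks/IO.ipynb` for the actual schema used in this repository.

```python
# Minimal h5py sketch; group/dataset names are illustrative, not the
# repository's actual schema (see /notebooks/IO.ipynb for that).
import h5py
import numpy as np

# Write: one group per kind of data, arrays stored as datasets.
with h5py.File("example.h5", "w") as f:
    f.create_group("descriptors").create_dataset(
        "morgan", data=np.random.randint(0, 2, size=(100, 2048), dtype=np.uint8)
    )
    f.create_group("results").create_dataset(
        "embedding", data=np.random.rand(100, 2)
    )

# Read: index by path, slice to load into memory as NumPy arrays.
with h5py.File("example.h5", "r") as f:
    descriptors = f["descriptors/morgan"][:]
    embedding = f["results/embedding"][:]
print(descriptors.shape, embedding.shape)  # (100, 2048) (100, 2)
```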
Morgan fingerprints and MACCS keys are computed with RDKit.
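As an illustration, both descriptor types can be generated in a few lines. This is a sketch using common default parameters (radius 2, 2048 bits for Morgan fingerprints), which are not necessarily the settings used in the paper.

```python
# Sketch of computing both descriptor types with RDKit; radius/nBits are
# common defaults, not necessarily the settings used in the paper.
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromSmiles("CCO")  # ethanol as a toy molecule

# 2048-bit Morgan fingerprint of radius 2 (ECFP4-like)
morgan_fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# 167-bit MACCS keys
maccs_fp = MACCSkeys.GenMACCSKeys(mol)

print(morgan_fp.GetNumOnBits(), maccs_fp.GetNumOnBits())
```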
The results for the generative topographic mapping (GTM) in the original publication were obtained using an in-house GTM implementation. In this repository, an open-source implementation of the GTM algorithm, `ugtm`, is included for comparison.
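A minimal usage sketch follows, assuming the scikit-learn-style `eGTM` transformer exposed by recent versions of `ugtm`; parameter names and defaults may differ between versions, so consult the ugtm documentation.

```python
# Hedged sketch of projecting data with ugtm's eGTM transformer; default
# constructor parameters are assumed -- check the ugtm documentation.
import numpy as np
from ugtm import eGTM

X = np.random.rand(200, 50)      # placeholder descriptor matrix

# Fit a GTM and project each point onto the 2D latent map.
embedding = eGTM().fit_transform(X)
print(embedding.shape)           # expected: (200, 2)
```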
If you use the code from this repository, please cite the following publication.