This repository contains the data and scripts required to reproduce the results presented in our paper on benchmarking dimensionality reduction techniques applied to chemical datasets. The datasets used for dimensionality reduction, as well as the optimization results, are available on Zenodo.
- `src`: Contains the essential code for data preprocessing, dimensionality reduction, optimization, analysis, and visualization.
- `datasets`: Contains several datasets (ChEMBL subsets) used in the original study.
- `notebooks`: Includes Jupyter notebooks used for data analysis and visualization.
- `results`: Stores calculated metrics from the paper and some obtained embeddings for demonstration purposes (all embeddings are available on Zenodo).
- `scripts`: Includes master scripts for data preparation, running benchmarks, and analyzing results.
The `datasets` directory houses the chemical datasets used throughout the study.
The `results` directory includes the optimized low-dimensional embeddings and all associated metrics.
The `notebooks` directory contains Jupyter notebooks for data analysis, visualization, and further exploration of the study's findings.
The `src/cdr_bench` directory contains the components for dimensionality reduction benchmarking:

- `dr_methods/` – Wrapper classes for the different dimensionality reduction methods.
- `features/` – Code for feature extraction and processing.
- `io_utils/` – Utility code for input/output operations.
- `method_configs/` – Configuration files for the different dimensionality reduction methods.
- `optimization/` – Code for optimization routines.
- `scoring/` – Code for scoring and evaluating methods.
- `visualization/` – Code for visualizing benchmarking results.
The `scripts` directory contains the master scripts for data preparation, running benchmarks, and analyzing results:

- `run_optimization.py` – Main script for running optimization processes.
- `analyze_results.py` – Script for automated result analysis.
- `prepare_lolo.py` – Script for splitting datasets in leave-one-library-out (LOLO) mode.
PDM was used for dependency management. The required packages are listed in the `pdm.lock` file.
The Hierarchical Data Format (HDF5, `.h5`) is used to store the descriptor data and optimization results. Examples of how to read and write the hierarchical data structures can be found in `/notebooks/IO.ipynb`.
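Below is a minimal sketch of reading and writing such a hierarchical file with `h5py`. The group and dataset names (`descriptors/morgan`, `results/embedding`) are illustrative only; see `/notebooks/IO.ipynb` for the actual schema used in this repository.

```python
# Minimal h5py sketch; group/dataset names are illustrative, not the
# repository's actual schema (see /notebooks/IO.ipynb for that).
import h5py
import numpy as np

# Write: one group per kind of data, arrays stored as datasets.
with h5py.File("example.h5", "w") as f:
    f.create_group("descriptors").create_dataset(
        "morgan", data=np.random.randint(0, 2, size=(100, 2048), dtype=np.uint8)
    )
    f.create_group("results").create_dataset(
        "embedding", data=np.random.rand(100, 2)
    )

# Read: index by path, slice to load into memory as NumPy arrays.
with h5py.File("example.h5", "r") as f:
    descriptors = f["descriptors/morgan"][:]
    embedding = f["results/embedding"][:]
print(descriptors.shape, embedding.shape)  # (100, 2048) (100, 2)
```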
Morgan fingerprints and MACCS keys are computed with RDKit.
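As an illustration, both descriptor types can be generated in a few lines. This is a sketch using common default parameters (radius 2, 2048 bits for Morgan fingerprints), which are not necessarily the settings used in the paper.

```python
# Sketch of computing both descriptor types with RDKit; radius/nBits are
# common defaults, not necessarily the settings used in the paper.
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromSmiles("CCO")  # ethanol as a toy molecule

# 2048-bit Morgan fingerprint of radius 2 (ECFP4-like)
morgan_fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# 167-bit MACCS keys
maccs_fp = MACCSkeys.GenMACCSKeys(mol)

print(morgan_fp.GetNumOnBits(), maccs_fp.GetNumOnBits())
```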
The results for the generative topographic mapping (GTM) in the original publication were obtained using an in-house GTM implementation. In this repository, an open-source implementation of the GTM algorithm, `ugtm`, is included for comparison.
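A minimal usage sketch follows, assuming the scikit-learn-style `eGTM` transformer exposed by recent versions of `ugtm`; parameter names and defaults may differ between versions, so consult the ugtm documentation.

```python
# Hedged sketch of projecting data with ugtm's eGTM transformer; default
# constructor parameters are assumed -- check the ugtm documentation.
import numpy as np
from ugtm import eGTM

X = np.random.rand(200, 50)      # placeholder descriptor matrix

# Fit a GTM and project each point onto the 2D latent map.
embedding = eGTM().fit_transform(X)
print(embedding.shape)           # expected: (200, 2)
```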
If you use the code from this repository, please cite the following publication.