
Benchmarking Dimensionality Reduction Techniques on Chemical Datasets

Introduction

This repository contains the data and scripts required to reproduce the results presented in our paper on benchmarking dimensionality reduction techniques applied to chemical datasets. The datasets used for dimensionality reduction, together with the optimization results, are available on Zenodo.

Repository Structure

  • src: Contains the essential code for data preprocessing, dimensionality reduction, optimization, analysis, and visualization.
  • datasets: Contains several datasets (ChEMBL subsets) used in the original study.
  • notebooks: Includes Jupyter notebooks used for data analysis and visualization.
  • results: Stores calculated metrics from the paper and some obtained embeddings for demonstration purposes (all embeddings are available on Zenodo).
  • scripts: Includes master scripts for data preparation, running benchmarks, and analyzing results.

Datasets

The datasets directory houses the chemical datasets used throughout the study.

Results

The results directory includes the optimized low-dimensional embeddings and all associated metrics.

Notebooks

The notebooks directory contains Jupyter notebooks for data analysis, visualization, and further exploration of the study's findings.

Code

Core code

The src/cdr_bench directory contains various components for dimensionality reduction benchmarking:

  • dr_methods/ – Contains wrapper classes for the different dimensionality reduction methods.
  • features/ – Contains code for feature extraction and processing.
  • io_utils/ – Utility code for input/output operations.
  • method_configs/ – Configuration files for different dimensionality reduction methods.
  • optimization/ – Code for optimization routines.
  • scoring/ – Contains code for scoring and evaluating methods.
  • visualization/ – Code for visualizing benchmarking results.

Scripts

The scripts directory contains the master scripts for data preparation, running benchmarks, and analyzing results:

  • run_optimization.py – Main script for running optimization processes.
  • analyze_results.py – Script for automated result analysis.
  • prepare_lolo.py – Script for splitting datasets in leave-one-library-out (LOLO) mode.
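The leave-one-library-out idea behind prepare_lolo.py can be illustrated with a minimal, self-contained sketch (plain Python, not the repository's actual implementation): each compound library in turn is held out as the test set while the remaining libraries form the training set.

```python
from collections import defaultdict

def lolo_splits(records):
    """Yield (held_out_library, train, test) tuples: each library
    in turn becomes the test set (leave-one-library-out)."""
    by_library = defaultdict(list)
    for library, compound in records:
        by_library[library].append(compound)
    for held_out in by_library:
        test = list(by_library[held_out])
        train = [c for lib, compounds in by_library.items()
                 if lib != held_out for c in compounds]
        yield held_out, train, test

# Toy example: compounds tagged with their source library.
data = [("libA", "mol1"), ("libA", "mol2"),
        ("libB", "mol3"), ("libC", "mol4")]
splits = {lib: (train, test) for lib, train, test in lolo_splits(data)}
```

With three libraries this produces three splits; for example, holding out libA leaves mol3 and mol4 for training and mol1 and mol2 for testing.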

Notes

Dependency management

PDM is used for dependency management. The required packages are pinned in the pdm.lock file.
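To recreate the environment from the lock file, something like the following should work (assuming PDM itself is installed; these are standard PDM commands, not repo-specific scripts):

```shell
# install PDM itself (one option among several)
pip install pdm

# install the dependencies pinned in pdm.lock
pdm install
```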

Input/Output

The Hierarchical Data Format (HDF5, .h5) is used to store the descriptor data and optimization results. Examples of how to read and write the hierarchical data structures can be found under /notebooks/IO.ipynb.
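As a minimal sketch of the kind of HDF5 round trip involved (using h5py; the group and dataset names below are illustrative assumptions, not the repository's actual schema, which is documented in /notebooks/IO.ipynb):

```python
import numpy as np
import h5py

# Stand-ins for molecular descriptors and a 2D embedding.
descriptors = np.random.rand(10, 2048).astype(np.float32)
embedding = np.random.rand(10, 2).astype(np.float32)

# Write: one group per benchmark run, arrays as datasets,
# metadata as attributes (hypothetical layout).
with h5py.File("example.h5", "w") as f:
    grp = f.create_group("benchmark")
    grp.create_dataset("descriptors", data=descriptors, compression="gzip")
    grp.create_dataset("embedding", data=embedding)
    grp.attrs["method"] = "t-SNE"

# Read the arrays and metadata back.
with h5py.File("example.h5", "r") as f:
    loaded = f["benchmark/descriptors"][:]
    method = f["benchmark"].attrs["method"]
```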

Descriptors (features)

Morgan fingerprints and MACCS keys are computed with RDKit.
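For reference, both descriptor types can be generated with RDKit along these lines (the radius and bit-count values are common defaults, not necessarily those used in the paper):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromSmiles("CCO")  # ethanol

# Morgan (circular) fingerprint: radius 2, 2048 bits.
morgan = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# MACCS keys: fixed 167-bit structural key fingerprint.
maccs = MACCSkeys.GenMACCSKeys(mol)
```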

Generative topographic mapping

The results for the generative topographic mapping (GTM) in the original publication were obtained using an in-house GTM implementation. In this repository, an open-source implementation of the GTM algorithm – ugtm – was added for comparison.

Citation

If you use the code from this repository, please cite the following publication.