This repository contains the code and data accompanying the paper "siRNA Features - Reproducible Structure-Based Chemical Features for Off-Target Prediction".
This project introduces a framework for generating reproducible, structure-based chemical features for siRNA. It combines molecular fingerprints with computationally derived siRNA-AGO2 complex structures, and uses the resulting features to train machine learning models that predict off-target effects of siRNA therapeutics.
Key features:
- Structural prediction and modeling of siRNA-AGO2 complexes
- Extended Connectivity Fingerprint (ECFP) generation for siRNA modifications
- Multiple feature representation strategies (9 distinct datasets)
- Machine learning pipeline using AutoGluon
- Reproducible molecular dynamics simulations
# Clone the repository
git clone https://github.com/username/siRNA-Features.git
cd siRNA-Features
# Create and activate a conda environment
conda create -n sirna_features python=3.8
conda activate sirna_features
# Install dependencies
pip install -r requirements.txt
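Once the dependencies are installed, a quick import check can confirm that the environment is usable before opening any notebooks. The sketch below uses only the standard library. The package names in `ASSUMED_PACKAGES` are guesses (AutoGluon is named in this README; RDKit, NumPy, and pandas are assumptions) and should be adjusted to match `requirements.txt`.

```python
from importlib.util import find_spec

# Assumed top-level package names; adjust to match requirements.txt.
ASSUMED_PACKAGES = ["rdkit", "autogluon", "numpy", "pandas"]


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(ASSUMED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```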
siRNA-Features/
├── data/ # Dataset storage
│ ├── raw/ # Raw RNA-Seq data
│ └── processed/ # Processed features and results
├── docs/ # Documentation
├── models/ # Trained models
├── notebooks/ # Jupyter notebooks
│ ├── RNASeq/ # RNA-Seq analysis notebooks
│ ├── alignments/ # Structure alignment notebooks
│ └── modifications/ # Chemical modification notebooks
├── src/ # Source code
│ ├── features/ # Feature generation
│ ├── models/ # Model training and evaluation
│ └── utils/ # Utility functions
├── tests/ # Unit tests
├── LICENSE
└── README.md
- Structural Prediction:
from src.features import structural_prediction
model = structural_prediction.ChaiDiscovery()
structure = model.predict_structure(sequence)
- ECFP Generation:
from src.features import fingerprints
ecfps = fingerprints.generate_ecfp(molecule, nbits=512, radius=2)
- Dataset Creation:
from src.features import dataset_builder
dataset = dataset_builder.create_dataset(structure, ecfps, type="dataset3")
- Model Training:
from src.models import training
model = training.AutoGluonTrainer(dataset)
model.train()
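For readers unfamiliar with ECFPs, the following sketch shows how a 512-bit, radius-2 fingerprint could be computed with RDKit's Morgan fingerprint generator. It is not the repository's `fingerprints.generate_ecfp` implementation, whose internals are not shown here. The function name `ecfp_bits`, the use of RDKit, and the SMILES string input are all assumptions made for illustration.

```python
def ecfp_bits(smiles, n_bits=512, radius=2):
    """Return an ECFP-style Morgan fingerprint as a list of 0/1 ints.

    Hypothetical helper: assumes RDKit is installed and that the
    molecule is supplied as a SMILES string.
    """
    # Imported lazily so this module still loads where RDKit is absent.
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")

    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    fingerprint = generator.GetFingerprint(mol)

    # Expand the on-bit indices into a dense 0/1 vector of length n_bits.
    on_bits = set(fingerprint.GetOnBits())
    return [1 if i in on_bits else 0 for i in range(n_bits)]
```

Calling `ecfp_bits("CCO")` returns a list of 512 zeros and ones. A radius of 2 corresponds to what is commonly called ECFP4, since the name refers to the diameter of the atom environments.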
The datasets used in this study are available at:
- RNA-Seq dataset: [Dataset Link]
- Processed features: [Features Link]
- Pre-trained models: [Models Link]
Our framework achieved:
- AUPRC scores up to 0.784 (Dataset 3)
- Robust performance across multiple feature representations
- Reproducible structural predictions
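AUPRC, the area under the precision-recall curve, summarizes how well a classifier ranks positive examples ahead of negative ones, and is often preferred to ROC AUC when the positive class is rare. The sketch below computes the average-precision estimate of AUPRC in plain Python. It only illustrates the metric: the function name `average_precision` is an assumption, and this is not the evaluation code used in the paper. Tied scores are resolved by input order here, so results can differ slightly from library implementations.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision at each positive, ranked by score.

    labels: iterable of 0/1 ground-truth values
    scores: iterable of predicted scores (higher = more likely positive)
    """
    # Rank examples from highest to lowest predicted score.
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total_positives = sum(label for _, label in pairs)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positive labels")

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(pairs, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives
```

For labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, the positives sit at ranks 1 and 3, so the result is (1/1 + 2/3) / 2 = 5/6, roughly 0.833.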
If you use this code or find our work helpful, please cite:
@article{richter2024sirna,
title={siRNA Features - Reproducible Structure-Based Chemical Features for Off-Target Prediction},
author={Richter, Michael and Admasub, Alem},
journal={Journal Name},
year={2024}
}
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions! Please see our contributing guidelines for details.
- Michael Richter - [email protected]
- Alem Admasub - [email protected]