Overview
The Cheminformatics Toolbox is a curated collection of Jupyter notebooks that serve as both a practical reference and a hands-on toolkit for research chemists. Each notebook focuses on a specific domain of cheminformatics operations, providing executable examples grounded in real scientific workflows rather than abstract documentation.
The Problem
Research chemists who need to work programmatically with chemical structures often face a steep learning curve. Existing documentation for tools like RDKit is comprehensive but highly technical: it is written for developers, not scientists. Chemists need a resource that speaks their language while unlocking the power of computational methods.
The Solution
Notebook Modules
The toolbox is organized into thematic notebooks covering the full cheminformatics workflow:
1. Structure File I/O
- Reading and writing chemical structure files and line notations (SDF, MOL, SMILES), plus SMARTS query patterns
- Format interconversion using RDKit and OpenBabel
- Batch processing of large compound libraries
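
As a minimal sketch of the read/write round trip listed above, the following uses RDKit to write a few molecules to an in-memory SDF and read them back. The SMILES strings and compound names are hypothetical placeholders, not examples taken from the notebooks themselves.

```python
import io

from rdkit import Chem

# Hypothetical input compounds (ethanol, benzene, acetic acid)
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles_list]

# Write the molecules to an SDF held in memory instead of on disk
buffer = io.StringIO()
writer = Chem.SDWriter(buffer)
for i, mol in enumerate(mols):
    mol.SetProp("_Name", f"cmpd_{i}")  # becomes the SDF title line
    writer.write(mol)
writer.close()
sdf_text = buffer.getvalue()

# Read the SDF back and convert each record to canonical SMILES
supplier = Chem.SDMolSupplier()
supplier.SetData(sdf_text)
records = [m for m in supplier if m is not None]  # skip unparsable records
round_trip_smiles = [Chem.MolToSmiles(m) for m in records]
round_trip_names = [m.GetProp("_Name") for m in records]
```

For a large library on disk, the same loop would likely take a file path in `Chem.SDMolSupplier(path)` and process records one at a time rather than holding them all in a list.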
2. Molecular Vectorization & Featurization
- Fingerprint generation: Morgan (ECFP-style circular) fingerprints and MACCS keys
- 1D/2D/3D descriptor calculation using RDKit
- Graph-based representations for GNN applications
- Physicochemical property profiling (MW, LogP, TPSA, HBD/HBA)
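
The featurization steps above can be sketched with RDKit as follows. This is an illustrative example only: aspirin is an assumed test molecule, and the radius and bit-vector size are common defaults rather than settings confirmed by the notebooks.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, MACCSkeys
from rdkit.Chem import rdFingerprintGenerator, rdMolDescriptors

# Assumed example molecule: aspirin
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Morgan fingerprint, radius 2 (roughly comparable to ECFP4), 2048 bits
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
morgan_fp = morgan_gen.GetFingerprint(mol)

# MACCS structural keys (RDKit returns a 167-bit vector; bit 0 is unused)
maccs_fp = MACCSkeys.GenMACCSKeys(mol)

# Physicochemical property profile
profile = {
    "MW": Descriptors.MolWt(mol),
    "LogP": Crippen.MolLogP(mol),
    "TPSA": rdMolDescriptors.CalcTPSA(mol),
    "HBD": rdMolDescriptors.CalcNumHBD(mol),
    "HBA": rdMolDescriptors.CalcNumHBA(mol),
}

for name, value in profile.items():
    print(f"{name}: {value:.2f}")
```

The resulting bit vectors can be passed to similarity functions or converted to arrays for use with scikit-learn.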
3. Molecular Transformations
- R-group decomposition — scaffold extraction and substituent enumeration
- Reaction-based enumeration using reaction SMARTS (RXNSMARTS)
- Atom-mapped transformations and bond modifications
- Combinatorial library generation
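
A small sketch of reaction-based enumeration, assuming an amide coupling as the transformation. The reaction SMARTS, the building blocks, and the variable names are illustrative choices, not taken from the toolbox.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Assumed transformation: carboxylic acid + primary amine -> amide.
# Atom maps (:1, :2, :3) carry atoms from reactants into the product.
amide_coupling = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OH].[NX3;H2:3]>>[C:1](=[O:2])[N:3]"
)

# Hypothetical building blocks
acids = [Chem.MolFromSmiles(s) for s in ["CC(=O)O", "OC(=O)c1ccccc1"]]
amines = [Chem.MolFromSmiles(s) for s in ["NCC", "NC1CC1"]]

# Enumerate every acid/amine pair and collect unique canonical SMILES
library = set()
for acid in acids:
    for amine in amines:
        for product_set in amide_coupling.RunReactants((acid, amine)):
            product = product_set[0]
            Chem.SanitizeMol(product)  # reaction products are unsanitized
            library.add(Chem.MolToSmiles(product))

for smi in sorted(library):
    print(smi)
```

Collecting canonical SMILES in a set removes duplicates that arise when a reactant matches the template in more than one way.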
4. Molecular Dataset Analysis
- MCS (Maximum Common Substructure) analysis across compound sets
- Clustering methods: Butina, K-means, hierarchical, DBSCAN
- Tanimoto similarity matrices and nearest-neighbor analysis
- SAR (structure-activity relationship) analysis, with visualization of property-activity trends
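
As a hedged illustration of the similarity and clustering steps above, the sketch below computes Tanimoto distances on Morgan fingerprints and groups compounds with Butina clustering. The compound list and the 0.6 distance cutoff are assumptions chosen for the example.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
from rdkit.ML.Cluster import Butina

# Hypothetical compound set: three alcohols and three aromatics
smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1", "CCc1ccccc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [gen.GetFingerprint(m) for m in mols]

# Butina.ClusterData expects a flattened lower-triangle distance matrix
distances = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    distances.extend(1.0 - s for s in sims)

# Assumed cutoff: compounds within 0.6 Tanimoto distance are neighbors
clusters = Butina.ClusterData(distances, len(fps), 0.6, isDistData=True)

for cluster_id, members in enumerate(clusters):
    print(cluster_id, [smiles[idx] for idx in members])
```

Each cluster is a tuple of indices into the input list, with the cluster centroid listed first.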
Libraries Covered
- RDKit — primary cheminformatics engine
- OpenBabel — format conversion and additional descriptors
- Schrödinger BBChem API — integration with Schrödinger’s tooling
- Additional Python chemistry modules and scikit-learn for ML integration
Design Philosophy
Each notebook prioritizes:
- User features over API completeness — showing what chemists actually need to do
- Real-world use cases — examples derived from actual synthesis campaign scenarios
- Scientific context — explaining why each operation matters, not just how
Impact
- Widely distributed across the chemistry organization and shared externally with partner companies
- Reduced onboarding time for computational methods among bench chemists
- Enabled non-programmers to execute complex cheminformatics workflows independently
- Served as a living reference document updated with new workflows as they emerged