Machine Learning · Cheminformatics · Python · AWS · React · PySpark · RDKit

PurifAI

Machine learning model for predicting optimal SPE cartridges and LC/MS analysis methods for HTS compound purification — deployed as a React web application


Overview

PurifAI is a machine learning–powered purification forecasting tool developed as the capstone project for the UC Berkeley Extension Data Analytics Bootcamp. It addresses a core bottleneck in high-throughput synthesis workflows: selecting the right solid-phase extraction (SPE) cartridge and LC/MS method for compound purification without relying solely on chemist intuition or trial-and-error.

The model ingests molecular structure data alongside historical purification outcomes, learns the relationship between molecular features and purification performance, and predicts which cartridge type and mass spectrometry method will yield the best result for a given compound — before the chemist ever loads a sample.

Problem Statement

In automated high-throughput chemistry labs, purification is a critical final step that determines compound quality and throughput capacity. Selecting the wrong SPE cartridge or LC/MS gradient can result in:

  • Poor purity outcomes requiring re-purification
  • Sample loss or compound decomposition
  • Significant analyst time wasted on iterative method development
  • Reduced throughput at the purification stage

Experienced chemists develop intuition for these selections over years of practice, but this knowledge is largely tacit and difficult to transfer. PurifAI captures it computationally.

Dataset & Pipeline

Data Collection

Historical purification records were extracted from the lab’s registration and ELN systems, including:

  • Compound SMILES strings
  • SPE cartridge type used (C18, SCX, SAX, mixed-mode, etc.)
  • LC/MS method selected (gradient, mobile phase, flow rate)
  • Purification outcome (purity %, yield %, mass recovery)

Records spanned several thousand purification runs across multiple compound libraries and therapeutic programs.
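
As an illustration, pulling these records from the warehouse might look like the query below; the table and column names are hypothetical stand-ins for the actual RDS schema, not the production extraction code.

```python
# Illustrative extract from the purification warehouse (table/column names are hypothetical).
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string for the AWS RDS (PostgreSQL) instance
engine = create_engine(
    "postgresql+psycopg2://user:password@purifai-db.example.rds.amazonaws.com:5432/purifai"
)

query = """
    SELECT smiles, cartridge_type, lcms_method, purity_pct, yield_pct, mass_recovery_mg
    FROM purification_runs
    WHERE purity_pct IS NOT NULL
"""
records = pd.read_sql(query, engine)
print(records.shape)  # several thousand rows across libraries and programs
```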

Feature Engineering

Molecular features were computed from SMILES using RDKit and PySpark, backed by an AWS RDS database (see the featurization sketch after this list):

  • Physicochemical descriptors: molecular weight, cLogP, TPSA, HBD/HBA counts, rotatable bonds, formal charge
  • Morgan fingerprints (ECFP4): radius-2 circular fingerprints (2048 bits) for structural encoding
  • MACCS keys: 166-bit structural keys for functional group presence
  • Fragment counts: ring systems, heteroatom counts, sp3 fraction
  • Custom SPE-relevant features: ionization state at pH 2, 7, and 10; predicted pKa; polar surface area partitioned by donor/acceptor type
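
A minimal featurization sketch along these lines, using standard RDKit calls — the function and feature names are illustrative, not the production pipeline:

```python
# Minimal featurization sketch (illustrative; not the production pipeline).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, MACCSkeys, rdMolDescriptors

def featurize(smiles: str) -> dict:
    """Compute a subset of the descriptors and fingerprints used by PurifAI."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparseable SMILES: {smiles}")

    # Physicochemical descriptors
    features = {
        "mol_wt": Descriptors.MolWt(mol),
        "clogp": Descriptors.MolLogP(mol),
        "tpsa": Descriptors.TPSA(mol),
        "hbd": rdMolDescriptors.CalcNumHBD(mol),
        "hba": rdMolDescriptors.CalcNumHBA(mol),
        "rot_bonds": rdMolDescriptors.CalcNumRotatableBonds(mol),
        "formal_charge": Chem.GetFormalCharge(mol),
        "frac_csp3": rdMolDescriptors.CalcFractionCSP3(mol),
    }

    # Morgan fingerprint (ECFP4: radius 2, 2048 bits)
    ecfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    features["ecfp4"] = np.array(ecfp4)

    # MACCS keys (167-bit vector; bit 0 is unused, giving 166 structural keys)
    features["maccs"] = np.array(MACCSkeys.GenMACCSKeys(mol))

    return features

print(featurize("CC(=O)Oc1ccccc1C(=O)O")["mol_wt"])  # aspirin, ~180.16
```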

Data Preprocessing

  • SMILES standardization via RDKit (salt stripping, charge normalization, aromaticity perception)
  • Duplicate detection using InChIKey hashing
  • Class balancing via SMOTE (synthetic minority oversampling) for underrepresented cartridge types
  • Train/validation/test split: 70/15/15 stratified by cartridge class
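
Condensed into code, the preprocessing might look like the sketch below. The standardization helper uses standard RDKit calls; the feature matrix here is a synthetic stand-in for the output of the featurization step.

```python
# Preprocessing sketch (illustrative; not the production pipeline).
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

remover = SaltRemover()

def standardize(smiles: str):
    """Salt-strip a structure and return (canonical SMILES, InChIKey) for dedup."""
    mol = Chem.MolFromSmiles(smiles)   # aromaticity perceived on parse
    mol = remover.StripMol(mol)        # drop counterions / salt fragments
    return Chem.MolToSmiles(mol), Chem.MolToInchiKey(mol)

print(standardize("CCN.Cl"))  # ethylamine hydrochloride -> parent amine

# Stand-in feature matrix / cartridge labels (the real X, y come from featurization)
X, y = make_classification(n_samples=500, n_classes=4, n_informative=8,
                           weights=[0.55, 0.25, 0.15, 0.05], random_state=42)

# 70/15/15 split, stratified by cartridge class
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# Oversample minority cartridge classes on the training set only
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```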

Models

PurifAI uses an ensemble voting classifier combining four base learners:

| Model | Role | Hyperparameter Notes |
| --- | --- | --- |
| Random Forest | Structural pattern recognition | 500 estimators, max_depth=20, balanced class weights |
| XGBoost | Gradient boosting on tabular features | eta=0.05, max_depth=6, subsample=0.8 |
| AdaBoost | Weak-learner ensemble | 200 estimators over a decision tree base |
| Logistic Regression | Linear decision-boundary baseline | L2 regularization, C=0.1, class_weight='balanced' |

The final prediction uses soft voting (probability averaging) across all four models, weighting predictions by individual validation-set performance.
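
In scikit-learn terms, the ensemble can be sketched roughly as follows; hyperparameters mirror the table above, and the voting weights shown are placeholders for the validation-derived values.

```python
# Ensemble sketch (illustrative): soft voting over the four base learners.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

rf = RandomForestClassifier(n_estimators=500, max_depth=20, class_weight="balanced", random_state=42)
xgb = XGBClassifier(learning_rate=0.05, max_depth=6, subsample=0.8, random_state=42)  # learning_rate == eta
ada = AdaBoostClassifier(n_estimators=200, random_state=42)  # decision-tree stump base learner by default
logreg = LogisticRegression(penalty="l2", C=0.1, class_weight="balanced", max_iter=1000)

ensemble = VotingClassifier(
    estimators=[("rf", rf), ("xgb", xgb), ("ada", ada), ("lr", logreg)],
    voting="soft",                   # average predicted probabilities
    weights=[1.0, 1.2, 0.8, 0.6],    # placeholder weights; the real ones come from validation performance
)

# Using the split from the preprocessing sketch above:
ensemble.fit(X_train_bal, y_train_bal)
proba = ensemble.predict_proba(X_val)  # per-cartridge probabilities for ranking
```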

Model Performance

Cross-validated accuracy on cartridge selection:

  • Random Forest: ~78% (top-1 accuracy)
  • XGBoost: ~81%
  • AdaBoost: ~74%
  • Logistic Regression: ~69%
  • Ensemble (soft vote): ~84% — top-1 accuracy across all cartridge types

Top-2 accuracy (correct answer within top 2 predictions): ~93%
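
Top-k accuracy falls out directly from the ensemble's class probabilities; a small helper along these lines computes it (scikit-learn's top_k_accuracy_score gives the same quantity):

```python
import numpy as np

def top_k_accuracy(model, X, y_true, k=2):
    """Fraction of samples whose true cartridge class is among the k highest-probability predictions."""
    proba = model.predict_proba(X)              # shape: (n_samples, n_classes)
    top_k = np.argsort(proba, axis=1)[:, -k:]   # column indices of the k most probable classes
    labels = model.classes_[top_k]              # map column indices back to class labels
    return float(np.mean([y in row for y, row in zip(y_true, labels)]))

# e.g. top_k_accuracy(ensemble, X_test, y_test, k=2) on the held-out test set
```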

Architecture

[Compound SMILES]
        ↓
[RDKit Feature Extraction]  ←→  [AWS RDS — Historical Purification Records]
        ↓
[PySpark Feature Pipeline]  (normalization, fingerprint encoding)
        ↓
[Ensemble Classifier]
  ├─ Random Forest
  ├─ XGBoost
  ├─ AdaBoost
  └─ Logistic Regression
        ↓
[Soft Vote Aggregation]
        ↓
[Predicted SPE Cartridge + LC/MS Method]
        ↓
[React Web App — User Interface]
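
The PySpark stage wraps the RDKit featurizer in Spark UDFs so descriptor and fingerprint computation can be distributed across the compound set. A minimal sketch, assuming a single-descriptor UDF and an in-memory stand-in for the RDS extract:

```python
# PySpark featurization sketch (illustrative): applying an RDKit descriptor as a UDF.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from rdkit import Chem
from rdkit.Chem import Descriptors

spark = SparkSession.builder.appName("purifai-features").getOrCreate()

@udf(returnType=DoubleType())
def clogp(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return float(Descriptors.MolLogP(mol)) if mol is not None else None

df = spark.createDataFrame([("CC(=O)Oc1ccccc1C(=O)O",)], ["smiles"])  # stand-in for the RDS extract
df.withColumn("clogp", clogp("smiles")).show()
```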

Web Application

PurifAI is deployed as a React single-page application with a Python/Flask backend. The interface allows chemists to:

  1. Input a compound SMILES or draw a structure
  2. View computed molecular features and fingerprint heatmap
  3. Receive ranked cartridge and method predictions with confidence scores
  4. Explore similar compounds from the training set and their purification outcomes

The app connects to the AWS RDS instance for real-time feature retrieval and logs new predictions back to the database for continuous model improvement.
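
A stripped-down version of the prediction endpoint might look like the sketch below; the route name, payload shape, and helper functions (build_feature_vector, log_prediction) are assumptions rather than the deployed API.

```python
# Minimal Flask prediction endpoint sketch (illustrative; not the deployed API).
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/predict", methods=["POST"])       # hypothetical route
def predict():
    smiles = request.get_json()["smiles"]
    features = featurize(smiles)                    # featurization sketch from above
    X = build_feature_vector(features)              # hypothetical: descriptors + fingerprints -> one row
    proba = ensemble.predict_proba(X)[0]
    ranked = sorted(zip(ensemble.classes_, proba), key=lambda pair: pair[1], reverse=True)
    # log_prediction(smiles, ranked)                # hypothetical write-back to the RDS instance
    return jsonify({
        "predictions": [
            {"cartridge": str(cls), "confidence": round(float(p), 3)} for cls, p in ranked
        ]
    })

if __name__ == "__main__":
    app.run(debug=True)
```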

Technologies

Machine Learning & Chemistry

  • Python — core modeling and pipeline
  • scikit-learn — Random Forest, AdaBoost, Logistic Regression, ensemble voting
  • XGBoost — gradient boosted trees
  • RDKit — molecular feature extraction, fingerprinting, standardization
  • PySpark — distributed feature pipeline over large compound datasets
  • imbalanced-learn — SMOTE oversampling for class balancing

Cloud & Data Infrastructure

  • AWS RDS (PostgreSQL) — historical purification data warehouse
  • AWS S3 — model artifact storage
  • Boto3 — AWS SDK for Python

Web & Deployment

  • React — frontend SPA
  • Flask — REST API backend
  • Axios — HTTP client for frontend–backend communication

Key Learnings

This project surfaced several practical lessons in applied cheminformatics ML:

  • Class imbalance matters enormously — minority cartridge types (e.g., mixed-mode, SAX) were dramatically underrepresented and required aggressive oversampling to achieve useful accuracy
  • Morgan fingerprints + physicochemical descriptors outperform fingerprints alone — adding cLogP, TPSA, and pKa-derived features improved top-1 accuracy by ~6%
  • Soft voting over hard voting — allowing models to express uncertainty via probability averaging outperformed majority-rules voting by ~3%
  • Data quality is the real bottleneck — standardizing SMILES, deduplicating records, and removing incomplete outcome data had more impact than hyperparameter tuning

Source Code

This project can be found at luperrin/purifAI; additional details are in the repository README.

View the source on GitHub →
