Machine Learning · Cheminformatics · Python · AWS · React · PySpark · RDKit

PurifAI

Machine learning model for predicting optimal SPE cartridges and LC/MS analysis methods for HTS compound purification — deployed as a React web application


Overview

PurifAI is a machine learning–powered purification forecasting tool developed as the capstone project for the UC Berkeley Extension Data Analytics Bootcamp. It addresses a core bottleneck in high-throughput synthesis workflows: selecting the right solid-phase extraction (SPE) cartridge and LC/MS method for compound purification without relying solely on chemist intuition or trial-and-error.

The model ingests molecular structure data alongside historical purification outcomes, learns the relationship between molecular features and purification performance, and predicts which cartridge type and mass spectrometry method will yield the best result for a given compound — before the chemist ever loads a sample.

Problem Statement

In automated high-throughput chemistry labs, purification is a critical final step that determines compound quality and throughput capacity. Selecting the wrong SPE cartridge or LC/MS gradient can result in:

  • Poor purity outcomes requiring re-purification
  • Sample loss or compound decomposition
  • Significant analyst time wasted on iterative method development
  • Reduced throughput at the purification stage

Experienced chemists develop intuition for these selections over years of practice, but this knowledge is largely tacit and difficult to transfer. PurifAI captures it computationally.

Dataset & Pipeline

Data Collection

Historical purification records were extracted from the lab’s registration and ELN systems, including:

  • Compound SMILES strings
  • SPE cartridge type used (C18, SCX, SAX, mixed-mode, etc.)
  • LC/MS method selected (gradient, mobile phase, flow rate)
  • Purification outcome (purity %, yield %, mass recovery)

Records spanned several thousand purification runs across multiple compound libraries and therapeutic programs.
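
As an illustration, pulling these records from the warehouse might look like the query below; the table and column names are hypothetical stand-ins for the actual RDS schema, not the production extraction code.

```python
# Illustrative extract from the purification warehouse (table/column names are hypothetical).
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string for the AWS RDS (PostgreSQL) instance
engine = create_engine(
    "postgresql+psycopg2://user:password@purifai-db.example.rds.amazonaws.com:5432/purifai"
)

query = """
    SELECT smiles, cartridge_type, lcms_method, purity_pct, yield_pct, mass_recovery_mg
    FROM purification_runs
    WHERE purity_pct IS NOT NULL
"""
records = pd.read_sql(query, engine)
print(records.shape)  # several thousand rows across libraries and programs
```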

Feature Engineering

Molecular features were computed from SMILES using RDKit and PySpark, backed by an AWS RDS database (see the featurization sketch after this list):

  • Physicochemical descriptors: molecular weight, cLogP, TPSA, HBD/HBA counts, rotatable bonds, formal charge
  • Morgan fingerprints (ECFP4): radius-2 circular fingerprints (2048 bits) for structural encoding
  • MACCS keys: 166-bit structural keys for functional group presence
  • Fragment counts: ring systems, heteroatom counts, sp3 fraction
  • Custom SPE-relevant features: ionization state at pH 2, 7, and 10; predicted pKa; polar surface area partitioned by donor/acceptor type
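
A minimal featurization sketch along these lines, using standard RDKit calls — the function and feature names are illustrative, not the production pipeline:

```python
# Minimal featurization sketch (illustrative; not the production pipeline).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, MACCSkeys, rdMolDescriptors

def featurize(smiles: str) -> dict:
    """Compute a subset of the descriptors and fingerprints used by PurifAI."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparseable SMILES: {smiles}")

    # Physicochemical descriptors
    features = {
        "mol_wt": Descriptors.MolWt(mol),
        "clogp": Descriptors.MolLogP(mol),
        "tpsa": Descriptors.TPSA(mol),
        "hbd": rdMolDescriptors.CalcNumHBD(mol),
        "hba": rdMolDescriptors.CalcNumHBA(mol),
        "rot_bonds": rdMolDescriptors.CalcNumRotatableBonds(mol),
        "formal_charge": Chem.GetFormalCharge(mol),
        "frac_csp3": rdMolDescriptors.CalcFractionCSP3(mol),
    }

    # Morgan fingerprint (ECFP4: radius 2, 2048 bits)
    ecfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    features["ecfp4"] = np.array(ecfp4)

    # MACCS keys (167-bit vector; bit 0 is unused, giving 166 structural keys)
    features["maccs"] = np.array(MACCSkeys.GenMACCSKeys(mol))

    return features

print(featurize("CC(=O)Oc1ccccc1C(=O)O")["mol_wt"])  # aspirin, ~180.16
```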

Data Preprocessing

  • SMILES standardization via RDKit (salt stripping, charge normalization, aromaticity perception)
  • Duplicate detection using InChIKey hashing
  • Class balancing via SMOTE (synthetic minority oversampling) for underrepresented cartridge types
  • Train/validation/test split: 70/15/15 stratified by cartridge class
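
Condensed into code, the preprocessing might look like the sketch below. The standardization helper uses standard RDKit calls; the feature matrix here is a synthetic stand-in for the output of the featurization step.

```python
# Preprocessing sketch (illustrative; not the production pipeline).
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

remover = SaltRemover()

def standardize(smiles: str):
    """Salt-strip a structure and return (canonical SMILES, InChIKey) for dedup."""
    mol = Chem.MolFromSmiles(smiles)   # aromaticity perceived on parse
    mol = remover.StripMol(mol)        # drop counterions / salt fragments
    return Chem.MolToSmiles(mol), Chem.MolToInchiKey(mol)

print(standardize("CCN.Cl"))  # ethylamine hydrochloride -> parent amine

# Stand-in feature matrix / cartridge labels (the real X, y come from featurization)
X, y = make_classification(n_samples=500, n_classes=4, n_informative=8,
                           weights=[0.55, 0.25, 0.15, 0.05], random_state=42)

# 70/15/15 split, stratified by cartridge class
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# Oversample minority cartridge classes on the training set only
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```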

Models

PurifAI uses an ensemble voting classifier combining four base learners:

| Model | Role | Hyperparameter Notes |
| --- | --- | --- |
| Random Forest | Structural pattern recognition | 500 estimators, max_depth=20, balanced class weights |
| XGBoost | Gradient boosting on tabular features | eta=0.05, max_depth=6, subsample=0.8 |
| AdaBoost | Weak-learner ensemble | 200 estimators over a decision tree base |
| Logistic Regression | Linear decision-boundary baseline | L2 regularization, C=0.1, class_weight='balanced' |

The final prediction uses soft voting (probability averaging) across all four models, weighting predictions by individual validation-set performance.
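
In scikit-learn terms, the ensemble can be sketched roughly as follows; hyperparameters mirror the table above, and the voting weights shown are placeholders for the validation-derived values.

```python
# Ensemble sketch (illustrative): soft voting over the four base learners.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

rf = RandomForestClassifier(n_estimators=500, max_depth=20, class_weight="balanced", random_state=42)
xgb = XGBClassifier(learning_rate=0.05, max_depth=6, subsample=0.8, random_state=42)  # learning_rate == eta
ada = AdaBoostClassifier(n_estimators=200, random_state=42)  # decision-tree stump base learner by default
logreg = LogisticRegression(penalty="l2", C=0.1, class_weight="balanced", max_iter=1000)

ensemble = VotingClassifier(
    estimators=[("rf", rf), ("xgb", xgb), ("ada", ada), ("lr", logreg)],
    voting="soft",                   # average predicted probabilities
    weights=[1.0, 1.2, 0.8, 0.6],    # placeholder weights; the real ones come from validation performance
)

# Using the split from the preprocessing sketch above:
ensemble.fit(X_train_bal, y_train_bal)
proba = ensemble.predict_proba(X_val)  # per-cartridge probabilities for ranking
```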

Model Performance

Cross-validated accuracy on cartridge selection:

  • Random Forest: ~78% (top-1 accuracy)
  • XGBoost: ~81%
  • AdaBoost: ~74%
  • Logistic Regression: ~69%
  • Ensemble (soft vote): ~84% — top-1 accuracy across all cartridge types

Top-2 accuracy (correct answer within top 2 predictions): ~93%
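
Top-k accuracy falls out directly from the ensemble's class probabilities; a small helper along these lines computes it (scikit-learn's top_k_accuracy_score gives the same quantity):

```python
import numpy as np

def top_k_accuracy(model, X, y_true, k=2):
    """Fraction of samples whose true cartridge class is among the k highest-probability predictions."""
    proba = model.predict_proba(X)              # shape: (n_samples, n_classes)
    top_k = np.argsort(proba, axis=1)[:, -k:]   # column indices of the k most probable classes
    labels = model.classes_[top_k]              # map column indices back to class labels
    return float(np.mean([y in row for y, row in zip(y_true, labels)]))

# e.g. top_k_accuracy(ensemble, X_test, y_test, k=2) on the held-out test set
```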

Architecture

[Compound SMILES]
        ↓
[RDKit Feature Extraction]  ←→  [AWS RDS — Historical Purification Records]
        ↓
[PySpark Feature Pipeline]  (normalization, fingerprint encoding)
        ↓
[Ensemble Classifier]
  ├─ Random Forest
  ├─ XGBoost
  ├─ AdaBoost
  └─ Logistic Regression
        ↓
[Soft Vote Aggregation]
        ↓
[Predicted SPE Cartridge + LC/MS Method]
        ↓
[React Web App — User Interface]
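
The PySpark stage wraps the RDKit featurizer in Spark UDFs so descriptor and fingerprint computation can be distributed across the compound set. A minimal sketch, assuming a single-descriptor UDF and an in-memory stand-in for the RDS extract:

```python
# PySpark featurization sketch (illustrative): applying an RDKit descriptor as a UDF.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from rdkit import Chem
from rdkit.Chem import Descriptors

spark = SparkSession.builder.appName("purifai-features").getOrCreate()

@udf(returnType=DoubleType())
def clogp(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return float(Descriptors.MolLogP(mol)) if mol is not None else None

df = spark.createDataFrame([("CC(=O)Oc1ccccc1C(=O)O",)], ["smiles"])  # stand-in for the RDS extract
df.withColumn("clogp", clogp("smiles")).show()
```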

Web Application

PurifAI is deployed as a React single-page application with a Python/Flask backend. The interface allows chemists to:

  1. Input a compound SMILES or draw a structure
  2. View computed molecular features and fingerprint heatmap
  3. Receive ranked cartridge and method predictions with confidence scores
  4. Explore similar compounds from the training set and their purification outcomes

The app connects to the AWS RDS instance for real-time feature retrieval and logs new predictions back to the database for continuous model improvement.
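
A stripped-down version of the prediction endpoint might look like the sketch below; the route name, payload shape, and helper functions (build_feature_vector, log_prediction) are assumptions rather than the deployed API.

```python
# Minimal Flask prediction endpoint sketch (illustrative; not the deployed API).
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/predict", methods=["POST"])       # hypothetical route
def predict():
    smiles = request.get_json()["smiles"]
    features = featurize(smiles)                    # featurization sketch from above
    X = build_feature_vector(features)              # hypothetical: descriptors + fingerprints -> one row
    proba = ensemble.predict_proba(X)[0]
    ranked = sorted(zip(ensemble.classes_, proba), key=lambda pair: pair[1], reverse=True)
    # log_prediction(smiles, ranked)                # hypothetical write-back to the RDS instance
    return jsonify({
        "predictions": [
            {"cartridge": str(cls), "confidence": round(float(p), 3)} for cls, p in ranked
        ]
    })

if __name__ == "__main__":
    app.run(debug=True)
```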

Technologies

Machine Learning & Chemistry

  • Python — core modeling and pipeline
  • scikit-learn — Random Forest, AdaBoost, Logistic Regression, ensemble voting
  • XGBoost — gradient boosted trees
  • RDKit — molecular feature extraction, fingerprinting, standardization
  • PySpark — distributed feature pipeline over large compound datasets
  • imbalanced-learn — SMOTE oversampling for class balancing

Cloud & Data Infrastructure

  • AWS RDS (PostgreSQL) — historical purification data warehouse
  • AWS S3 — model artifact storage
  • Boto3 — AWS SDK for Python

Web & Deployment

  • React — frontend SPA
  • Flask — REST API backend
  • Axios — HTTP client for frontend–backend communication

Key Learnings

This project surfaced several practical lessons in applied cheminformatics ML:

  • Class imbalance matters enormously — minority cartridge types (e.g., mixed-mode, SAX) were dramatically underrepresented and required aggressive oversampling to achieve useful accuracy
  • Morgan fingerprints + physicochemical descriptors outperform fingerprints alone — adding cLogP, TPSA, and pKa-derived features improved top-1 accuracy by ~6%
  • Soft voting over hard voting — allowing models to express uncertainty via probability averaging outperformed majority-rules voting by ~3%
  • Data quality is the real bottleneck — standardizing SMILES, deduplicating records, and removing incomplete outcome data had more impact than hyperparameter tuning

Source Code

This project can be found at luperrin/purifAI; additional details are in the repository README.

View the source on GitHub →
