cchem

High-performance cheminformatics library written in pure C.

Key Features:

  • 1600+ molecular descriptors
  • SMILES canonicalization with stereochemistry support
  • Molecular sanitization (salt removal, aromatization, neutralization, tautomers)
  • 2D/3D molecular visualization with MMFF94 force field
  • Multi-threaded batch processing with SIMD optimization

Features

SMILES Canonicalization

  • Morgan algorithm for canonical ordering
  • Stereochemistry preservation (@ chirality, E/Z double bonds)
  • Smallest Set of Smallest Rings (SSSR) detection
  • Aromaticity perception
  • Isotope and hydrogen handling
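
The Morgan algorithm named above ranks atoms by repeatedly replacing each atom's value with the sum of its neighbors' values, stopping once another round no longer increases the number of distinct values. The following is a minimal pure-Python sketch of that iteration on a plain adjacency list. It is not cchem's implementation: the function name `morgan_ranks` and the input format are assumptions made for illustration, and a real canonicalizer also breaks ties using atom properties such as element, charge, and stereochemistry.

```python
def morgan_ranks(adjacency):
    """Extended-connectivity values per atom (classic Morgan iteration).

    adjacency: list of neighbor-index lists, one per atom
    (hypothetical input format, for illustration only).
    """
    # Start from the degree of each atom.
    values = [len(neighbors) for neighbors in adjacency]
    distinct = len(set(values))
    while True:
        # New value of an atom = sum of its neighbors' current values.
        updated = [sum(values[j] for j in neighbors) for neighbors in adjacency]
        new_distinct = len(set(updated))
        # Stop once another round no longer separates more atoms.
        if new_distinct <= distinct:
            return values
        values, distinct = updated, new_distinct


# Heavy-atom graph of n-pentane: C0-C1-C2-C3-C4
pentane = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(morgan_ranks(pentane))  # [2, 3, 4, 3, 2]
```

Degrees alone give only two classes here (terminal and internal carbons). One iteration separates the central carbon from its neighbors, and symmetric atoms keep equal values.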

Molecular Sanitization

  • Salt removal: Identify and remove counter-ions (Na+, K+, Cl-, etc.)
  • Aromatization: Perceive aromaticity using Hückel's rule (4n+2 π electrons)
  • Kekulization: Convert aromatic bonds to alternating single/double
  • Neutralization: Remove protonation states (R-NH3+ → R-NH2, R-O- → R-OH)
  • Normalization: Standardize functional groups (nitro, sulfoxide, phosphate)
  • Tautomer enumeration: Generate keto-enol, amide-imidic acid tautomers
  • Cleanup options: Remove stereochemistry, isotopes, explicit hydrogens
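
As a small illustration of the Hückel criterion mentioned under aromatization, the sketch below checks whether a ring's π-electron count has the form 4n+2. It assumes the per-atom π-electron contributions are already known. The helper name `satisfies_huckel` and the contribution lists are illustrative assumptions, not part of cchem's API, and real aromaticity perception also has to handle ring planarity, fused systems, and charged atoms.

```python
def satisfies_huckel(pi_electrons):
    """True if the count equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Per-atom pi-electron contributions for a few rings (illustrative values):
# an sp2 carbon in a double bond contributes 1, a pyrrole-type N contributes 2.
rings = {
    "benzene": [1, 1, 1, 1, 1, 1],
    "pyrrole": [2, 1, 1, 1, 1],
    "cyclobutadiene": [1, 1, 1, 1],
}

for name, contributions in rings.items():
    print(name, satisfies_huckel(sum(contributions)))
```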

Molecular Descriptors (1600+)

| Category | Count | Description |
|----------|-------|-------------|
| Counts | 83 | Element counts (C, H, N, O, S, halogens), bond types, ring counts |
| Ratios | 30+ | Elemental ratios, hybridization ratios, electronegativity ratios |
| Topology | 20+ | Kier-Hall Chi indices, Zagreb indices, Wiener index, Balaban J |
| Electronic | 30+ | Gasteiger-Marsili charges (PEOE), electrotopological states |
| Steric | 20+ | Van der Waals volume, McGowan volume, TPSA |
| Energetic | 30+ | Born solvation proxies, FMO hardness, Hansen parameters |
| Fractional | 60+ | Molecular weight fractions, bond fractions |
| Hash | 30+ | SMILES hashing, n-gram hashes, MinHash signatures |
| Graph | 30+ | Density, centrality, clustering, spectral properties |
| Autocorrelation | 54 | Broto-Moreau 2D autocorrelations (ATS), lags 0-8 |
| Solubility | 1 | CLogS aqueous solubility |
| LogP/LogD | 2 | Wildman-Crippen LogP, neural network LogD at pH 7.4 (R²=0.94) |
| MQN | 42 | Molecular Quantum Numbers |
| VSA | 47 | SlogP_VSA, SMR_VSA, PEOE_VSA, EState_VSA |
| BCUT | 48 | Burden-CAS-University of Texas eigenvalues |
| Zagreb | 24 | Zagreb indices and variants |
| Information | 24 | Information content descriptors |
| Walk Counts | 36 | Molecular walk counts |
| E-State Sums | 32 | Electrotopological state sums |
| ETA | 24 | Extended topochemical atom indices |
| Ring Complexity | 18 | Ring system complexity measures |
| CPSA | 70 | Charged partial surface area |
| Moments | 42 | Molecular geometry moments |
| Aromatic | 64 | Aromatic system descriptors |
| Atom Pairs | 56 | Distance-based atom pair fingerprints |
| Framework | 40 | Molecular framework descriptors |
| Constitutional | 34 | Constitutional descriptors |
| Functional | 50+ | Carbonyl, nitrogen, sulfur, oxygen, heterocyclic scaffolds |
| Pharmacophore | 30+ | Pharmacophore points, density, drug-likeness metrics |
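
To make two of the topological descriptors in the table concrete, here is a minimal sketch of the Wiener index (sum of shortest-path bond distances over all atom pairs) and the first Zagreb index (sum of squared atom degrees) on a hydrogen-suppressed graph. It uses the textbook definitions only. The function names and the adjacency-list input are assumptions for illustration and say nothing about how cchem computes or names these descriptors. The graph is assumed to be connected.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path bond distances over all unordered atom pairs."""
    total = 0
    for source in range(len(adjacency)):
        # Breadth-first search gives bond distances from `source`.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in distance:
                    distance[neighbor] = distance[atom] + 1
                    queue.append(neighbor)
        total += sum(distance.values())
    return total // 2  # every pair was counted from both ends


def first_zagreb_index(adjacency):
    """Sum of squared vertex degrees."""
    return sum(len(neighbors) ** 2 for neighbors in adjacency)


butane = [[1], [0, 2], [1, 3], [2]]        # C-C-C-C
isobutane = [[1], [0, 2, 3], [1], [1]]     # C-C(-C)-C

print(wiener_index(butane), first_zagreb_index(butane))        # 10 10
print(wiener_index(isobutane), first_zagreb_index(isobutane))  # 9 12
```

Branching lowers the Wiener index (atoms are closer together) and raises the Zagreb index (one atom has a higher degree), which is why both are commonly used as shape-sensitive features.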

2D/3D Visualization

Example (olanzapine 2D depiction):

./cchem depict -S "CN1CCN(CC1)C/2=N/c4ccccc4Nc3sc(C)cc\23" -o olanzapine.png --colored-atoms --quality 100

  • 2D coordinate generation with automatic layout
  • 3D coordinate generation with MMFF94 force field optimization
  • Render styles: wireframe, sticks, ball-and-stick, spacefill, surface
  • PNG, SVG, and JPEG output with configurable quality
  • Atom coloring by element (CPK scheme)

Dataset Utilities

  • CSV and Parquet batch processing with parallel execution
  • Dataset splitting: random, scaffold-based (Murcko), stratified
  • Configurable train/validation/test ratios
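
The idea behind scaffold-based splitting is that molecules sharing a scaffold always land in the same subset, so the test set contains chemotypes the model has not seen. The sketch below shows one common greedy scheme: group by scaffold, then assign the largest groups first. It assumes the scaffold string for each molecule has already been computed. The function name `scaffold_split`, the placeholder scaffold strings, and the greedy assignment order are illustrative assumptions, not a description of cchem's internal algorithm.

```python
from collections import defaultdict


def scaffold_split(scaffolds, ratios=(80, 10, 10)):
    """Assign whole scaffold groups to splits; returns lists of row indices.

    scaffolds: one precomputed scaffold string per molecule (assumed input).
    ratios: percentages per split, in the style of "80,10,10".
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))

    targets = [len(scaffolds) * ratio / 100 for ratio in ratios]
    splits = [[] for _ in ratios]
    for _, members in ordered:
        for k in range(len(splits)):
            is_last = k == len(splits) - 1
            # Use the first split with room left; the last split takes the rest.
            if is_last or len(splits[k]) + len(members) <= targets[k]:
                splits[k].extend(members)
                break
    return splits


scaffolds = ["c1ccccc1"] * 6 + ["c1ccncc1"] * 2 + ["C1CCCCC1", "c1ccoc1"]
train, val, test = scaffold_split(scaffolds, ratios=(60, 20, 20))
print(train, val, test)  # [0, 1, 2, 3, 4, 5] [6, 7] [8, 9]
```

Because whole groups are moved, the realized split sizes only approximate the requested ratios when scaffold groups are large.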

Performance

  • SIMD optimization (-march=native)
  • Multi-threaded processing (pthreads)
  • Link Time Optimization (LTO)
  • Arena allocators and thread-local caches
  • Batch compute functions per descriptor category
  • Pipeline streaming for constant memory on large datasets
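
The streaming point above can be illustrated independently of cchem: if rows are read, transformed, and written in fixed-size chunks, peak memory depends on the chunk size rather than the dataset size. The sketch below shows that pattern with Python's standard `csv` module. All names (`stream_chunks`, `process_stream`, the `length` column) are hypothetical, and cchem's actual pipeline is implemented in C and may be organized differently.

```python
import csv
import io
from itertools import islice


def stream_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows without loading everything."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def process_stream(source, sink, transform, chunk_size=1000):
    """Read CSV rows from source, apply transform, write them to sink."""
    reader = csv.DictReader(source)
    writer = None
    for chunk in stream_chunks(reader, chunk_size):
        for row in chunk:
            result = transform(row)
            if writer is None:
                # Header is taken from the first transformed row.
                writer = csv.DictWriter(sink, fieldnames=list(result))
                writer.writeheader()
            writer.writerow(result)


# Placeholder transform: append the SMILES string length as a new column.
source = io.StringIO("smiles\nCCO\nc1ccccc1\n")
sink = io.StringIO()
process_stream(source, sink, lambda row: {**row, "length": len(row["smiles"])}, chunk_size=1)
print(sink.getvalue())
```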

Python Bindings

cchem is available as a Python package via pip:

pip install pycchem

import pycchem

# Canonicalize SMILES
pycchem.canonicalize("c1ccccc1")  # => 'c1ccccc1'

# Parse molecule and compute descriptors
mol = pycchem.Molecule("CCO")
mol.descriptor("MolecularWeight")  # => 46.069
descriptors = mol.descriptors()     # => dict of all 1600+ descriptors

# Batch canonicalization
results = pycchem.canonicalize_batch(["CCO", "c1ccccc1", "CC(=O)O"])

# Sanitize molecules
pycchem.sanitize("[Na+].CC(=O)[O-]", flags="complete")  # => 'C(=O)(O)C'

# Validate SMILES
pycchem.validate("CCO")      # => True
pycchem.validate("invalid")  # => False

The package also installs a cchem CLI command:

cchem canonicalize -S "c1ccccc1"
cchem compute -S "CCO" -d MolecularWeight

Supports Python 3.9+ on Linux, macOS, and Windows.

C Library Installation

Dependencies

| Library | Purpose | Required |
|---------|---------|----------|
| pthreads | Parallel processing | Yes |
| libm | Mathematics functions | Yes |
| zlib | Compression (for Parquet) | Yes |
| zstd | Compression (for Parquet) | Yes |
| cairo | 2D vector graphics rendering | Optional |
| libjpeg | JPEG image output | Optional |
| carquet | Apache Parquet file support | Auto-fetched |

Cairo and libjpeg are only required for the depict command. Build with -DWITH_CAIRO=OFF to skip these dependencies.

macOS (Homebrew)

# Full installation (with depict command)
brew install cairo jpeg zstd

# Minimal installation (without depict)
brew install zstd

Linux (apt)

# Full installation (with depict command)
sudo apt-get install libcairo2-dev libjpeg-dev zlib1g-dev libzstd-dev

# Minimal installation (without depict)
sudo apt-get install zlib1g-dev libzstd-dev

Windows (vcpkg)

# Full installation
vcpkg install cairo libjpeg-turbo zstd zlib --triplet x64-windows

# Minimal installation
vcpkg install zstd zlib --triplet x64-windows

Build

git clone https://github.com/Vitruves/cchem.git
cd cchem
mkdir build && cd build
cmake ..
make -j$(nproc)    # on macOS, where nproc is not installed by default: make -j$(sysctl -n hw.ncpu)

Quick Start

Canonicalize a SMILES

# Single molecule
./cchem canonicalize -S "c1ccccc1"
# Output: c1ccccc1

# Batch processing
./cchem canonicalize -f molecules.csv -s smiles -c canonical -o output.csv -n 4

Sanitize Molecules

# Complete sanitization (unsalt + aromatize + neutralize + normalize)
./cchem canonicalize -S "[Na+].CC(=O)[O-]" --sanitize complete
# Output: C(=O)(O)C  (acetic acid, Na+ removed, carboxylate neutralized)

# Remove salts only
./cchem canonicalize -S "CCO.[Cl-].[Na+]" --sanitize unsalt
# Output: CCO

# Aromatize Kekule form
./cchem canonicalize -S "C1=CC=CC=C1" --sanitize aromatize
# Output: c1ccccc1

# Multiple operations
./cchem canonicalize -S "[NH3+]Cc1ccccc1.[Cl-]" --sanitize unsalt,neutralize,aromatize
# Output: c1ccc(cc1)C[NH2]  (benzylamine)

# List tautomers
./cchem canonicalize -S "CC(=O)CC" --list-tautomers
# Output: CCC(=O)C  (keto-enol tautomers)

Compute Descriptors

# Single molecule with specific descriptors
./cchem compute -S "CCO" -d CarbonCount,HydrogenCount,MolecularWeight

# All descriptors for a dataset (CSV or Parquet)
./cchem compute -f data.csv -s smiles -d all -o descriptors.csv -n 8
./cchem compute -f data.parquet -s smiles -d all -o descriptors.parquet -n 8

# List available descriptors
./cchem compute --list

Generate Molecular Images

# 2D depiction (supports .jpg, .png, .svg output)
./cchem depict -S "c1ccccc1" -o benzene.png -m 2d

# 3D with MMFF94 optimization
./cchem depict -S "CCO" -o ethanol.jpg -m 3d -s balls-sticks --max-iter 500

Split Dataset

# Train/test split (80/20)
./cchem split -f data.csv -s smiles -o train.csv,test.csv

# Train/validation/test with scaffold splitting
./cchem split -f data.csv -s smiles -o train.csv,val.csv,test.csv \
    --split-ratios 80,10,10 --splitting-method scaffold

CLI Reference

canonicalize

Convert SMILES to canonical form (aromatized by default) with optional sanitization.

Usage: cchem canonicalize [options]

Options:
  -S, --smiles <string>     Single SMILES string to canonicalize
  -f, --file <path>         Input CSV file
  -s, --smiles-col <name>   Column name containing SMILES (default: smiles)
  -c, --canon-col <name>    Output column name for canonical SMILES
  -o, --output <path>       Output file path
  -n, --threads <num>       Number of threads (default: auto)
  --sanitize <opts>         Apply sanitization before canonicalization
                            Values: "complete" or comma-separated list of:
                              unsalt          - Remove salts, keep largest fragment
                              aromatize       - Perceive and apply aromaticity (default)
                              kekulize        - Convert aromatic to Kekule form
                              neutralize      - Neutralize charges
                              normalize       - Normalize functional groups
                              remove-stereo   - Remove stereochemistry
                              remove-isotopes - Remove isotope labels
                              remove-h        - Remove explicit hydrogens
                              validate        - Validate structure
  --list-tautomers          List all tautomeric forms
  -v, --verbose             Enable verbose output
  -h, --help                Print this help message

compute

Calculate molecular descriptors.

Usage: cchem compute [options]

Options:
  -S, --smiles <string>     Single SMILES string
  -f, --file <path>         Input CSV file
  -s, --smiles-col <name>   Column name containing SMILES (default: smiles)
  -d, --descriptors <list>  Comma-separated descriptor names or "all"
  -o, --output <path>       Output file path
  -n, --threads <num>       Number of threads (default: 1)
  -l, --list                List all available descriptors
  --no-canonicalization     Skip SMILES canonicalization
  -v, --verbose             Enable verbose output

depict

Generate 2D/3D molecular structure images.

Usage: cchem depict [options]

Options:
  -S, --smiles <string>     SMILES string to depict
  -o, --output <path>       Output file path (.jpg, .png, or .svg)
  -m, --mode <2d|3d>        Rendering mode (default: 2d)
  -s, --style <style>       Render style:
                              wireframe    - Simple lines
                              sticks       - Colored bond sticks
                              balls-sticks - Ball and stick model
                              spacefill    - CPK space-filling
                              surface      - Molecular surface
  -W, --width <pixels>      Image width (default: 800)
  -H, --height <pixels>     Image height (default: 800)
  --bond-length <float>     Bond length in pixels (default: 35)
  --bond-width <float>      Bond width in pixels (default: 2)
  --margin <pixels>         Image margin (default: 20)
  --show-carbons            Show carbon atom labels
  --show-hydrogens          Show hydrogen atoms
  --terminal-carbons        Show terminal carbon labels (e.g., CH3)
  --toggle-aromaticity      Toggle aromatic ring display
  --colored-atoms           Color heteroatom labels by element
  --proportional-atoms      Scale atom labels proportionally
  --atom-filling <float>    Atom sphere filling (0.0-1.0)
  --quality <int>           JPEG quality (1-100, default: 90)
  --max-iter <int>          Max MMFF94 iterations (default: 200)
  --surface-color <hex>     Surface color (e.g., 0x808080)
  --scale <float>           Scale factor for rendering
  --font-scale <float>      Font scale factor
  --heteroatom-gap <float>  Gap around heteroatom labels
  --line-cap <style>        Line cap style (butt, round, square)
  --transparent-background  Transparent background (PNG/SVG only)
  --debug                   Enable debug output

split

Split datasets for machine learning.

Usage: cchem split [options]

Options:
  -f, --file <path>              Input CSV file
  -o, --output <paths>           Comma-separated output file paths
  -s, --smiles-col <name>        Column containing SMILES (default: smiles)
  -n, --threads <num>            Number of threads for processing
  --split-ratios <ratios>        Comma-separated percentages (default: 80,20)
  --splitting-method <method>    Splitting method:
                                   random   - Random split
                                   scaffold - Murcko scaffold-based
  --stratified                   Use stratified splitting
  --seed <int>                   Random seed for reproducibility
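
To show how percentage ratios and a seed interact in a random split, here is a minimal sketch: shuffle the row indices with a seeded generator, then cut at the cumulative percentages. This is an illustration under stated assumptions, not cchem's code. The function name `random_split` is hypothetical, and cchem's own random number generator and rounding rules may differ, so the same seed will not reproduce cchem's output.

```python
import random


def random_split(n_items, ratios=(80, 20), seed=0):
    """Shuffle indices with a fixed seed and cut at cumulative percentages."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # private generator, global state untouched

    splits, start, cumulative = [], 0, 0
    for k, ratio in enumerate(ratios):
        cumulative += ratio
        if k == len(ratios) - 1:
            end = n_items  # last split takes any rounding remainder
        else:
            end = round(n_items * cumulative / sum(ratios))
        splits.append(indices[start:end])
        start = end
    return splits


train, test = random_split(10, ratios=(80, 20), seed=42)
print(len(train), len(test))  # 8 2
```

Calling the function twice with the same seed returns identical splits, which is the property a `--seed` option is meant to provide.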

validate

Check SMILES syntax validity.

Usage: cchem validate [options]

Options:
  -S, --smiles <string>     SMILES string to validate
  -v, --verbose             Show detailed validation info

version / help

cchem version    # Show version information
cchem help       # Show help message

Building from Source

Requirements

  • CMake 3.16+
  • C11-compatible compiler (GCC, Clang, MSVC)
  • Required: zlib, zstd
  • Optional: cairo, libjpeg (for depict command)

Build Commands

mkdir build && cd build
cmake ..
make -j$(nproc)

Build Options

| Option | Default | Description |
|--------|---------|-------------|
| CMAKE_BUILD_TYPE | Release | Build type: Release, Debug, RelWithDebInfo |
| ENABLE_SANITIZERS | OFF | Enable AddressSanitizer and UBSan (Debug only) |
| ENABLE_NATIVE_ARCH | ON | Optimize for native CPU (-march=native) |
| WITH_CAIRO | ON | Enable Cairo for 2D rendering and depict command |
| BUILD_BENCHMARKS | OFF | Build performance benchmarks |
| WITH_RDKIT | OFF | Include RDKit in benchmarks |
| WITH_OPENBABEL | OFF | Include OpenBabel in benchmarks |

# Standard release build (default)
cmake -DCMAKE_BUILD_TYPE=Release ..

# Debug build with sanitizers
cmake -DCMAKE_BUILD_TYPE=Debug -DENABLE_SANITIZERS=ON ..

# Portable build (disable native CPU optimization)
cmake -DENABLE_NATIVE_ARCH=OFF ..

# Minimal build without Cairo/JPEG (no depict command)
cmake -DWITH_CAIRO=OFF ..

# Build with benchmarks
cmake -DBUILD_BENCHMARKS=ON -DWITH_RDKIT=ON ..

Minimal Build

For systems without Cairo or when 2D rendering is not needed:

cmake -DWITH_CAIRO=OFF ..
make -j$(nproc)

This disables:

  • 2D/3D molecular visualization (depict command)
  • MMFF94 force field
  • JPEG dependency

Core functionality (canonicalization, descriptors, splitting) remains fully available.

Running Tests

cd build
ctest                    # Run all tests
ctest -V                 # Verbose output
ctest -R test_canon      # Run specific test

Project Structure

cchem/
├── include/cchem/
│   ├── cchem.h              # Main library header
│   ├── descriptors.h        # Descriptor system
│   ├── canonicalizer/       # SMILES parsing & canonicalization
│   │   ├── sanitize.h       # Sanitization API
│   │   └── ...              # Parser, molecule, stereo, etc.
│   └── depictor/            # 2D/3D visualization
├── src/
│   ├── main.c               # CLI entry point
│   ├── canonicalizer/       # SMILES implementation
│   │   ├── sanitize.c       # Sanitization implementation
│   │   └── ...
│   ├── descriptors/         # Descriptor implementations (1600+ descriptors)
│   └── depictor/            # Visualization & MMFF94
├── tests/                   # Test suite
├── data/                    # Training data & test files
└── doc/                     # Documentation

License

Licensed under the Apache License, Version 2.0.

Copyright 2025

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
