High-performance cheminformatics library written in pure C.
Key Features:
- 1600+ molecular descriptors
- SMILES canonicalization with stereochemistry support
- Molecular sanitization (salt removal, aromatization, neutralization, tautomers)
- 2D/3D molecular visualization with MMFF94 force field
- Multi-threaded batch processing with SIMD optimization
- Morgan algorithm for canonical ordering
- Stereochemistry preservation (@ chirality, E/Z double bonds)
- Smallest Set of Smallest Rings (SSSR) detection
- Aromaticity perception
- Isotope and hydrogen handling
- Salt removal: Identify and remove counter-ions (Na+, K+, Cl-, etc.)
- Aromatization: Perceive aromaticity using Hückel's rule (4n+2 π electrons)
- Kekulization: Convert aromatic bonds to alternating single/double
- Neutralization: Neutralize charged groups by adjusting protonation (R-NH3+ → R-NH2, R-O- → R-OH)
- Normalization: Standardize functional groups (nitro, sulfoxide, phosphate)
- Tautomer enumeration: Generate keto-enol, amide-imidic acid tautomers
- Cleanup options: Remove stereochemistry, isotopes, explicit hydrogens
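
The Hückel criterion named in the aromatization item can be made concrete with a small check. This is only a sketch of the 4n+2 electron count, not cchem's perception code; the function name and the per-atom π-electron contributions are assumptions chosen for the example.

```python
def satisfies_huckel(pi_electrons: int) -> bool:
    """Return True if the count equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Hypothetical per-atom pi-electron contributions for three rings.
rings = {
    "benzene": [1, 1, 1, 1, 1, 1],   # six sp2 carbons, one electron each
    "pyrrole": [1, 1, 1, 1, 2],      # four carbons plus the nitrogen lone pair
    "cyclobutadiene": [1, 1, 1, 1],  # 4 electrons, which is not of the form 4n + 2
}

huckel = {name: satisfies_huckel(sum(contrib)) for name, contrib in rings.items()}
print(huckel)
```

A real perception routine also has to decide ring membership, planarity, and how many electrons each atom donates; the count is only the final step.
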
| Category | Count | Description |
|---|---|---|
| Counts | 83 | Element counts (C, H, N, O, S, halogens), bond types, ring counts |
| Ratios | 30+ | Elemental ratios, hybridization ratios, electronegativity ratios |
| Topology | 20+ | Kier-Hall Chi indices, Zagreb indices, Wiener index, Balaban J |
| Electronic | 30+ | Gasteiger-Marsili charges (PEOE), electrotopological states |
| Steric | 20+ | Van der Waals volume, McGowan volume, TPSA |
| Energetic | 30+ | Born solvation proxies, FMO hardness, Hansen parameters |
| Fractional | 60+ | Molecular weight fractions, bond fractions |
| Hash | 30+ | SMILES hashing, n-gram hashes, MinHash signatures |
| Graph | 30+ | Density, centrality, clustering, spectral properties |
| Autocorrelation | 54 | Broto-Moreau 2D autocorrelations (ATS) lags 0-8 |
| Solubility | 1 | CLogS aqueous solubility |
| LogP/LogD | 2 | Wildman-Crippen LogP, Neural network LogD 7.4 (R²=0.94) |
| MQN | 42 | Molecular Quantum Numbers |
| VSA | 47 | SlogP_VSA, SMR_VSA, PEOE_VSA, EState_VSA |
| BCUT | 48 | Burden-CAS-University of Texas eigenvalues |
| Zagreb | 24 | Zagreb indices and variants |
| Information | 24 | Information content descriptors |
| Walk Counts | 36 | Molecular walk counts |
| E-State Sums | 32 | Electrotopological state sums |
| ETA | 24 | Extended topochemical atom indices |
| Ring Complexity | 18 | Ring system complexity measures |
| CPSA | 70 | Charged partial surface area |
| Moments | 42 | Molecular geometry moments |
| Aromatic | 64 | Aromatic system descriptors |
| Atom Pairs | 56 | Distance-based atom pair fingerprints |
| Framework | 40 | Molecular framework descriptors |
| Constitutional | 34 | Constitutional descriptors |
| Functional | 50+ | Carbonyl, nitrogen, sulfur, oxygen, heterocyclic scaffolds |
| Pharmacophore | 30+ | Pharmacophore points, density, drug-likeness metrics |
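
To show what the Topology and Zagreb rows measure, the sketch below computes the Wiener index and the first and second Zagreb indices on a hydrogen-suppressed graph. It is an independent illustration in Python, not cchem's C implementation; the adjacency-dict representation and function names are assumptions.

```python
from collections import deque


def wiener_index(adjacency: dict[int, list[int]]) -> int:
    """Sum of shortest-path lengths over all unordered atom pairs."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives shortest paths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end


def zagreb_indices(adjacency: dict[int, list[int]]) -> tuple[int, int]:
    """First index: sum of squared degrees. Second: sum of degree products over bonds."""
    degree = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    m1 = sum(d * d for d in degree.values())
    m2 = sum(degree[a] * degree[b] for a in adjacency for b in adjacency[a] if a < b)
    return m1, m2


# Heavy-atom graphs: n-butane is a 4-atom path, isobutane a 3-armed star.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane), zagreb_indices(n_butane))
print(wiener_index(isobutane), zagreb_indices(isobutane))
```

The two isomers share a formula but give different values, which is the point of topological descriptors: they encode branching.
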
./cchem depict -S "CN1CCN(CC1)C/2=N/c4ccccc4Nc3sc(C)cc\23" -o olanzapine.png --colored-atoms --quality 100

- 2D coordinate generation with automatic layout
- 3D coordinate generation with MMFF94 force field optimization
- Render styles: wireframe, sticks, ball-and-stick, spacefill, surface
- PNG, SVG, and JPEG output with configurable quality
- Atom coloring by element (CPK scheme)
- CSV and Parquet batch processing with parallel execution
- Dataset splitting: random, scaffold-based (Murcko), stratified
- Configurable train/validation/test ratios
- SIMD optimization (-march=native)
- Multi-threaded processing (pthreads)
- Link Time Optimization (LTO)
- Arena allocators and thread-local caches
- Batch compute functions per descriptor category
- Pipeline streaming for constant memory on large datasets
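
The pipeline-streaming item above refers to processing data in bounded pieces rather than loading a whole file. The following Python sketch illustrates that general idea only; it says nothing about how cchem implements it in C, and `iter_chunks`, `stream_process`, and the `result` column are hypothetical names.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows without reading the whole input."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def stream_process(in_file, out_file, process_row, smiles_col="smiles", chunk_size=1000):
    """Read CSV rows lazily, process them chunk by chunk, and write as we go."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=list(reader.fieldnames) + ["result"])
    writer.writeheader()
    n_rows = 0
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            row["result"] = process_row(row[smiles_col])
            writer.writerow(row)
        n_rows += len(chunk)
    return n_rows


# Demo with in-memory files; len() stands in for a real per-molecule computation.
source = io.StringIO("smiles\nCCO\nc1ccccc1\nCC(=O)O\n")
sink = io.StringIO()
processed = stream_process(source, sink, process_row=len, chunk_size=2)
print(processed)
```

At any moment only one chunk is held in memory, so peak usage depends on `chunk_size` rather than on the size of the dataset.
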
cchem is available as a Python package via pip:
pip install pycchem

import pycchem
# Canonicalize SMILES
pycchem.canonicalize("c1ccccc1") # => 'c1ccccc1'
# Parse molecule and compute descriptors
mol = pycchem.Molecule("CCO")
mol.descriptor("MolecularWeight") # => 46.069
descriptors = mol.descriptors() # => dict of all 1600+ descriptors
# Batch canonicalization
results = pycchem.canonicalize_batch(["CCO", "c1ccccc1", "CC(=O)O"])
# Sanitize molecules
pycchem.sanitize("[Na+].CC(=O)[O-]", flags="complete") # => 'C(=O)(O)C'
# Validate SMILES
pycchem.validate("CCO") # => True
pycchem.validate("invalid") # => False

The package also installs a cchem CLI command:

cchem canonicalize -S "c1ccccc1"
cchem compute -S "CCO" -d MolecularWeight

Supports Python 3.9+ on Linux, macOS, and Windows.
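
The validation and canonicalization calls shown above can be combined into a small cleaning step. The helper below is a sketch: `canonicalize_valid` is a hypothetical name, and the two functions are passed in as arguments so that the example runs without the package installed. The stand-ins in the demo are not chemistry-aware.

```python
def canonicalize_valid(smiles_list, validate, canonicalize):
    """Split inputs into canonical SMILES and rejected strings.

    validate and canonicalize are injected so the helper works with
    pycchem.validate / pycchem.canonicalize or with test stand-ins.
    """
    accepted, rejected = [], []
    for smiles in smiles_list:
        if validate(smiles):
            accepted.append(canonicalize(smiles))
        else:
            rejected.append(smiles)
    return accepted, rejected


# Stand-ins for the demo only: reject one known-bad string, strip whitespace.
demo_validate = lambda s: s != "invalid"
demo_canonicalize = str.strip

accepted, rejected = canonicalize_valid(
    ["CCO", "invalid", " c1ccccc1 "], demo_validate, demo_canonicalize
)
print(accepted, rejected)
```

With the package installed, the same helper would be called as `canonicalize_valid(smiles, pycchem.validate, pycchem.canonicalize)`.
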
| Library | Purpose | Required |
|---|---|---|
| pthreads | Parallel processing | Yes |
| libm | Mathematics functions | Yes |
| zlib | Compression (for Parquet) | Yes |
| zstd | Compression (for Parquet) | Yes |
| cairo | 2D vector graphics rendering | Optional |
| libjpeg | JPEG image output | Optional |
| carquet | Apache Parquet file support | Auto-fetched |
Cairo and libjpeg are only required for the depict command. Build with -DWITH_CAIRO=OFF to skip these dependencies.
macOS (Homebrew):

# Full installation (with depict command)
brew install cairo jpeg zstd
# Minimal installation (without depict)
brew install zstd

Debian/Ubuntu (apt):

# Full installation (with depict command)
sudo apt-get install libcairo2-dev libjpeg-dev zlib1g-dev libzstd-dev
# Minimal installation (without depict)
sudo apt-get install zlib1g-dev libzstd-dev

Windows (vcpkg):

# Full installation
vcpkg install cairo libjpeg-turbo zstd zlib --triplet x64-windows
# Minimal installation
vcpkg install zstd zlib --triplet x64-windows

git clone https://github.com/Vitruves/cchem.git
cd cchem
mkdir build && cd build
cmake ..
make -j$(nproc)

# Single molecule
./cchem canonicalize -S "c1ccccc1"
# Output: c1ccccc1
# Batch processing
./cchem canonicalize -f molecules.csv -s smiles -c canonical -o output.csv -n 4

# Complete sanitization (unsalt + aromatize + neutralize + normalize)
./cchem canonicalize -S "[Na+].CC(=O)[O-]" --sanitize complete
# Output: C(=O)(O)C (acetic acid, Na+ removed, carboxylate neutralized)
# Remove salts only
./cchem canonicalize -S "CCO.[Cl-].[Na+]" --sanitize unsalt
# Output: CCO
# Aromatize Kekule form
./cchem canonicalize -S "C1=CC=CC=C1" --sanitize aromatize
# Output: c1ccccc1
# Multiple operations
./cchem canonicalize -S "[NH3+]Cc1ccccc1.[Cl-]" --sanitize unsalt,neutralize,aromatize
# Output: c1ccc(cc1)C[NH2] (benzylamine)
# List tautomers
./cchem canonicalize -S "CC(=O)CC" --list-tautomers
# Output: CCC(=O)C (keto-enol tautomers)

# Single molecule with specific descriptors
./cchem compute -S "CCO" -d CarbonCount,HydrogenCount,MolecularWeight
# All descriptors for a dataset (CSV or Parquet)
./cchem compute -f data.csv -s smiles -d all -o descriptors.csv -n 8
./cchem compute -f data.parquet -s smiles -d all -o descriptors.parquet -n 8
# List available descriptors
./cchem compute --list

# 2D depiction (supports .jpg, .png, .svg output)
./cchem depict -S "c1ccccc1" -o benzene.png -m 2d
# 3D with MMFF94 optimization
./cchem depict -S "CCO" -o ethanol.jpg -m 3d -s balls-sticks --max-iter 500

# Train/test split (80/20)
./cchem split -f data.csv -s smiles -o train.csv,test.csv
# Train/validation/test with scaffold splitting
./cchem split -f data.csv -s smiles -o train.csv,val.csv,test.csv \
--split-ratios 80,10,10 --splitting-method scaffold

Convert SMILES to canonical form (aromatized by default) with optional sanitization.
Usage: cchem canonicalize [options]
Options:
-S, --smiles <string> Single SMILES string to canonicalize
-f, --file <path> Input CSV file
-s, --smiles-col <name> Column name containing SMILES (default: smiles)
-c, --canon-col <name> Output column name for canonical SMILES
-o, --output <path> Output file path
-n, --threads <num> Number of threads (default: auto)
--sanitize <opts> Apply sanitization before canonicalization
Values: "complete" or comma-separated list of:
unsalt - Remove salts, keep largest fragment
aromatize - Perceive and apply aromaticity (default)
kekulize - Convert aromatic to Kekule form
neutralize - Neutralize charges
normalize - Normalize functional groups
remove-stereo - Remove stereochemistry
remove-isotopes - Remove isotope labels
remove-h - Remove explicit hydrogens
validate - Validate structure
--list-tautomers List all tautomeric forms
-v, --verbose Enable verbose output
-h, --help Print this help message
Calculate molecular descriptors.
Usage: cchem compute [options]
Options:
-S, --smiles <string> Single SMILES string
-f, --file <path> Input CSV file
-s, --smiles-col <name> Column name containing SMILES (default: smiles)
-d, --descriptors <list> Comma-separated descriptor names or "all"
-o, --output <path> Output file path
-n, --threads <num> Number of threads (default: 1)
-l, --list List all available descriptors
--no-canonicalization Skip SMILES canonicalization
-v, --verbose Enable verbose output
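
The options above can also be assembled programmatically when the CLI is driven from a script. The sketch below only builds the argument list, using flags taken from the help text; `build_compute_command` and its defaults are hypothetical and not part of cchem or pycchem.

```python
import shlex


def build_compute_command(input_path, output_path, descriptors="all",
                          smiles_col="smiles", threads=1, executable="cchem"):
    """Assemble an argument list for `cchem compute` from the documented flags."""
    if not isinstance(descriptors, str):
        descriptors = ",".join(descriptors)  # -d expects a comma-separated list
    return [
        executable, "compute",
        "-f", str(input_path),
        "-s", smiles_col,
        "-d", descriptors,
        "-o", str(output_path),
        "-n", str(threads),
    ]


cmd = build_compute_command("data.csv", "descriptors.csv",
                            descriptors=["CarbonCount", "MolecularWeight"], threads=8)
print(shlex.join(cmd))
# To run it, pass the list to subprocess.run(cmd, check=True).
```

Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.
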
Generate 2D/3D molecular structure images.
Usage: cchem depict [options]
Options:
-S, --smiles <string> SMILES string to depict
-o, --output <path> Output file path (.jpg, .png, or .svg)
-m, --mode <2d|3d> Rendering mode (default: 2d)
-s, --style <style> Render style:
wireframe - Simple lines
sticks - Colored bond sticks
balls-sticks - Ball and stick model
spacefill - CPK space-filling
surface - Molecular surface
-W, --width <pixels> Image width (default: 800)
-H, --height <pixels> Image height (default: 800)
--bond-length <float> Bond length in pixels (default: 35)
--bond-width <float> Bond width in pixels (default: 2)
--margin <pixels> Image margin (default: 20)
--show-carbons Show carbon atom labels
--show-hydrogens Show hydrogen atoms
--terminal-carbons Show terminal carbon labels (e.g., CH3)
--toggle-aromaticity Toggle aromatic ring display
--colored-atoms Color heteroatom labels by element
--proportional-atoms Scale atom labels proportionally
--atom-filling <float> Atom sphere filling (0.0-1.0)
--quality <int> JPEG quality (1-100, default: 90)
--max-iter <int> Max MMFF94 iterations (default: 200)
--surface-color <hex> Surface color (e.g., 0x808080)
--scale <float> Scale factor for rendering
--font-scale <float> Font scale factor
--heteroatom-gap <float> Gap around heteroatom labels
--line-cap <style> Line cap style (butt, round, square)
--transparent-background Transparent background (PNG/SVG only)
--debug Enable debug output
Split datasets for machine learning.
Usage: cchem split [options]
Options:
-f, --file <path> Input CSV file
-o, --output <paths> Comma-separated output file paths
-s, --smiles-col <name> Column containing SMILES (default: smiles)
-n, --threads <num> Number of threads for processing
--split-ratios <ratios> Comma-separated percentages (default: 80,20)
--splitting-method <method> Splitting method:
random - Random split
scaffold - Murcko scaffold-based
--stratified Use stratified splitting
--seed <int> Random seed for reproducibility
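
The scaffold method above keeps molecules that share a scaffold in the same subset. One common way to do this is a greedy, largest-group-first assignment, sketched below. This is an assumption about the general technique, not a description of cchem's algorithm; the scaffold keys are taken as precomputed strings and `scaffold_split` is a hypothetical name.

```python
from collections import defaultdict


def scaffold_split(scaffolds, ratios=(80, 10, 10)):
    """Assign row indices to subsets so that no scaffold spans two subsets.

    scaffolds: one scaffold key per row (assumed to be precomputed).
    ratios: percentages per subset, for example train, validation, test.
    """
    groups = defaultdict(list)
    for index, key in enumerate(scaffolds):
        groups[key].append(index)

    total = len(scaffolds)
    targets = [total * r / sum(ratios) for r in ratios]
    subsets = [[] for _ in ratios]

    # Largest groups first; ties broken by first occurrence for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    for group in ordered:
        for subset, target in zip(subsets, targets):
            if len(subset) + len(group) <= target:
                subset.extend(group)  # first subset with enough room
                break
        else:
            subsets[-1].extend(group)  # nothing fits: overflow into the last subset
    return subsets


keys = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole", "furan"]
train, val, test = scaffold_split(keys, ratios=(80, 10, 10))
print(train, val, test)
```

Because whole groups are moved, the realized proportions only approximate the requested ratios when scaffold groups are large relative to the dataset.
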
Check SMILES syntax validity.
Usage: cchem validate [options]
Options:
-S, --smiles <string> SMILES string to validate
-v, --verbose Show detailed validation info
cchem version # Show version information
cchem help # Show help message

- CMake 3.16+
- C11 compatible compiler (GCC, Clang, MSVC)
- Required: zlib, zstd
- Optional: cairo, libjpeg (for depict command)
mkdir build && cd build
cmake ..
make -j$(nproc)

| Option | Default | Description |
|---|---|---|
| CMAKE_BUILD_TYPE | Release | Build type: Release, Debug, RelWithDebInfo |
| ENABLE_SANITIZERS | OFF | Enable AddressSanitizer and UBSan (Debug only) |
| ENABLE_NATIVE_ARCH | ON | Optimize for native CPU (-march=native) |
| WITH_CAIRO | ON | Enable Cairo for 2D rendering and depict command |
| BUILD_BENCHMARKS | OFF | Build performance benchmarks |
| WITH_RDKIT | OFF | Include RDKit in benchmarks |
| WITH_OPENBABEL | OFF | Include OpenBabel in benchmarks |

# Standard release build (default)
cmake -DCMAKE_BUILD_TYPE=Release ..
# Debug build with sanitizers
cmake -DCMAKE_BUILD_TYPE=Debug -DENABLE_SANITIZERS=ON ..
# Portable build (disable native CPU optimization)
cmake -DENABLE_NATIVE_ARCH=OFF ..
# Minimal build without Cairo/JPEG (no depict command)
cmake -DWITH_CAIRO=OFF ..
# Build with benchmarks
cmake -DBUILD_BENCHMARKS=ON -DWITH_RDKIT=ON ..

For systems without Cairo or when 2D rendering is not needed:

cmake -DWITH_CAIRO=OFF ..
make -j$(nproc)

This disables:
- 2D/3D molecular visualization (depict command)
- MMFF94 force field
- JPEG dependency
Core functionality (canonicalization, descriptors, splitting) remains fully available.
cd build
ctest # Run all tests
ctest -V # Verbose output
ctest -R test_canon # Run specific test

cchem/
├── include/cchem/
│ ├── cchem.h # Main library header
│ ├── descriptors.h # Descriptor system
│ ├── canonicalizer/ # SMILES parsing & canonicalization
│ │ ├── sanitize.h # Sanitization API
│ │ └── ... # Parser, molecule, stereo, etc.
│ └── depictor/ # 2D/3D visualization
├── src/
│ ├── main.c # CLI entry point
│ ├── canonicalizer/ # SMILES implementation
│ │ ├── sanitize.c # Sanitization implementation
│ │ └── ...
│ ├── descriptors/ # 500+ descriptor implementations
│ └── depictor/ # Visualization & MMFF94
├── tests/ # Test suite
├── data/ # Training data & test files
└── doc/ # Documentation
Licensed under the Apache License, Version 2.0.
Copyright 2025
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

