This guide walks through running a complete benchmark comparing ferro-hgvs against mutalyzer, biocommons/hgvs, and hgvs-rs.
- Overview
- Quick Start
- Prerequisites & Build
- Download ClinVar Data
- Network Fetch Behavior
- Network vs Local Performance Comparison
- UTA Alignment Loading
- Recommended Workflow
- Prepare All Tools
- Check All Tools
- Parse Comparison: Small & Medium
- Normalize Comparison: Small & Medium
- Parse Comparison: Full ClinVar
- Normalize Comparison: Full ClinVar
- Troubleshooting
- CLI Command Reference
| Tool | Type | Language | Speed (local) | Speed (network) | ferro Speedup |
|---|---|---|---|---|---|
| ferro-hgvs | Parser + Normalizer | Rust | ~4M patterns/sec | N/A | N/A |
| mutalyzer | Parser + Normalizer | Python | ~20 patterns/sec | ~1 pattern/sec | 20x |
| biocommons/hgvs | Parser + Normalizer | Python | ~20 patterns/sec | ~0.2 patterns/sec | 100x |
| hgvs-rs | Parser + Normalizer | Rust | ~2 patterns/sec | ~0.2 patterns/sec | 10x |
Results in this guide were generated with:
| Component | Version |
|---|---|
| ferro-hgvs | 0.1.0 |
| mutalyzer-hgvs-parser | 0.3.9 |
| mutalyzer-algebra | 1.5.2 |
| biocommons hgvs | 1.5.6 |
| hgvs-rs | 0.19.1 |
| cdot | 0.2.32 (RefSeq GRCh38) |
| SeqRepo | 2021-01-29 |
| UTA | uta_20210129b |
| ClinVar | ~57.5M patterns (Dec 2024) |
Note: Results may vary with different versions or ClinVar releases.
| Tier | Patterns | Tools | Approximate Runtime |
|---|---|---|---|
| Smoke | 100 | All 4 | ~10 sec |
| Small | 1,000 | All 4 | ~1-2 min |
| Medium | 10,000 | All 4 | ~10-20 min |
| Full ClinVar | ~57.5M | ferro + mutalyzer + hgvs-rs | ~2-4 hours |
Note: biocommons/hgvs doesn't scale to full ClinVar due to database query overhead (~27 days estimated).
| Pattern Type | ferro | mutalyzer | biocommons | hgvs-rs |
|---|---|---|---|---|
| Genomic (g.) | ✓ | ✓ | ✓ | ✓ |
| Coding (c.) exonic | ✓ | ✓ | ✓ | ✓ |
| Coding (c.) intronic | ✓ | ✓** | ✗ | ✗ |
| Non-coding (n.) | ✓ | ✓ | ✓ | ✓ |
| RNA (r.) | ✓ | ✓ | ✓ | ✓ |
| Protein (p.) | ✓ | Net* | ✗ | ✗ |
* mutalyzer protein normalization requires network access for NP_→NM_ lookups (cannot be cached locally). ** mutalyzer intronic support is enabled by default via genomic-context rewriting; disable with --no-rewrite-intronic.
- biocommons protein normalization: Explicitly not supported by design. Use ferro or hgvs-rs instead.
- hgvs-rs/biocommons intronic: Neither tool normalizes intronic variants. Use ferro-hgvs or mutalyzer.
- mutalyzer protein normalization: Requires network access even with a local cache. Mutalyzer back-translates protein variants to DNA, which requires looking up NP_→NM_ transcript relationships from NCBI's API (
get_cds_to_mrna). This lookup cannot be cached locally. Use ferro or hgvs-rs for offline protein normalization.
ferro-hgvs achieves comprehensive pattern support through its reference data preparation:
| Reference Data | Patterns Enabled | Without ferro |
|---|---|---|
| RefSeq transcripts (NM_, NR_, XM_) | All transcript-based variants | Network fetch per variant |
| GRCh38/GRCh37 genomes | NC_ genomic variants | Network fetch or failure |
| RefSeqGene (NG_) | Gene region variants | Network fetch or failure |
| LRG sequences | LRG_ stable references | Network fetch or failure |
| Protein sequences (NP_, XP_) | Protein normalization | Network fetch (mutalyzer) or failure (biocommons) |
| cdot transcript metadata | Intronic coordinate mapping | Not available (hgvs-rs/biocommons fail on intronic) |
| Legacy transcript versions | Historical ClinVar variants | Network fetch or failure |
When you run ferro prepare, all this reference data is downloaded and organized. Other tools
(mutalyzer, biocommons, hgvs-rs) can then use this cached data for local operation, dramatically
improving their performance from ~1 pattern/sec (network) to ~20 patterns/sec (local).
At each comparison step, we collect:
- Runtime: Total time and patterns/second
- Success rate: Percentage of patterns successfully parsed/normalized
- Cross-tool agreement: How often tools produce the same result
Minimal commands to run a small comparison (for experienced users).
⚠️ Quick Start Limitations:
- Only sets up ferro + mutalyzer (not biocommons/hgvs-rs)
- For full ClinVar testing, see Prepare All Tools
# Clone (skip large LFS files for faster setup)
GIT_LFS_SKIP_SMUDGE=1 git clone <repo-url>
cd ferro-hgvs
# Build
cargo build --release --features benchmark,hgvs-rs
# 1. Prepare ferro reference data (main binary - no benchmark feature needed)
ferro prepare --output-dir data/ferro
# 2. Prepare mutalyzer cache (uses proteins from ferro)
ferro-benchmark prepare mutalyzer \
--ferro-reference data/ferro \
--output-dir data/mutalyzer \
--shards 32
# 3. Create test patterns
echo "NM_000059.4:c.100del" > patterns.txt
echo "NM_001374258.1:c.1A>G" >> patterns.txt
# 4. Parse and normalize with ferro (for benchmark comparison, use ferro-benchmark)
ferro-benchmark parse ferro -i patterns.txt -o ferro_parse.json
ferro-benchmark normalize ferro -i patterns.txt -o ferro_norm.json --reference data/ferro
# 5. Parse and normalize with mutalyzer
ferro-benchmark parse mutalyzer -i patterns.txt -o mutalyzer_parse.json
ferro-benchmark normalize mutalyzer -i patterns.txt -o mutalyzer_norm.json \
--mutalyzer-settings data/mutalyzer/mutalyzer_settings.conf -j 4
# 6. Compare results
ferro-benchmark compare results parse ferro_parse.json mutalyzer_parse.json -o parse_comparison.json
ferro-benchmark compare results normalize ferro_norm.json mutalyzer_norm.json -o norm_comparison.jsonFor biocommons/hgvs-rs: These require UTA database setup. Download the UTA dump manually from https://dl.biocommons.org/uta/uta_20210129b.pgd.gz (requires human verification), then see Prepare biocommons.
Throughput tip — sort the input. ferro's normalize cost is dominated by transcript resolution (reading the full transcript sequence from the reference and rebuilding its metadata). Resolved transcripts are memoized, so input sorted by transcript accession (or by genomic position, which clusters variants onto shared transcripts) maximizes the cache hit rate and can be an order of magnitude faster on large batches than randomly-ordered input. This matters most when the number of distinct transcripts exceeds the cache capacity; sort large or genome-wide inputs to keep the hot transcript set resident.
| Requirement | Minimum | Recommended |
|---|---|---|
| Disk Space | 5GB | 50GB (full setup) |
| RAM | 4GB | 16GB |
| CPU Cores | 2 | 8+ |
| Software | Version | Required For |
|---|---|---|
| Rust | 1.70+ | Building ferro-benchmark |
| Python | 3.10+ | mutalyzer, biocommons validators |
| Docker | Any | biocommons/hgvs-rs local UTA setup |
| pixi | Any | Python dependency management (recommended) |
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env
# Install pixi (recommended for Python deps)
curl -fsSL https://pixi.sh/install.sh | bash
# Install Docker (macOS)
brew install --cask docker
# Start Docker Desktop from ApplicationsThe prepare mutalyzer command fetches sequences from NCBI. By default, NCBI limits unauthenticated requests to 3/second. With an API key, you get 10/second (3x faster).
# 1. Create an NCBI account at https://www.ncbi.nlm.nih.gov/account/
# 2. Go to Settings → API Key Management → Create API Key
# 3. Set the environment variable:
export NCBI_API_KEY="your-api-key-here"
# Add to shell profile for persistence:
echo 'export NCBI_API_KEY="your-api-key"' >> ~/.zshrc # or ~/.bashrcThe prepare mutalyzer --patterns command will automatically use this key for faster fetching.
# Standard build (without hgvs-rs)
cargo build --release --features benchmark
# Full build with hgvs-rs support (recommended)
cargo build --release --features benchmark,hgvs-rs
# Verify build
./target/release/ferro-benchmark --helppixi shell # Activates Python environment with mutalyzer, biocommons
cargo build --release --features benchmark,hgvs-rs# Download ClinVar HGVS data
mkdir -p data/clinvar
curl -o data/clinvar/hgvs4variation.txt.gz \
https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/hgvs4variation.txt.gz
# Check size (~500MB compressed)
ls -lh data/clinvar/hgvs4variation.txt.gz
# Extract all patterns (~57.5M)
ferro-benchmark extract clinvar \
-i data/clinvar/hgvs4variation.txt.gz \
-o data/clinvar/clinvar_patterns.txt
# Check count
wc -l data/clinvar/clinvar_patterns.txtSample files should be generated separately for parse and normalize tests:
- Parse samples: Include all pattern types (proteins can be parsed by all tools)
- Normalize samples: Exclude protein patterns (use
--exclude-protein)
Protein patterns are excluded from normalize samples because protein normalization is handled inconsistently across tools - ferro and hgvs-rs return proteins unchanged (no-op), biocommons explicitly rejects them as "unsupported", and mutalyzer requires network access. Excluding proteins ensures a fair comparison of actual normalization capabilities.
# Create sample directories
mkdir -p samples/parse samples/normalize
# ===== PARSE SAMPLES (include proteins) =====
# Smoke test (100 patterns)
ferro-benchmark extract sample -i data/clinvar/clinvar_patterns.txt \
-o samples/parse/sample_100.txt --size 100 --seed 42
# Small (1K patterns)
ferro-benchmark extract sample -i data/clinvar/clinvar_patterns.txt \
-o samples/parse/sample_1k.txt --size 1000 --seed 42
# Medium (10K patterns)
ferro-benchmark extract sample -i data/clinvar/clinvar_patterns.txt \
-o samples/parse/sample_10k.txt --size 10000 --seed 42
# ===== NORMALIZE SAMPLES (exclude proteins) =====
# Smoke test (100 patterns, no proteins)
ferro-benchmark extract sample -i data/clinvar/clinvar_patterns.txt \
-o samples/normalize/sample_100.txt --size 100 --seed 42 --exclude-protein
# Small (1K patterns, no proteins)
ferro-benchmark extract sample -i data/clinvar/clinvar_patterns.txt \
-o samples/normalize/sample_1k.txt --size 1000 --seed 42 --exclude-protein
# Medium (10K patterns, no proteins)
ferro-benchmark extract sample -i data/clinvar/clinvar_patterns.txt \
-o samples/normalize/sample_10k.txt --size 10000 --seed 42 --exclude-proteinWhy exclude proteins from normalize samples? Protein normalization is effectively a no-op (substitutions like
p.Glu483Lysdon't need 3' shifting). Different tools handle this inconsistently:
- ferro, hgvs-rs: Accept and return unchanged → "success"
- biocommons: Explicitly rejects as "unsupported" → "failure"
- mutalyzer: Requires network for NP_→NM_ lookup
Using
--exclude-proteinensures normalize comparisons measure actual normalization capability.
Genomic patterns (NC_) and mutalyzer: The mutalyzer cache is built for transcript-based normalization. Genomic patterns like
NC_000007.14:g.12345A>Grequire either:
- Full chromosome sequences in the cache (not included by default)
- Network access (
--allow-network) to fetch sequences from EBI- Rewriting to transcript context via
--rewrite-genomic(if transcript mappings exist)For mutalyzer cache-only benchmarks, consider filtering NC_ patterns from normalize samples or using
--allow-networkand accepting slower performance.
Understanding what requires network access helps plan offline deployments and minimize external dependencies.
| Command | Network Fetches | Notes |
|---|---|---|
prepare ferro |
Required | Downloads RefSeq transcripts, cdot, genomes, RefSeqGene, LRG from NCBI/EBI |
prepare ferro --patterns |
Required | Additionally fetches legacy transcript versions detected in patterns |
prepare mutalyzer |
Usually none | Uses ferro reference; may fetch legacy versions not in patterns (see below) |
prepare biocommons |
Required | UTA Docker image + SeqRepo snapshot (~10GB each) |
prepare hgvs-rs |
Usually none | Shares UTA/SeqRepo with biocommons |
The mutalyzer cache includes a hardcoded list of 33 legacy transcript versions (LEGACY_TRANSCRIPT_VERSIONS in src/benchmark/cache.rs). These are older RefSeq versions commonly found in ClinVar that have been superseded.
To ensure zero NCBI fetches during prepare mutalyzer:
# Option 1: Include all legacy versions in your patterns file
# Add these lines to your patterns file before running prepare ferro:
echo "NM_000051.3:c.1A>G" >> my_patterns.txt
echo "NM_000174.4:c.1A>G" >> my_patterns.txt
# ... (see full list in src/benchmark/cache.rs:942-977)
# Then prepare ferro with patterns and clinvar (derives proteins locally)
ferro-benchmark prepare ferro --output-dir data/ferro \
--patterns my_patterns.txt \
--clinvar data/clinvar/hgvs4variation.txt.gz
# Mutalyzer will use ferro's cached sequences and proteins
ferro-benchmark prepare mutalyzer --ferro-reference data/ferro --output-dir data/mutalyzer --shards 32# Option 2: Accept that mutalyzer may fetch ~33 legacy versions
# This takes about 10 seconds with rate limiting
ferro-benchmark prepare mutalyzer --ferro-reference data/ferro --output-dir data/mutalyzer --shards 32These tools require one-time downloads that can be cached:
-
UTA Database (~10GB):
- Download manually: https://dl.biocommons.org/uta/uta_20210129b.pgd.gz
- Or use Docker:
docker pull biocommons/uta:uta_20210129b
-
SeqRepo Snapshot (~10GB):
- Downloads automatically during
prepare biocommons - Cache at:
<seqrepo-dir>/2021-01-29/
- Downloads automatically during
-
Additional sequences from ferro (automatic):
# By default, all ferro transcripts (~270K) are loaded into SeqRepo ferro-benchmark prepare biocommons \ --seqrepo-dir data/seqrepo \ --ferro-reference data/ferro # Use --no-load-transcripts to skip transcript loading (only supplemental sequences)
The LEGACY_TRANSCRIPT_VERSIONS constant is manually maintained. It was created from historical ClinVar test patterns and contains RefSeq versions that:
- Are referenced in ClinVar variant annotations
- Have been superseded by newer versions in the current cdot release
- Are not available in the standard RefSeq transcript downloads
Current list (33 accessions):
NM_000051.3, NM_000174.4, NM_000246.3, NM_000267.3, NM_000280.5,
NM_000426.3, NM_000449.3, NM_001127221.1, NM_001271223.2, NM_001458.4,
NM_002529.3, NM_002878.3, NM_004006.2, NM_004082.4, NM_004321.7,
NM_004364.4, NM_015125.4, NM_015243.2, NM_015247.2, NM_017890.4,
NM_018400.3, NM_018993.3, NM_019023.4, NM_021946.4, NM_030962.3,
NM_030984.5, NM_032790.3, NM_033517.2, NM_138387.3, NM_152383.4,
NM_172201.1, NM_177438.2, NM_203447.3
To add new legacy versions:
- Edit
src/benchmark/cache.rsand add toLEGACY_TRANSCRIPT_VERSIONS - Rebuild with
cargo build --release --features benchmark - Re-run
prepare mutalyzerto fetch the new versions
ferro's reference preparation commands provide significant speedups by eliminating
network latency. Use the --allow-network flag to measure baseline performance
without local caching.
# With ferro cache (fast, ~20 patterns/sec)
ferro-benchmark normalize mutalyzer -i sample_100.txt -o cached.json \
--mutalyzer-settings data/mutalyzer/mutalyzer_settings.conf -j 1
# Without cache - network mode (slow, ~1 pattern/sec)
ferro-benchmark normalize mutalyzer -i sample_100.txt -o network.json \
--mutalyzer-settings data/mutalyzer/mutalyzer_settings.conf --allow-network -j 1# With local UTA + SeqRepo (fast, ~20 patterns/sec)
ferro-benchmark normalize biocommons -i sample_100.txt -o local.json \
--biocommons-settings data/ferro/biocommons_settings.conf
# With remote UTA (slow, ~0.2 patterns/sec)
# Simply omit --biocommons-settings to use remote uta.biocommons.org
ferro-benchmark normalize biocommons -i sample_100.txt -o remote.json| Tool | Network Mode | With ferro prep | Speedup |
|---|---|---|---|
| Mutalyzer | ~1 pattern/sec | ~20 patterns/sec | 20x |
| biocommons/hgvs | ~0.2 patterns/sec | ~20 patterns/sec | 100x |
| hgvs-rs | ~0.2 patterns/sec | ~2 patterns/sec | 10x |
Note: Network mode is rate-limited by NCBI (3-10 req/sec) and remote UTA server capacity. Actual speeds may vary based on network conditions and server load.
- Benchmarking: Measure the actual speedup from ferro's preparation
- Testing: Verify that patterns work without local cache
- Debugging: Identify cache misses that cause network fallback
When using --allow-network, the output JSON includes network statistics:
{
"network_calls": 42,
"network_stats": [
{"url": "https://eutils.ncbi.nlm.nih.gov/...", "elapsed_seconds": 0.5, "success": true}
]
}The UTA snapshot (2021-01-29) lacks transcript-to-genome alignments for newer transcripts.
ferro automatically loads missing alignments from cdot during prepare hgvs-rs.
hgvs-rs requires transcript alignments (exon structures) in UTA to normalize variants. When a transcript is missing from UTA:
Error: replacing reference failed for NM_001407958.1
The cdot JSON file contains these alignments for newer transcripts. ferro bridges the gap by loading them into UTA.
- Loads cdot JSON to find all transcript alignments
- Queries UTA for transcripts that already have alignments
- Inserts missing transcripts and their exon structures
- Creates exon alignments linking transcript to genomic coordinates
# Check how many transcripts were loaded by ferro
docker exec ferro-uta psql -U anonymous -d uta -c \
"SELECT COUNT(*) FROM uta_20210129b.transcript \
WHERE origin_id = (SELECT origin_id FROM uta_20210129b.origin WHERE name = 'cdot-ferro');"
# Verify a specific transcript
docker exec ferro-uta psql -U anonymous -d uta -c \
"SELECT tx_ac, alt_ac, alt_aln_method FROM uta_20210129b.exon_set \
WHERE tx_ac = 'NM_001407958.1';"# Skip UTA alignment loading (not recommended)
ferro-benchmark prepare hgvs-rs \
--seqrepo-dir data/seqrepo \
--ferro-reference data/ferro \
--no-load-alignmentsWarning: Without alignment loading, hgvs-rs will fail on transcripts added after the UTA snapshot date (January 2021).
For optimal performance with minimal network fetches, prepare ferro reference first, then use it for all other tool preparations.
| Data Type | Source | Mutalyzer | Biocommons | hgvs-rs |
|---|---|---|---|---|
| RefSeq transcripts | ferro | ✅ | ✅ | ✅ |
| Supplemental transcripts | ferro | ✅ | ✅ | ✅ |
| Legacy transcripts | ferro (if in patterns) | ✅ | ✅ | ✅ |
| Legacy GenBank | ferro | ✅ | ✅ | ✅ |
| Protein sequences | ferro (derived from CDS) | ✅ | N/A | N/A |
| LRG sequences | ferro | ✅ | ✅ | ✅ |
| LRG annotations | ferro XML | ✅ | N/A | N/A |
| Genomic (NC_) | ferro | Skipped | ✅ | ✅ |
| Transcript alignments | cdot | N/A | N/A | ✅ (UTA) |
Note: Mutalyzer cannot normalize protein (p.) patterns - see Design Limitations. However, it uses protein sequences for reference validation.
With a fully prepared ferro reference:
- Zero NCBI nucleotide fetches in mutalyzer prepare (if all legacy versions are in patterns)
- Zero protein fetches (derived from CDS translation)
- Zero EBI fetches for LRG data
- Biocommons/hgvs-rs: Zero NCBI fetches after initial SeqRepo/UTA setup
The prepare ferro command downloads and caches:
- RefSeq transcripts: Current transcript sequences from NCBI (~1GB)
- cdot metadata: Transcript-to-genome mappings (~200MB)
- Genomes: GRCh37 and GRCh38 chromosome sequences (~4GB)
- RefSeqGene: Gene-specific reference sequences (~600MB)
- LRG sequences: Locus Reference Genomic sequences (~50MB)
- Legacy transcript versions (with --patterns): Older RefSeq versions detected in patterns
- Legacy GenBank sequences: Historical non-RefSeq accessions (U31929.1, M75126.1, etc.)
- Protein sequences (with --clinvar): Derived from transcript CDS translation (~200MB)
The prepare all command sets up all four tools in sequence with sensible defaults:
# Prepare all tools (ferro → mutalyzer → biocommons → hgvs-rs)
# Requires Docker for biocommons/hgvs-rs UTA setup
ferro-benchmark prepare all \
--output-dir data/benchmark \
--seqrepo-dir data/seqrepo \
--uta-dump path/to/uta_20210129b.pgd.gz \
--patterns data/clinvar/clinvar_patterns.txt \
--clinvar data/clinvar/hgvs4variation.txt.gzThis creates:
data/benchmark/ferro/- Ferro reference datadata/benchmark/mutalyzer/- Mutalyzer cachedata/benchmark/biocommons_settings.conf- Biocommons settingsdata/benchmark/hgvs-rs_settings.conf- hgvs-rs settingsdata/seqrepo/- Shared SeqRepo for biocommons and hgvs-rs
After completion, verify with:
ferro-benchmark check all \
--reference data/benchmark \
--seqrepo-path data/seqrepo/2021-01-29# Basic preparation (available in main ferro binary - no benchmark feature needed)
ferro prepare --output-dir data/ferro
# Full setup for ClinVar testing with supplemental accessions (~7GB total)
# Store patterns and clinvar for auto-detection by other tools
ferro-benchmark prepare ferro --output-dir data/ferro \
--patterns data/clinvar/clinvar_patterns.txt \
--clinvar data/clinvar/hgvs4variation.txt.gzNote: By default,
preparedownloads all reference data:
- GRCh38 + GRCh37 genomes (~4GB total)
- RefSeqGene sequences (~600MB)
- LRG sequences (~50MB)
Use
--genome noneto skip genomes,--no-refseqgeneto skip RefSeqGene, or--no-lrgto skip LRG.
What each option does:
--genome all(default) - Both GRCh38 and GRCh37 genomes--genome grch38- GRCh38 only (~3GB)--genome grch37- GRCh37 only (~900MB)--genome none- Skip genome downloads--no-refseqgene- Skip RefSeqGene sequences--no-lrg- Skip LRG sequences--patterns <file>- Store patterns file for provenance and auto-detection--clinvar <file>- Store ClinVar reference and derive protein sequences from transcript CDS
Protein Derivation: When
--clinvaris provided, ferro automatically derives protein sequences from transcript CDS coordinates. For each NM_/XM_ transcript with CDS metadata in cdot, ferro translates the coding sequence to produce the corresponding NP_/XP_ protein sequence. This eliminates the need to fetch protein sequences from NCBI.
Auto-Detection: When you use
--patternsand--clinvar, these files are stored in the ferro reference:
data/ferro/patterns/- patterns file (copied)data/ferro/clinvar/- ClinVar file (symlinked to avoid duplication)Subsequent
prepare mutalyzer,prepare biocommons, andprepare hgvs-rscommands will automatically detect and use these files when you specify--ferro-reference data/ferro.
Note: The mutalyzer cache is built from ferro reference data. Prepare ferro first.
# Full setup for ClinVar testing (recommended)
# Patterns and ClinVar are auto-detected from ferro reference
ferro-benchmark prepare mutalyzer \
--ferro-reference data/ferro \
--output-dir data/mutalyzer \
--shards 32What each flag does:
--ferro-reference- Ferro reference directory (patterns, clinvar, and proteins auto-detected)--shards- Number of logical shards for parallel workers (default: 32)--no-lrg- Skip LRG annotation enhancement (LRG is enhanced by default)--no-proteins- Skip copying proteins from ferro reference--patterns- Override auto-detected patterns with a specific file
Sharding for Memory Efficiency: The
--shardsoption enables memory-efficient parallel normalization. With 32 shards and 8 workers, each worker processes patterns from ~4 shards, loading only ~1/8 of accessions into memory. Mutalyzer loads sequences on-demand, so pattern routing (not cache partitioning) provides the memory savings.
Protein Sequences: Mutalyzer uses protein sequences from the ferro reference. These are automatically derived from transcript CDS during
prepare ferro --clinvar. If ferro has derived proteins, they are copied to the mutalyzer cache automatically.Use
--no-proteinsto skip this step (protein patterns will fail during normalization).
The prepare command automatically:
- Populates the cache from ferro reference data
- Auto-detects patterns from
{ferro-reference}/patterns/directory - Enhances NC_ annotations with transcript mappings for intronic variants
- Enhances LRG annotations (unless
--no-lrg) - Pre-fetches accessions from patterns file
- Copies protein sequences from ferro reference (unless
--no-proteins) - Creates shard manifest for parallel normalization (if
--shards> 1) - Creates
data/mutalyzer/mutalyzer_settings.conf
# Download UTA dump first (requires browser due to human verification):
# https://dl.biocommons.org/uta/uta_20210129b.pgd.gz
# Full setup (patterns auto-detected from ferro reference)
ferro-benchmark prepare biocommons --seqrepo-dir data/seqrepo \
--uta-dump path/to/uta_20210129b.pgd.gz \
--ferro-reference data/ferro
# If UTA is already running (e.g., from previous setup), omit --uta-dump
ferro-benchmark prepare biocommons --seqrepo-dir data/seqrepo \
--ferro-reference data/ferroWhat each option does:
--seqrepo-dir- Where to store/find SeqRepo data (required)--uta-dump- Path to UTA dump file; auto-sets up UTA if not running--ferro-reference- Load sequences from ferro reference; auto-detects patterns frompatterns/subdirectory--patterns <file>- Override auto-detected patterns with a specific file
This creates data/ferro/biocommons_settings.conf with UTA and SeqRepo paths.
Important: The settings file contains absolute paths. If you move the reference data to a different location, you must either:
- Edit the settings file to update
HGVS_SEQREPO_DIRto the new location, OR- Use command-line overrides:
--seqrepo-path /new/path/to/seqrepo/2021-01-29
Note: The
--ferro-referenceoption loads additional sequences into SeqRepo from the ferro reference data:
- NC_ (chromosomes): GRCh38 and GRCh37 genomes
- NG_ (RefSeqGene): Gene-specific reference sequences
- LRG_ (Locus Reference Genomic): Stable genomic references
LRG Support:
- Genomic patterns (
LRG_1:g.5000A>T): Work automatically via SeqRepo aliases- Transcript patterns (
LRG_199t1:c.79del): Use--lrg-mappingto translate to RefSeq:This translatesferro-benchmark normalize biocommons -i patterns.txt -o results.json \ --biocommons-settings data/ferro/biocommons_settings.conf \ --lrg-mapping data/ferro/lrg/lrg_refseq_mapping.txtLRG_199t1→NM_004006.2before normalization.Without these sequences, biocommons will attempt slow network fetches to NCBI/EBI for genomic patterns.
macOS Note: The SeqRepo download uses
seqrepo pullinternally, which requires realrsync(not macOS'sopenrsync). If you encounter rsync errors, use pixi's rsync directly:pixi run seqrepo --rsync-exe $(pixi run which rsync) \ --root-directory data/seqrepo pull --instance-name 2021-01-29
hgvs-rs shares UTA and SeqRepo with biocommons:
# Full setup with cdot alignment loading (recommended)
# Patterns auto-detected from ferro reference, alignments loaded into UTA
ferro-benchmark prepare hgvs-rs --seqrepo-dir data/seqrepo \
--ferro-reference data/ferroNote: hgvs-rs uses the same SeqRepo and UTA as biocommons. The
--ferro-referenceoption:
- Auto-detects patterns from
patterns/subdirectory- Loads transcript alignments from cdot into UTA for transcripts missing from the snapshot (2021-01-29)
By default, prepare hgvs-rs loads transcript alignments from cdot into UTA for
transcripts missing from the UTA snapshot (2021-01-29). This enables hgvs-rs to
normalize variants for newer transcripts (e.g., NM_001407958.1).
# Default behavior: load cdot alignments into UTA
ferro-benchmark prepare hgvs-rs --seqrepo-dir data/seqrepo \
--ferro-reference data/ferro
# Skip alignment loading (not recommended)
ferro-benchmark prepare hgvs-rs --seqrepo-dir data/seqrepo \
--ferro-reference data/ferro --no-load-alignmentsWarning: Without alignment loading, hgvs-rs will fail on transcripts added after the UTA snapshot date (January 2021) with "replacing reference failed" errors.
After preparing all tools, verify everything is ready:
# Check all tools at once (recommended after 'prepare all')
ferro-benchmark check all \
--reference data/benchmark \
--mutalyzer-settings data/benchmark/mutalyzer/mutalyzer_settings.conf \
--seqrepo-path data/seqrepo/2021-01-29
# Or check individual tools:
ferro-benchmark check ferro --reference data/ferro
ferro-benchmark check mutalyzer --mutalyzer-settings data/mutalyzer/mutalyzer_settings.conf
ferro-benchmark check biocommons --seqrepo-path data/seqrepo/2021-01-29
ferro-benchmark check hgvs-rs --seqrepo-path data/seqrepo/2021-01-29All tools should show "ready":
| Tool | Status | Notes |
|---|---|---|
| ferro | ✓ Ready | Transcripts, genome, refseqgene, lrg loaded |
| mutalyzer | ✓ Ready | Cache populated with ~290K entries |
| biocommons | ✓ Ready | UTA + SeqRepo connected |
| hgvs-rs | ✓ Ready | Provider initialization successful |
Run parsing comparisons with all 4 tools at increasing scale.
SAMPLE=samples/parse/sample_100.txt
OUTDIR=results/parse/smoke
mkdir -p $OUTDIR
# Parse with all tools
ferro-benchmark parse ferro -i $SAMPLE -o $OUTDIR/ferro.json
ferro-benchmark parse mutalyzer -i $SAMPLE -o $OUTDIR/mutalyzer.json
ferro-benchmark parse biocommons -i $SAMPLE -o $OUTDIR/biocommons.json
ferro-benchmark parse hgvs-rs -i $SAMPLE -o $OUTDIR/hgvs_rs.json
# Compare all
ferro-benchmark compare results parse \
$OUTDIR/ferro.json $OUTDIR/mutalyzer.json $OUTDIR/biocommons.json $OUTDIR/hgvs_rs.json \
-o $OUTDIR/comparison.jsonSAMPLE=samples/parse/sample_1k.txt
OUTDIR=results/parse/small
mkdir -p $OUTDIR
ferro-benchmark parse ferro -i $SAMPLE -o $OUTDIR/ferro.json
ferro-benchmark parse mutalyzer -i $SAMPLE -o $OUTDIR/mutalyzer.json
ferro-benchmark parse biocommons -i $SAMPLE -o $OUTDIR/biocommons.json
ferro-benchmark parse hgvs-rs -i $SAMPLE -o $OUTDIR/hgvs_rs.json
ferro-benchmark compare results parse \
$OUTDIR/ferro.json $OUTDIR/mutalyzer.json $OUTDIR/biocommons.json $OUTDIR/hgvs_rs.json \
-o $OUTDIR/comparison.jsonSAMPLE=samples/parse/sample_10k.txt
OUTDIR=results/parse/medium
mkdir -p $OUTDIR
ferro-benchmark parse ferro -i $SAMPLE -o $OUTDIR/ferro.json
ferro-benchmark parse mutalyzer -i $SAMPLE -o $OUTDIR/mutalyzer.json
ferro-benchmark parse biocommons -i $SAMPLE -o $OUTDIR/biocommons.json
ferro-benchmark parse hgvs-rs -i $SAMPLE -o $OUTDIR/hgvs_rs.json
ferro-benchmark compare results parse \
$OUTDIR/ferro.json $OUTDIR/mutalyzer.json $OUTDIR/biocommons.json $OUTDIR/hgvs_rs.json \
-o $OUTDIR/comparison.jsonResults from ClinVar patterns sampled with --seed 42:
| Tool | Success | Failed | Time | Patterns/sec |
|---|---|---|---|---|
| ferro | 100 (100%) | 0 | 0.0003s | ~390,000 |
| mutalyzer | 98 (98%) | 2 | 0.39s | ~250 |
| biocommons | 97 (97%) | 3 | 0.04s | ~2,700 |
| hgvs-rs | 97 (97%) | 3 | 0.0002s | ~490,000 |
| Tool | Success | Failed | Time | Patterns/sec |
|---|---|---|---|---|
| ferro | 1000 (100%) | 0 | 0.0008s | ~1,250,000 |
| mutalyzer | 990 (99%) | 10 | 3.4s | ~290 |
| biocommons | 991 (99.1%) | 9 | 0.80s | ~1,240 |
| hgvs-rs | 991 (99.1%) | 9 | 0.0008s | ~1,240,000 |
| Tool | Success | Failed | Time | Patterns/sec |
|---|---|---|---|---|
| ferro | 10000 (100%) | 0 | 0.006s | ~1,630,000 |
| mutalyzer | 9936 (99.4%) | 64 | 33.6s | ~296 |
| biocommons | 9939 (99.4%) | 61 | 4.5s | ~2,200 |
| hgvs-rs | 9939 (99.4%) | 61 | 0.007s | ~1,370,000 |
Ferro parses patterns that other tools reject:
| Pattern Type | Example | Ferro | Others |
|---|---|---|---|
| Frameshift without terminator | p.Met847Glyfs |
✓ Parses | ✗ Expects *N suffix |
| Repeat notation | c.-963GCTACTGCTGCTGCT[3] |
✓ Parses | ✗ Expects = |
| Uncertain range deletions | g.(?_36993051)_(37050663_?)del |
✓ Parses | ✗ Not supported |
| Tool | Failure Category | Count | Example |
|---|---|---|---|
| mutalyzer | Frameshift without terminator | 63 | p.Pro1614Hisfs |
| mutalyzer | Complex allele notation | 1 | NM_006876.2:[c.1168A>G;c.1217C>T] |
| biocommons/hgvs-rs | Repeat notation | 45 | NC_000010.10:g.76788505AGG[2] |
| biocommons/hgvs-rs | Uncertain range | 13 | NC_000007.13:g.(?_26240172)_(26240197_?)del |
| biocommons/hgvs-rs | Complex allele notation | 2 | NM_033028.4:r.[118_261del;118_373del] |
Note: These are valid HGVS patterns found in ClinVar. Ferro's parser is more permissive to handle real-world variant annotations.
When comparing the actual parse outputs across tools (patterns successful in all 4 tools):
| Sample | All Succeed | All Agree | Agreement Rate |
|---|---|---|---|
| Smoke (100) | 96 | 93 | 96.9% |
| Small (1K) | 982 | 963 | 98.1% |
| Medium (10K) | 9,876 | 9,669 | 97.9% |
| Category | Count | Example |
|---|---|---|
| Frameshift Ter suffix | 140 | p.Asn1070fs (ferro/mutalyzer) vs p.Asn1070fsTer (biocommons/hgvs-rs) |
| Deleted sequence stripped | 50 | c.4849_4851delAAG (ferro/mutalyzer/hgvs-rs) vs c.4849_4851del (biocommons) |
| Delins formatting | 7 | c.145_146delinsAA (ferro/biocommons) vs c.145_146delGCinsAA (mutalyzer/hgvs-rs) |
| Dup sequence stripped | 7 | c.3187dupT (ferro/mutalyzer/hgvs-rs) vs c.3187dup (biocommons) |
| Other | 3 | Various minor formatting differences |
Key Insight: Most differences are stylistic variations in HGVS representation:
- Ter suffix: HGVS allows
p.Xfsorp.XfsTerfor frameshifts- Explicit sequences: Deleted/duplicated sequences can be included or omitted
- Star notation:
*andTerare both valid for termination codonsAll outputs represent the same biological variant - they differ only in formatting preference.
Run normalization comparisons with all 4 tools at increasing scale.
Parallel Workers: Use
-j 4for mutalyzer/biocommons to speed up normalization. Each Python worker uses ~12-14GB RAM, so 4 workers requires ~50GB RAM. Adjust based on available memory.
LRG Transcript Support: ClinVar contains many LRG transcript patterns (e.g.,
LRG_199t1:c.79del). UTA doesn't have LRG transcript definitions, so biocommons/hgvs-rs will fail on these patterns. Use--lrg-mapping data/ferro/lrg/lrg_refseq_mapping.txtto translate LRG transcripts to RefSeq equivalents.
UTA Must Be Running: biocommons and hgvs-rs require the UTA PostgreSQL database. If you've restarted your machine since running
prepare biocommons, start UTA first:ferro-benchmark setup start-uta # Verify it's running: docker ps -f name=ferro-uta
Note: Normalize samples exclude protein patterns. See Create Sample Files for details.
SAMPLE=samples/normalize/sample_100.txt
OUTDIR=results/normalize/smoke
mkdir -p $OUTDIR
ferro-benchmark normalize ferro -i $SAMPLE -o $OUTDIR/ferro.json \
--reference data/ferro
ferro-benchmark normalize mutalyzer -i $SAMPLE -o $OUTDIR/mutalyzer.json \
--mutalyzer-settings data/mutalyzer/mutalyzer_settings.conf -j 4
ferro-benchmark normalize biocommons -i $SAMPLE -o $OUTDIR/biocommons.json \
--biocommons-settings data/ferro/biocommons_settings.conf \
--lrg-mapping data/ferro/lrg/lrg_refseq_mapping.txt
ferro-benchmark normalize hgvs-rs -i $SAMPLE -o $OUTDIR/hgvs_rs.json \
--seqrepo-path data/seqrepo/2021-01-29 \
--lrg-mapping data/ferro/lrg/lrg_refseq_mapping.txt
ferro-benchmark compare results normalize \
$OUTDIR/ferro.json $OUTDIR/mutalyzer.json $OUTDIR/biocommons.json $OUTDIR/hgvs_rs.json \
-o $OUTDIR/comparison.jsonSAMPLE=samples/normalize/sample_1k.txt
OUTDIR=results/normalize/small
mkdir -p $OUTDIR
ferro-benchmark normalize ferro -i $SAMPLE -o $OUTDIR/ferro.json \
--reference data/ferro
ferro-benchmark normalize mutalyzer -i $SAMPLE -o $OUTDIR/mutalyzer.json \
--mutalyzer-settings data/mutalyzer/mutalyzer_settings.conf -j 4
ferro-benchmark normalize biocommons -i $SAMPLE -o $OUTDIR/biocommons.json \
--biocommons-settings data/ferro/biocommons_settings.conf \
--lrg-mapping data/ferro/lrg/lrg_refseq_mapping.txt
ferro-benchmark normalize hgvs-rs -i $SAMPLE -o $OUTDIR/hgvs_rs.json \
--seqrepo-path data/seqrepo/2021-01-29 \
--lrg-mapping data/ferro/lrg/lrg_refseq_mapping.txt
ferro-benchmark compare results normalize \
$OUTDIR/ferro.json $OUTDIR/mutalyzer.json $OUTDIR/biocommons.json $OUTDIR/hgvs_rs.json \
-o $OUTDIR/comparison.jsonSAMPLE=samples/normalize/sample_10k.txt
OUTDIR=results/normalize/medium
mkdir -p $OUTDIR
ferro-benchmark normalize ferro -i $SAMPLE -o $OUTDIR/ferro.json \
--reference data/ferro
ferro-benchmark normalize mutalyzer -i $SAMPLE -o $OUTDIR/mutalyzer.json \
--mutalyzer-settings data/mutalyzer/mutalyzer_settings.conf -j 4
ferro-benchmark normalize biocommons -i $SAMPLE -o $OUTDIR/biocommons.json \
--biocommons-settings data/ferro/biocommons_settings.conf \
--lrg-mapping data/ferro/lrg/lrg_refseq_mapping.txt --workers 4
ferro-benchmark normalize hgvs-rs -i $SAMPLE -o $OUTDIR/hgvs_rs.json \
--seqrepo-path data/seqrepo/2021-01-29 \
--lrg-mapping data/ferro/lrg/lrg_refseq_mapping.txt --workers 8
ferro-benchmark compare results normalize \
$OUTDIR/ferro.json $OUTDIR/mutalyzer.json $OUTDIR/biocommons.json $OUTDIR/hgvs_rs.json \
-o $OUTDIR/comparison.json| Tool | 100 patterns | 1K patterns | 10K patterns |
|---|---|---|---|
| ferro | ~0.01s | ~0.1s | ~1s |
| mutalyzer | ~5s | ~50s | ~8 min |
| biocommons | ~5s | ~50s | ~8 min |
| hgvs-rs | ~5s | ~50s | ~8 min |
Important: biocommons is excluded from full ClinVar comparisons due to speed limitations (~27 days estimated for 57.5M patterns).
INPUT=data/clinvar/clinvar_patterns.txt
OUTDIR=results/parse/full
mkdir -p $OUTDIR
# Parse with ferro (fastest)
echo "=== Parsing with ferro ==="
ferro-benchmark parse ferro -i $INPUT -o $OUTDIR/ferro.json
# Parse with mutalyzer
echo "=== Parsing with mutalyzer ==="
ferro-benchmark parse mutalyzer -i $INPUT -o $OUTDIR/mutalyzer.json
# Parse with hgvs-rs
echo "=== Parsing with hgvs-rs ==="
ferro-benchmark parse hgvs-rs -i $INPUT -o $OUTDIR/hgvs_rs.json
# Compare (3 tools)
ferro-benchmark compare results parse \
$OUTDIR/ferro.json $OUTDIR/mutalyzer.json $OUTDIR/hgvs_rs.json \
-o $OUTDIR/comparison.json| Tool | ~57.5M patterns | Patterns/sec |
|---|---|---|
| ferro | ~12 sec | ~4,000,000 |
| mutalyzer | ~40 min | ~20,000 |
| hgvs-rs | ~4 min | ~200,000 |
| biocommons | ~27 days | ~20 (excluded) |
ferro parses 57,523,009 of 57,524,053 patterns successfully (99.998%). The 1,044 parse failures are malformed syntax not conforming to any HGVS grammar (e.g., multi-range uncertain positions, multi-repeat notation, non-standard suffixes).
Important: biocommons is excluded from full ClinVar comparisons due to speed limitations.
INPUT=data/clinvar/clinvar_patterns.txt
OUTDIR=results/normalize/full
mkdir -p $OUTDIR
# Normalize with ferro (fastest)
echo "=== Normalizing with ferro ==="
ferro-benchmark normalize ferro -i $INPUT -o $OUTDIR/ferro.json \
--reference data/ferro
# Normalize with mutalyzer (parallel workers)
echo "=== Normalizing with mutalyzer ==="
ferro-benchmark normalize mutalyzer -i $INPUT -o $OUTDIR/mutalyzer.json \
--mutalyzer-settings data/mutalyzer/mutalyzer_settings.conf -j 24
# Normalize with hgvs-rs
echo "=== Normalizing with hgvs-rs ==="
ferro-benchmark normalize hgvs-rs -i $INPUT -o $OUTDIR/hgvs_rs.json \
--seqrepo-path data/seqrepo/2021-01-29 --workers 24
# Compare (3 tools)
ferro-benchmark compare results normalize \
$OUTDIR/ferro.json $OUTDIR/mutalyzer.json $OUTDIR/hgvs_rs.json \
-o $OUTDIR/comparison.json| Tool | Time | Patterns/sec |
|---|---|---|
| ferro | ~12 sec | ~4,000,000 |
| mutalyzer | ~40 min | ~20,000 |
| hgvs-rs | ~4 hours | ~200 |
| biocommons | ~27 days | ~20 (excluded) |
On 57,524,053 ClinVar patterns with a fully prepared ferro reference:
| Metric | Count | Rate |
|---|---|---|
| Total patterns | 57,524,053 | — |
| Parse failures | 1,044 | 0.002% |
| Normalize success | 57,522,070 | 99.997% |
| Normalize failures | 1,983 | 0.003% |
The 1,983 normalize failures break down as:
| Category | Count | Description |
|---|---|---|
| Parse errors | 1,044 | Malformed HGVS syntax (inherently unparseable) |
| Intronic conversion | 909 | CDS-to-tx position doesn't match intron boundary (cdot coordinate discrepancy) |
| Boundary-spanning n. | 30 | n. variant spans exon-intron boundary (not yet supported) |
Note: Excluding the 1,044 patterns that cannot be parsed by any tool, ferro normalizes 99.998% of all parseable patterns (57,522,070 / 57,523,009).
LFS objects are large. Skip them for faster setup:
GIT_LFS_SKIP_SMUDGE=1 git clone <repo-url>docker --version
docker ps
# macOS: Start Docker Desktop from ApplicationsmacOS ships with openrsync, but SeqRepo requires real rsync:
# Error: "Binary located at /usr/bin/rsync appears to be an `openrsync` instance"
# Solution: Use pixi's rsync
pixi run seqrepo --rsync-exe $(pixi run which rsync) \
--root-directory data/seqrepo pull --instance-name 2021-01-29# Check container status
docker ps -f name=ferro-uta
# Start if stopped
ferro-benchmark setup start-uta
# or
docker start ferro-uta
# Verify connection
docker exec ferro-uta psql -U anonymous -d uta -c "SELECT 1"# Check cache coverage
ls data/mutalyzer/ | wc -l
# Re-run prepare with patterns to fetch missing accessions
ferro-benchmark prepare mutalyzer \
--ferro-reference data/ferro \
--output-dir data/mutalyzer \
--shards 32 \
--patterns your_patterns.txt# Verify directory structure
ls data/seqrepo/2021-01-29/aliases.sqlite3
ls data/seqrepo/2021-01-29/sequences/Error: Unable to open SeqRepo directory /old/path/to/seqrepo/2021-01-29
The biocommons_settings.conf file contains absolute paths that may be outdated
if you moved the reference data.
Solution 1: Update the settings file:
# Check current path
cat data/ferro/biocommons_settings.conf
# Edit HGVS_SEQREPO_DIR to the correct pathSolution 2: Use command-line override:
ferro-benchmark normalize biocommons -i patterns.txt -o results.json \
--biocommons-settings data/ferro/biocommons_settings.conf \
--seqrepo-path /correct/path/to/seqrepo/2021-01-29# Activate pixi environment
pixi shell
# Or install manually
pip install mutalyzer-hgvs-parser hgvs biocommons.seqrepo# Ensure PostgreSQL client is available
# macOS
brew install postgresql
# Linux
apt install libpq-dev| Command | Description |
|---|---|
prepare <tool> |
Download/prepare reference data for a tool |
check <tool> |
Verify tool configuration and dependencies |
parse <tool> |
Parse HGVS patterns with a tool |
normalize <tool> |
Normalize HGVS patterns with a tool |
Where <tool> is one of: ferro, mutalyzer, biocommons, hgvs-rs, or all
Note:
prepare allandcheck allset up and verify all four tools in sequence.parse allandnormalize allare not supported - use individual tools instead.
| Command | Description |
|---|---|
compare results parse |
Compare parsing results between tools |
compare results normalize |
Compare normalization results between tools |
compare arbitration |
View arbitration decisions |
| Command | Description |
|---|---|
extract clinvar |
Extract patterns from ClinVar TSV |
extract json |
Extract patterns from JSON files |
extract proteins |
Extract protein accessions from ClinVar |
extract sample |
Create stratified random sample (use --exclude-protein for normalize) |
extract shard |
Split dataset for parallel processing |
| Command | Description |
|---|---|
setup uta |
Set up UTA PostgreSQL database via Docker |
setup seqrepo |
Download SeqRepo data |
setup start-uta |
Start UTA Docker container |
setup stop-uta |
Stop UTA Docker container |
Note: UTA and SeqRepo setup are also integrated into
prepare biocommonsandprepare hgvs-rsvia the--uta-dumpflag.
| Command | Description |
|---|---|
generate summary |
Create aggregate statistics |
generate report |
Generate markdown report |
generate readme |
Generate README-ready tables |
| Command | Description |
|---|---|
collate parsing |
Collate parsing results from shards |
collate normalization |
Collate normalization results from shards |