This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
CNVkit is a command-line toolkit and Python library for detecting copy number variants and alterations genome-wide from high-throughput sequencing data. It provides both a CLI interface and Python API for genomic analysis workflows.
Supported Python versions: 3.11+ (tested on 3.11-3.14). Minimum versions are aligned with Ubuntu 26.04 LTS (Resolute).
- Bug fixes and new features: Write a failing test first, then implement.
- Edge cases: Before finishing, verify behavior for empty inputs, NaN/missing values, and single-element arrays.
- NaN weight safety: `np.average(values, weights=w)` and `numpy.sum(w)` propagate NaN; use `np.nansum` for weight sums and filter `~np.isnan(wt)` before `np.average`. Note that `pandas.Series.sum()` skips NaN by default; prefer explicit `np.nansum` for clarity.
- User-facing changes: Update the relevant docs in `doc/*.rst`.
- Clinical impact: When reviewing changes, consider whether the changeset alters numerical output or output file formats (`.cnr`, `.cns`, `.cnn`, SEG, VCF). Flag any such changes explicitly, as downstream clinical pipelines may depend on exact output stability.
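The NaN weight rule above can be sketched as follows. This is a minimal illustration, not code from the repository: the helper name `nan_safe_weighted_mean` is hypothetical, and returning NaN for empty or all-missing input is an assumption about the desired edge-case behavior.

```python
import numpy as np


def nan_safe_weighted_mean(values, weights):
    """Weighted mean that ignores entries with a NaN value or NaN weight.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Keep only positions where both the value and its weight are present.
    ok = ~np.isnan(values) & ~np.isnan(weights)
    # Empty input, all-missing input, or zero total weight: nothing to average.
    # (np.average would raise ZeroDivisionError on a zero weight sum.)
    if not ok.any() or np.nansum(weights[ok]) == 0:
        return float("nan")
    return float(np.average(values[ok], weights=weights[ok]))
```

The guard covers the edge cases listed above (empty input, NaN or missing values, single-element arrays) without relying on library defaults that differ between NumPy and pandas.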
See doc/development.rst for the full developer guide (environment
setup, pre-commit hooks, code style, testing matrix, PR process, Docker
release flow). The notes below are the AI-session-specific quirks not
already in that guide.
Conda env quirk: conda run -n cnvkit <command> is unreliable in
this repo (the dev tools sometimes fail to find their dependencies).
Use conda activate cnvkit && <command> instead — including inside
scripts and parallel agent runs.
`# type: ignore` requirement: `pyproject.toml` sets
`enable_error_code = ["ignore-without-code"]`, so every `# type: ignore`
must specify the error code (e.g. `# type: ignore[return-value]`). A bare
`# type: ignore` fails mypy.
Common type-ignore patterns in this codebase (the footguns mypy flags most often here):
- `tabio.read()` returns union types: `# type: ignore[return-value]` at call sites.
- Pandas operations often return `Any`: `# type: ignore[no-any-return]`.
- Closures don't narrow types in mypy: `# type: ignore[index]` or a local `assert`.
- Parameters typed `param: None = None` produce unreachable blocks; use `Optional[Type] = None` (or `Type | None = None`) instead.
- Generator functions must use `-> Generator[YieldType, SendType, ReturnType]`, not `-> Iterator` or `-> None`.
- When a variable changes type (e.g. `str` → `list[int]`), rename it to avoid shadowing (e.g. `copies` → `copy_strs`).
- `numpy.bool_` is not assignable to `bool` in mypy; use `# type: ignore[assignment]` or widen parameters to `bool | bool_ | None`.
- `cnvlib/` - Main Python package
  - `commands.py` - CLI definitions and API functions (`_cmd_*` for arg parsing, `do_*` for logic)
  - `cnvkit.py` - CLI entry point that routes to commands
  - `core.py` - Core data structures and utilities
  - `segmentation/` - Segmentation algorithms (CBS, HMM, etc.)
  - `batch.py`, `segment.py`, etc. - Individual command implementations
  - `cmdutil.py`, `params.py` - Utility functions
  - `plots.py`, `diagram.py`, `heatmap.py`, `scatter.py` - Visualization
  - `importers.py`, `export.py` - Data import/export
  - `cnary.py` - CopyNumArray (extends GenomicArray with log2 ratios and gene names)
  - `vary.py` - VariantArray (extends GenomicArray with variant allele data)
- `skgenome/` - Genomic data handling library (part of CNVkit but decoupled)
  - `gary.py` - GenomicArray class for genomic interval data (wraps pandas DataFrame)
  - `tabio/` - File I/O for BED, GFF, VCF, SEG, Picard, CNVkit formats (`.cnn`/`.cnr`/`.cns`), bedGraph
  - `chromnames.py` - Chromosome-name classification (autosome/sex/mito/alt-contig detection, arabic + Roman numerals, X/Y label inference)
  - `genomebuild.py` - Reference assembly metadata (PAR coordinates) as `GenomeBuild` value objects, with `get_genome_build()` lookup
  - `chromsort.py` - Chromosome-name sort keys
GenomicArray uses typing.Self (PEP 673) so that methods like .copy(), .concat(), and .as_dataframe() preserve the subclass type through type checking.
All chromosome-name classification goes through skgenome.chromnames —
do NOT add inline regexes or .startswith("chr") / chromosome.iat[0]
heuristics. The classifier is context-aware (it inspects the whole
chromosome set, not one name), because chrX is a sex chromosome in
human but autosome 10 in yeast. See doc/sex.rst "Non-human and
Roman-numeral genomes" for the user-facing behavior.
API points worth knowing:
- `CopyNumArray.chr_x_label` / `chr_y_label` return `str | None`. `None` means "no sex chromosome detected in this assembly"; callers must handle `None` (the `chr_*_filter` methods return all-False in that case rather than crashing).
- `GenomicArray.autosomes()` falls back to returning the whole array (with a warning) when no autosomes are recognized. Be permissive on unfamiliar assemblies; don't silently drop data.
- PAR coordinates live in `skgenome.genomebuild`; `cnvlib.params.PSEUDO_AUTSOMAL_REGIONS` is a back-compat re-export.
doc/sex.rst is the source of truth for the math (ratio-of-residuals
maleness ratios + AND-gate, VCF heterozygous-SNP confirmer, target /
antitarget reconciliation in do_reference). The rules below are
project invariants that, if broken, silently regress the design
without breaking tests:
- `verify_sample_sex` is the canonical resolver. Don't reinvent `is_female_default(guess_xx(...))` inline; route through it so user `--sample-sex` and the VCF het-density confirmer both apply.
- Honest `None` propagates on inference paths; concrete `bool` only at decision consumers via `is_female_default`. `do_reference`'s target/antitarget reconciliation needs the honest `None` to work.
- Reporting commands stay honest: `do_sex` reports `Unknown` for an undeterminable sample; don't collapse `None` → `Female` at the report layer.
- VCF het confirmer is one-way (male → female only). Absent chrX hets is non-evidence: true haploid X and "too few SNPs" are indistinguishable.
- Do NOT re-introduce a chi-square or multiplicative `combined_score` on top of the median; the median is already the robust quantity, and wrapping it in a chi-square brought sample-size dependence that produced `.cnr` vs `.cns` inconsistency (#785).
- `.cnn` - Coverage/reference data
- `.cnr` - Copy number ratio data
- `.cns` - Segmented copy number data
Core dependencies are declared in requirements/core.txt; min.txt pins exact minimums for compatibility testing.
Packaging is deliberately conservative. CNVkit ships across PyPI, conda, Docker, and Galaxy, and that matrix is far more fragile than the numerical core, so the project avoids formal package-data:
- Small, code-adjacent data stays inline (e.g. PAR coordinates as Python literals in `skgenome.genomebuild`; the CBS R script as a string in `cnvlib/segmentation/cbs.py`).
- Large reference assets (genome access BEDs, refFlat, gene-info TSVs) are user-supplied or downloaded, not bundled in the wheel, keeping installs small and free of build-specific or licensing baggage.
Introduce package-data only for a concrete need, and pair it with a CI test that installs the built wheel and loads the resource, so a missing data file fails loudly in CI rather than at a user's runtime.
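If package-data is ever introduced, the accompanying check could be as small as the sketch below, run in CI against the installed wheel rather than the source tree. The helper `read_packaged_text` is hypothetical, and the package and resource names passed to it would be whatever the change actually adds; none exist in the project today.

```python
from importlib.resources import files


def read_packaged_text(package: str, resource: str) -> str:
    """Load a text resource from an installed package, failing loudly if absent."""
    path = files(package).joinpath(resource)
    if not path.is_file():
        raise FileNotFoundError(
            f"{resource!r} is missing from installed package {package!r}"
        )
    return path.read_text(encoding="utf-8")
```

Because `importlib.resources.files` resolves against the installed distribution, a data file omitted from the wheel raises here instead of surfacing later at a user's runtime.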
- Type annotations use PEP 604 union syntax: `X | Y` and `X | None`, not `Union[X, Y]` or `Optional[X]`
- Match/case (PEP 634) is used for dispatch on string literals where it improves clarity
- `removeprefix()`/`removesuffix()` (PEP 616) preferred over manual slicing for prefix/suffix removal
- Dict `|=` (PEP 584) preferred over `.update()` for merging dict literals
- All `zip()` calls use explicit `strict=True` or `strict=False` (PEP 618)
`test/test_commands.py` and `test/test_cnvlib.py` each have a top-level `from cnvlib import (...)` block that serves as both a smoke test and the shared import set. Add new `cnvlib` submodule imports there rather than as local imports inside individual test methods.
- The codebase uses `bam_fname` or `sample_fname` for file paths that can be either BAM or bedGraph files
- Parameter names in function signatures often use generic terms (e.g., `bam_fname`) even when they accept multiple formats
The analytical methods implemented in CNVkit are described in the publication: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004873
When implementing or modifying analytical methods, look up the primary literature to understand the underlying algorithms. Use Google Scholar and Europe PMC to find and read the original papers for methods referenced in the code (e.g. segmentation algorithms, statistical tests, normalization approaches).