Similarity-first interpretability: find a case’s closest look-alikes (“twins”) and explain why.
Educational demo only. Not a medical device.
This project is an educational visualization built on a public dataset.
It is NOT intended for diagnosis, treatment, screening, triage, or clinical decision-making.
Never use this app to make medical decisions. If you need medical support, consult qualified healthcare professionals.
Most ML demos do this:
“Predict malignant vs benign.”
This project does something more interpretable:
“Show me the closest look-alike cases (twins) and explain the similarity structure around a case.”
Instead of pretending the model “knows,” the app reveals neighborhood evidence:
- Who does this case resemble?
- How consistent is the neighborhood?
- Which features make the case look like its twins?
- Is the case sitting near a boundary between groups?
- What minimal feature shifts would move it closer to another neighborhood? (educational geometry, not medical advice)
For a selected case (row), the app:
- Standardizes features to make distances meaningful across mixed units.
- Builds a k-nearest neighbor (kNN) neighborhood (the “twins”).
- Shows neighborhood composition (how many benign vs malignant twins).
- Explains similarity by highlighting the largest feature differences (“drivers”).
- Provides multiple interpretability views:
  - Overview (neighborhood composition + summary)
  - Twin Gallery (closest benign + malignant lists)
  - Difference Fingerprint (standardized deltas, if both groups exist)
  - Minimal-Change Lab (directional shifts toward a target group)
  - Dataset Explorer (full transparency: browse rows + columns)
- How it works
- Screenshots & walkthrough
- Install & run
- CLI usage
- Project structure
- How to interpret outputs
- Dataset & copyright
- Limitations
Breast cancer datasets typically include:
- a diagnosis label (B benign, M malignant)
- numeric feature columns (radius_mean, texture_mean, perimeter_mean, etc.)
The app:
- loads the dataset from data/raw/data.csv
- selects numeric features
- standardizes them (so “area_mean” doesn’t dominate distance by scale alone)
Why standardization matters
Distance-based models are extremely sensitive to feature scale:
- area-related features can be orders of magnitude larger than smoothness features
- without scaling, the “largest unit features” dominate similarity even if they’re not the meaningful reason
Standardization makes distance represent pattern similarity, not unit mismatch.
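A minimal sketch of the standardization step, assuming plain z-scoring (subtract the column mean, divide by the column standard deviation). The helper name `standardize` and the toy values are hypothetical, and the app itself may rely on a library scaler instead; this version uses only the Python standard library.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column: (value - column mean) / column std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns so they map to 0.0
    # instead of raising a division-by-zero error.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Toy rows of (area_mean, smoothness_mean); the numbers are made up.
raw = [[1000.0, 0.10], [500.0, 0.08], [750.0, 0.12]]
scaled = standardize(raw)
```

After scaling, both columns have mean 0 and unit spread, so neither dominates a distance purely because of its units.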
We model each case as a vector x in standardized feature space.
For a chosen case:
- find its k nearest neighbors
- compute distances: smaller distance = more similar overall pattern
This creates:
- a “twin list” (nearest neighbors)
- a “distance curve” (how quickly similarity fades as rank increases)
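The twin list and distance curve can be sketched as a brute-force search. This assumes Euclidean distance in standardized space, which the text does not state explicitly, and the function name `nearest_neighbors` is hypothetical; the project's own index (src/similarity.py) may work differently.

```python
import math

def nearest_neighbors(vectors, query_index, k):
    """Return (row_index, distance) pairs for the k closest rows, closest first."""
    query = vectors[query_index]
    distances = [
        (i, math.dist(query, vec))  # Euclidean distance (an assumption)
        for i, vec in enumerate(vectors)
        if i != query_index  # a case is not its own twin
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]

# Toy standardized vectors; row 0 is the query.
vectors = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.3], [2.0, 2.0]]
twins = nearest_neighbors(vectors, query_index=0, k=3)

# The distance curve is the neighbor distances in rank order.
distance_curve = [distance for _, distance in twins]
```

Brute force is adequate for a dataset of a few hundred rows; a prebuilt index mainly matters for larger data.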
To explain why a neighbor is close to the query, we compute per-feature differences in standardized space.
The app summarizes:
- “Top drivers” = features that change the most (by absolute delta)
- not causal, but highly interpretable as geometry
Think of it as:
“Which dimensions separate this case from its nearest look-alikes?”
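As an illustration of the "top drivers" idea, the sketch below ranks features by the absolute difference between a query and one neighbor, both already standardized. The function name `top_drivers` and the toy vectors are hypothetical, not taken from the project's code.

```python
def top_drivers(query, neighbor, feature_names, n=3):
    """Rank features by |query - neighbor| and return the n largest as (name, delta)."""
    deltas = {
        name: q - nb
        for name, q, nb in zip(feature_names, query, neighbor)
    }
    ranked = sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]

names = ["radius_mean", "texture_mean", "perimeter_mean"]
query = [0.5, -1.2, 0.4]     # standardized values, made up
neighbor = [0.4, 0.3, 0.6]
drivers = top_drivers(query, neighbor, names, n=2)
```

The sign of each delta is kept so a reader can see whether the query sits above or below its twin on that feature; only the ranking uses the absolute value.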
If the neighborhood includes at least one benign and one malignant example:
- the app can compare “nearest benign” vs “nearest malignant”
- producing a difference fingerprint: the features that most separate those two reference twins
If the neighborhood is all one class (for example, all malignant):
- the app shows a warning (because contrast needs both groups)
- that warning is itself interpretability: it indicates local homogeneity
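The contrast logic above can be sketched as follows, assuming neighbors arrive sorted closest-first as (row index, label, distance) tuples with the B/M labels described earlier. The function name `difference_fingerprint` is hypothetical, and returning `None` stands in for the warning the app displays.

```python
def difference_fingerprint(neighbors, vectors, feature_names):
    """Contrast the nearest benign and nearest malignant twins.

    neighbors: list of (row_index, label, distance), closest first.
    Returns (feature, malignant - benign) pairs sorted by absolute size,
    or None when the neighborhood contains only one label.
    """
    nearest = {}
    for index, label, _distance in neighbors:
        nearest.setdefault(label, index)  # keep the closest row per label
    if "B" not in nearest or "M" not in nearest:
        return None  # homogeneous neighborhood: no contrast possible
    benign = vectors[nearest["B"]]
    malignant = vectors[nearest["M"]]
    deltas = {
        name: m - b
        for name, b, m in zip(feature_names, benign, malignant)
    }
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)

names = ["radius_mean", "texture_mean"]
vectors = [[0.0, 0.0], [0.2, 0.1], [1.0, -0.5], [0.3, 0.3]]  # made-up values
mixed = [(1, "M", 0.22), (3, "M", 0.42), (2, "B", 1.12)]
fingerprint = difference_fingerprint(mixed, vectors, names)
homogeneous = difference_fingerprint(mixed[:2], vectors, names)
```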
This tool answers a geometry question:
“What direction in feature space would move this case closer to the target neighborhood centroid?”
It does NOT say:
- “change this feature in the real world”
- “this is an intervention”
- “this is causal”
It visualizes:
- which feature dimensions dominate the shift toward benign-like or malignant-like neighborhoods.
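A rough sketch of the geometry question, assuming the "target neighborhood centroid" is the per-feature mean of the target-group neighbors in standardized space. The function name `shift_toward_centroid` and the data are hypothetical; the result is a direction in feature space, not a recommendation.

```python
def shift_toward_centroid(query, target_vectors, feature_names):
    """Per-feature delta (centroid - query), largest absolute shift first."""
    n = len(target_vectors)
    centroid = [sum(column) / n for column in zip(*target_vectors)]
    deltas = {
        name: c - q
        for name, q, c in zip(feature_names, query, centroid)
    }
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)

names = ["radius_mean", "texture_mean"]
query = [0.0, 1.0]                         # standardized, made up
benign_twins = [[1.0, 1.0], [3.0, 1.2]]    # target-group neighbors, made up
shifts = shift_toward_centroid(query, benign_twins, names)
```

Here radius_mean carries almost all of the shift, while texture_mean is already close to the target centroid.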
What you’re seeing
- Left sidebar “Controls”:
  - Case row index: which sample you are analyzing
  - Neighbors (k): how many closest twins to show
  - Top drivers: how many explanation features to list
  - Radar features: how many features to display in radar plots
- Main “Overview”:
  - shows the query’s dataset label
  - counts benign vs malignant twins
  - displays malignant share (percentage of malignant neighbors)
How to interpret
- Malignant share ~100% means the case sits inside a malignant-like region of feature space.
- Mixed neighbors suggest boundary behavior: the case sits near regions of both types.
- A “homogeneous neighborhood” (all one label) is a strong interpretability signal: the case has many close look-alikes of that same label.
Why this matters
This is “context before conclusion.” Even if someone later adds a classifier, the neighborhood evidence helps you judge:
- “Is this case supported by consistent local examples?”
- “Or is it geometrically ambiguous?”
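The Overview numbers can be reproduced from the neighbor labels alone. A minimal sketch, assuming labels are the dataset's B/M codes; the function name `neighborhood_summary` and its dictionary keys are hypothetical.

```python
from collections import Counter

def neighborhood_summary(labels):
    """Summarize twin labels: counts, malignant share, and homogeneity."""
    counts = Counter(labels)
    total = len(labels)
    return {
        "benign": counts["B"],
        "malignant": counts["M"],
        # Share of malignant twins, 0.0 for an empty neighborhood.
        "malignant_share": counts["M"] / total if total else 0.0,
        # True when every twin carries the same label.
        "homogeneous": len(set(labels)) == 1,
    }

summary = neighborhood_summary(["M", "M", "B", "M"])
```

A share near 0 or 1 with `homogeneous` set indicates consistent local support; values in between point to the boundary behavior described above.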
What you’re seeing
- X-axis: neighbor rank (1 = closest)
- Y-axis: distance in standardized feature space
How to interpret the curve
- Flat curve early: many very close twins exist → strong local cluster.
- Sharp jump: after a small number of neighbors, similarity quickly drops → the real neighborhood might be small (k should be smaller).
- Smooth gradual increase: similarity decays slowly → larger k can still represent “local context.”
How to use it to choose k
- If the distance “knee” happens at rank 3–5, choose k around that range for meaningful neighborhood analysis.
- If there’s no knee, k=10–20 can still be reasonable for exploration.
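One simple way to locate a knee automatically is to pick the rank just before the largest jump between consecutive distances. This is only an illustrative heuristic, not something the app is stated to do, and the function name `suggest_k` is hypothetical.

```python
def suggest_k(distance_curve):
    """Return the 1-based rank just before the largest jump in distance."""
    if len(distance_curve) < 2:
        return len(distance_curve)
    jumps = [
        later - earlier
        for earlier, later in zip(distance_curve, distance_curve[1:])
    ]
    largest = max(range(len(jumps)), key=jumps.__getitem__)
    return largest + 1  # jumps[i] sits between rank i+1 and rank i+2

# Made-up curve: three tight twins, then a clear drop in similarity.
curve = [0.20, 0.22, 0.25, 0.90, 0.95, 1.00]
k = suggest_k(curve)
```

On a smooth curve the largest jump is not meaningful, so a visual check of the plot remains the safer guide.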
What you’re seeing
- Two lists/tables:
  - Closest benign twins
  - Closest malignant twins
- Each neighbor includes:
  - distance
  - diagnosis label
  - feature values (to inspect “what makes them similar”)
Important behavior
Sometimes you’ll see “No benign twins in this neighborhood.” That’s not a bug; it means:
- within the chosen k, all closest cases are malignant (or vice versa)
- the local region is label-homogeneous
What you can do
- Increase k to “reach further” into the space
- Try another row index
- Use the distance curve to pick a meaningful k
What you’re seeing
- Radar charts compare:
  - the query
  - the nearest benign
  - the nearest malignant
- The radar is a “signature” view: it emphasizes pattern geometry.
How to interpret
- If the query radar shape closely matches the malignant twin, the case is geometrically malignant-like.
- If it lies between them, the case may be near a boundary.
- Radar works best when you restrict features (too many features makes radar unreadable), which is why you can select “Radar features.”
What you’re seeing
- A warning message appears if the neighborhood doesn’t include at least one benign and one malignant reference.
- This page aims to compute:
  - standardized deltas between the query and reference neighbors
  - plus a contrast between benign-like and malignant-like reference twins
Why this is useful
Instead of viewing single-case features in isolation, you learn:
- “Which features actually separate the nearest benign and malignant examples around this query?”
That’s interpretability based on local evidence, not global averages.
If you keep seeing the warning
- increase k until you capture at least one neighbor of each type
- or select a case that lives nearer the boundary
What you’re seeing
- You choose a target resemblance group (here, malignant-like)
- The bar chart shows directional deltas per feature
Interpretation
- Big deltas: features that most define the difference between the query and the target centroid
- Small deltas: features already aligned with the target group
Critical caution
This is counterfactual geometry, not causal intervention. It does NOT imply:
- “change this biological trait”
- “this is treatment”
It only visualizes: “these dimensions matter most for shifting similarity in this dataset.”
Same tool, different target group.
Why this view is powerful
Comparing benign-like vs malignant-like shifts helps you see:
- whether the query is “closer” to one group than the other
- which features dominate movement toward each group
If benign-like shifts are huge but malignant-like are small:
- the case is geometrically much closer to malignant regions
What you’re seeing
- dataset size information (rows, features)
- class distribution (benign vs malignant)
- a scrollable table view of the dataset
Why it matters
Interpretability without transparency can be misleading. This page ensures you can:
- verify columns
- inspect values
- confirm preprocessing assumptions
- understand the dataset foundation behind every chart
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app/app.py
Examples:
python -m src.cli prepare
python -m src.cli index
Use these when you want deterministic rebuilds of:
- cleaned dataset
- neighbor index artifacts
Tumor-Doppelgänger-Studio/
├─ app/
│ └─ app.py # Streamlit UI (tabs, plots, controls)
├─ src/
│ ├─ cli.py # CLI entrypoints (prepare/index/etc.)
│ ├─ config.py # Paths + constants
│ ├─ data_prep.py # Load/clean/standardize features
│ ├─ similarity.py # kNN index + neighbor queries
│ ├─ explain.py # Driver explanations + deltas
│ └─ utils.py # helpers
├─ data/
│ ├─ raw/
│ │ └─ data.csv # Kaggle dataset copy
│ └─ processed/
│ └─ clean.csv # processed dataset used by app
├─ models/
│ └─ twin_index.joblib # saved neighbor index (rebuildable)
├─ requirements.txt
├─ LICENSE
└─ README.md
- All malignant neighbors: strong malignant-like region (in this dataset’s geometry)
- All benign neighbors: strong benign-like region
- Mixed neighbors: boundary behavior (the most interesting cases for interpretability)
- If distance rises sharply after rank ~3, the “real neighborhood” is small.
- If distance grows gradually, k=10–20 remains locally meaningful.
Drivers are not “cause.” They’re “what feature dimensions separate the query and its neighbors.”
It answers:
- “what shift would move this vector toward a different neighborhood?”
It does not imply real-world action.
Dataset link (Kaggle): https://www.kaggle.com/datasets/neurocipher/breast-cancer-dataset
Copyright / licensing note
- The dataset belongs to its original authors/uploaders.
- Kaggle datasets can have specific licenses/terms-of-use on the dataset page.
- This repository uses the dataset for educational/demo purposes.
- If you publish or redistribute the dataset, review and comply with its Kaggle license and terms.
- Not clinical: no medical decisions.
- Dataset-bound: similarity is only meaningful relative to the dataset’s feature distributions.
- Distance is a proxy: closeness depends on chosen features + scaling choice.
- Not causal: deltas and minimal-change are geometry-based explanations.