Tumor Doppelgänger Studio

Similarity-first interpretability: find a case’s closest look-alikes (“twins”) and explain why.
Educational demo only. Not a medical device.

Medical disclaimer (read first)

This project is an educational visualization built on a public dataset.
It is NOT intended for diagnosis, treatment, screening, triage, or clinical decision-making.

Never use this app to make medical decisions. If you need medical support, consult qualified healthcare professionals.

Why this project is different

Most ML demos do this:

“Predict malignant vs benign.”

This project does something more interpretable:

“Show me the closest look-alike cases (twins) and explain the similarity structure around a case.”

Instead of pretending the model “knows,” the app reveals neighborhood evidence:

Who does this case resemble?
How consistent is the neighborhood?
Which features make the case look like its twins?
Is the case sitting near a boundary between groups?
What minimal feature shifts would move it closer to another neighborhood? (educational geometry, not medical advice)

What the app does (high-level)

For a selected case (row):

- Standardizes features to make distances meaningful across mixed units.
- Builds a k-nearest neighbor (kNN) neighborhood (the “twins”).
- Shows neighborhood composition (how many benign vs malignant twins).
- Explains similarity by highlighting the largest feature differences (“drivers”).
- Provides multiple interpretability views:
  - Overview (neighborhood composition + summary)
  - Twin Gallery (closest benign + malignant lists)
  - Difference Fingerprint (standardized deltas, if both groups exist)
  - Minimal-Change Lab (directional shifts toward a target group)
  - Dataset Explorer (full transparency: browse rows + columns)

How it works

1) Feature preparation

Breast cancer datasets typically include:

a diagnosis label (B benign, M malignant)
numeric feature columns (radius_mean, texture_mean, perimeter_mean, etc.)

The app:

loads the dataset from data/raw/data.csv
selects numeric features
standardizes them (so “area_mean” doesn’t dominate distance by scale alone)

Why standardization matters

Distance-based models are extremely sensitive to feature scale:

area-related features can be orders of magnitude larger than smoothness features
without scaling, the “largest unit features” dominate similarity even if they’re not the meaningful reason

Standardization makes distance represent pattern similarity, not unit mismatch.
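
As a rough illustration of the scaling step, here is a minimal z-score sketch in plain Python. It assumes population standard deviation and a fallback of 1.0 for constant columns; the function name `standardize` and the toy values are hypothetical, and the app's own scaler in src/data_prep.py may differ.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: (value - column mean) / column std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Assumption: constant columns fall back to a divisor of 1.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(v - m) / s for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy rows of (area_mean, smoothness_mean); illustrative values only.
raw = [
    [1001.0, 0.118],
    [1326.0, 0.085],
    [386.1, 0.142],
    [578.3, 0.097],
]
scaled = standardize(raw)
```

After scaling, both columns have mean 0 and unit spread, so neither dominates a distance calculation purely because of its units.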

2) Similarity via kNN neighborhoods

We model each case as a vector x in standardized feature space.

For a chosen case:

find its k nearest neighbors
compute distances: smaller distance = more similar overall pattern

This creates:

a “twin list” (nearest neighbors)
a “distance curve” (how quickly similarity fades as rank increases)
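
The neighbor search can be sketched with a brute-force scan. This assumes Euclidean distance, which the README does not state explicitly; `nearest_neighbors` and the toy vectors are hypothetical names and values, not the code in src/similarity.py.

```python
import math


def nearest_neighbors(points, query_index, k):
    """Return (row_index, distance) pairs for the k closest rows, nearest first."""
    query = points[query_index]
    distances = [
        (i, math.dist(query, p))
        for i, p in enumerate(points)
        if i != query_index  # a case is not its own twin
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


# Toy standardized vectors (hypothetical values).
points = [
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.3],
    [2.0, 2.0],
    [-1.0, 0.5],
]
twins = nearest_neighbors(points, query_index=0, k=3)
distance_curve = [d for _, d in twins]  # distance by neighbor rank
```

Here `twins` is the twin list and `distance_curve` is the sequence you would plot against rank.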

3) “Drivers” (explanation by feature deltas)

To explain why a neighbor is close to the query, we compute per-feature differences in standardized space.

The app summarizes:

“Top drivers” = the features with the largest absolute standardized delta between the query and a neighbor
these are not causal, but they are easy to read as geometry

Think of it as:

“Which dimensions separate this case from its nearest look-alikes?”
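
A minimal sketch of the driver ranking, assuming the delta is simply query minus neighbor in standardized space. The helper `top_drivers` and the example values are hypothetical; the actual logic lives in src/explain.py and may differ.

```python
def top_drivers(query, neighbor, feature_names, n=3):
    """Rank features by absolute standardized delta (query - neighbor)."""
    deltas = [
        (name, q - nb)
        for name, q, nb in zip(feature_names, query, neighbor)
    ]
    deltas.sort(key=lambda item: abs(item[1]), reverse=True)
    return deltas[:n]


features = ["radius_mean", "texture_mean", "perimeter_mean"]
query = [0.8, -0.2, 1.5]  # standardized values, illustrative only
twin = [0.7, 0.9, 1.0]
drivers = top_drivers(query, twin, features, n=2)
```

The sign of each delta is kept so a reader can see whether the query sits above or below its twin on that feature.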

4) Group contrast: benign-like vs malignant-like

If the neighborhood includes at least one benign and one malignant example:

the app can compare “nearest benign” vs “nearest malignant”
producing a difference fingerprint: the features that most separate those two reference twins

If the neighborhood is all one class (for example, all malignant):

the app shows a warning (because contrast needs both groups)
that warning is itself interpretability: it indicates local homogeneity
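
The contrast logic, including the homogeneous case, might look like the sketch below. It assumes neighbors arrive sorted nearest first as (row index, distance) pairs and that labels are "B" and "M"; the function names and toy data are hypothetical.

```python
def nearest_with_label(neighbors, labels, target):
    """Index of the closest neighbor carrying `target`, or None if absent."""
    for index, _distance in neighbors:  # neighbors are sorted nearest first
        if labels[index] == target:
            return index
    return None


def difference_fingerprint(neighbors, labels, vectors, feature_names):
    """Per-feature gap between the nearest malignant and nearest benign twin."""
    benign = nearest_with_label(neighbors, labels, "B")
    malignant = nearest_with_label(neighbors, labels, "M")
    if benign is None or malignant is None:
        return None  # homogeneous neighborhood: nothing to contrast
    gaps = [
        (name, m - b)
        for name, b, m in zip(feature_names, vectors[benign], vectors[malignant])
    ]
    return sorted(gaps, key=lambda item: abs(item[1]), reverse=True)


features = ["radius_mean", "texture_mean"]
labels = {3: "B", 7: "M", 9: "M"}
vectors = {3: [0.2, -0.5], 7: [1.4, -0.4], 9: [1.6, 0.1]}

mixed = [(7, 0.4), (3, 0.9), (9, 1.2)]
fingerprint = difference_fingerprint(mixed, labels, vectors, features)

all_malignant = [(7, 0.4), (9, 1.2)]
homogeneous = difference_fingerprint(all_malignant, labels, vectors, features)
```

Returning `None` for the one-class case gives a UI layer a clear signal to show the warning rather than an empty chart.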

5) Minimal-change lab (counterfactual geometry, educational)

This tool answers a geometry question:

“What direction in feature space would move this case closer to the target neighborhood centroid?”

It does NOT say:

“change this feature in the real world”
“this is an intervention”
“this is causal”

It visualizes:

which feature dimensions dominate the shift toward benign-like or malignant-like neighborhoods.
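
In its simplest form, the shift is the vector from the query to the target group's centroid. The sketch below assumes the centroid is the plain mean of the target twins in standardized space; `centroid`, `shift_toward`, and the numbers are hypothetical.

```python
def centroid(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]


def shift_toward(query, target_vectors):
    """Per-feature delta from the query to the target group's centroid."""
    target = centroid(target_vectors)
    return [t - q for q, t in zip(query, target)]


query = [0.0, 1.0]  # standardized values, illustrative only
benign_twins = [[-1.0, 1.0], [-0.5, 1.5]]
shift = shift_toward(query, benign_twins)
```

Each entry of `shift` is one bar in a directional chart: its sign gives the direction and its magnitude shows how much that dimension contributes to the move. It describes geometry only.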

Screenshots & walkthrough

1) Overview, neighborhood composition & “local context”

Screenshot 2025-12-21 at 14-00-57 Tumor Doppelgänger Studio

What you’re seeing

- Left sidebar “Controls”:
  - Case row index: which sample you are analyzing
  - Neighbors (k): how many closest twins to show
  - Top drivers: how many explanation features to list
  - Radar features: how many features to display in radar plots
- Main “Overview”:
  - shows the query’s dataset label
  - counts benign vs malignant twins
  - displays malignant share (percentage of malignant neighbors)

How to interpret

Malignant share ~100% means the case sits inside a malignant-like region of feature space.
Mixed neighbors suggest boundary behavior: the case sits near regions of both types.
A “homogeneous neighborhood” (all one label) is a strong interpretability signal: the case has many close look-alikes of that same label.

Why this matters

This is “context before conclusion.” Even if someone later adds a classifier, the neighborhood evidence helps you judge:

“Is this case supported by consistent local examples?”
“Or is it geometrically ambiguous?”
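
The Overview numbers reduce to a label count over the k twins. A small sketch, assuming labels are the strings "B" and "M" (the helper name `neighborhood_mix` is hypothetical):

```python
from collections import Counter


def neighborhood_mix(neighbor_labels):
    """Count benign/malignant twins and the malignant share of the neighborhood."""
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    share = counts["M"] / total if total else 0.0
    return {
        "benign": counts["B"],
        "malignant": counts["M"],
        "malignant_share": share,
    }


mix = neighborhood_mix(["M", "M", "B", "M", "M"])
```

A share near 0 or 1 indicates a homogeneous neighborhood; values in between point to a boundary case.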

2) Neighbor distance curve, how fast similarity decays

Screenshot 2025-12-21 at 14-01-18 Tumor Doppelgänger Studio

What you’re seeing

X-axis: neighbor rank (1 = closest)
Y-axis: distance in standardized feature space

How to interpret the curve

Flat curve early: many very close twins exist → strong local cluster.
Sharp jump: after a small number of neighbors, similarity quickly drops → the real neighborhood might be small (k should be smaller).
Smooth gradual increase: similarity decays slowly → larger k can still represent “local context.”

How to use it to choose k

If the distance “knee” happens at rank 3–5, choose k around that range for meaningful neighborhood analysis.
If there’s no knee, k=10–20 can still be reasonable for exploration.
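
One crude way to locate a knee is to find the rank after which the distance jumps the most. This is a heuristic offered as an assumption, not something the app is stated to compute; `largest_jump_rank` and the curve values are hypothetical.

```python
def largest_jump_rank(distances):
    """1-based rank after which the sorted distance curve rises the most."""
    if len(distances) < 2:
        return len(distances)
    gaps = [b - a for a, b in zip(distances, distances[1:])]
    return gaps.index(max(gaps)) + 1


# Illustrative curve: four tight twins, then a clear jump.
curve = [0.21, 0.24, 0.26, 0.27, 0.95, 1.02, 1.10]
k_hint = largest_jump_rank(curve)
```

With this curve the largest gap falls between ranks 4 and 5, which suggests a k of about 4. On a smooth curve the largest gap is not meaningful, so treat the result as a hint and check it against the plot.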

3) Twin Gallery, closest benign and malignant look-alikes

Screenshot 2025-12-21 at 14-02-08 Tumor Doppelgänger Studio

What you’re seeing

- Two lists/tables:
  - Closest benign twins
  - Closest malignant twins
- Each neighbor includes:
  - distance
  - diagnosis label
  - feature values (to inspect “what makes them similar”)

Important behavior

Sometimes you’ll see “No benign twins in this neighborhood.” That’s not a bug; it means:

within the chosen k, all closest cases are malignant (or vice versa)
the local region is label-homogeneous

What you can do

Increase k to “reach further” into the space
Try another row index
Use the distance curve to pick a meaningful k

4) Radar comparison, shape signature of query vs reference twins

Screenshot 2025-12-21 at 14-02-29 Tumor Doppelgänger Studio

What you’re seeing

- Radar charts compare:
  - the query
  - the nearest benign
  - the nearest malignant
- The radar is a “signature” view: it emphasizes pattern geometry.

How to interpret

If the query radar shape closely matches the malignant twin, the case is geometrically malignant-like.
If it lies between them, the case may be near a boundary.
Radar works best when you restrict features (too many features makes radar unreadable), which is why you can select “Radar features.”

5) Difference Fingerprint, standardized contrast (if both groups exist)

Screenshot 2025-12-21 at 14-02-39 Tumor Doppelgänger Studio

What you’re seeing

- A warning message appears if the neighborhood doesn’t include at least one benign and one malignant reference.
- This page aims to compute:
  - standardized deltas between the query and reference neighbors
  - plus a contrast between benign-like and malignant-like reference twins

Why this is useful

Instead of viewing single-case features in isolation, you learn:

“Which features actually separate the nearest benign and malignant examples around this query?”

That’s interpretability based on local evidence, not global averages.

If you keep seeing the warning

increase k until you capture at least one neighbor of each type
or select a case that lives nearer the boundary

6) Minimal-Change Lab, shift toward malignant-like neighborhood (educational)

Screenshot 2025-12-21 at 14-03-24 Tumor Doppelgänger Studio

What you’re seeing

You choose a target resemblance group (malignant-like)
The bar chart shows directional deltas per feature

Interpretation

Big deltas: features that most define the difference between the query and the target centroid
Small deltas: features already aligned with the target group

Critical caution

This is counterfactual geometry, not causal intervention. It does NOT imply:

“change this biological trait”
“this is treatment”

It only visualizes: “these dimensions matter most for shifting similarity in this dataset.”

7) Minimal-Change Lab, shift toward benign-like neighborhood (educational)

Screenshot 2025-12-21 at 14-03-10 Tumor Doppelgänger Studio

Same tool, different target group.

Why this view is powerful

Comparing benign-like vs malignant-like shifts helps you see:

whether the query is “closer” to one group than the other
which features dominate movement toward each group

If benign-like shifts are huge but malignant-like are small:

the case is geometrically much closer to malignant regions
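
One rough way to put a number on that comparison is the Euclidean length of each shift. The sketch assumes both group centroids are already computed in standardized space; all names and values here are hypothetical.

```python
import math

query = [1.2, 0.9]  # standardized values, illustrative only
benign_centroid = [-0.8, -0.6]
malignant_centroid = [1.0, 1.1]

# Length of the shift needed to reach each group's centroid.
to_benign = math.dist(query, benign_centroid)
to_malignant = math.dist(query, malignant_centroid)

closer = "malignant-like" if to_malignant < to_benign else "benign-like"
```

A large gap between the two lengths means the query sits well inside one region; similar lengths suggest a boundary case.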

8) Dataset Explorer, transparency layer (rows, columns, class counts)

Screenshot 2025-12-21 at 14-03-40 Tumor Doppelgänger Studio

What you’re seeing

dataset size information (rows, features)
class distribution (benign vs malignant)
a scrollable table view of the dataset

Why it matters

Interpretability without transparency can be misleading. This page ensures you can:

verify columns
inspect values
confirm preprocessing assumptions
understand the dataset foundation behind every chart
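
The same summary can be reproduced outside the app with the standard library. This sketch reads an inline sample instead of data/raw/data.csv and assumes the label column is named `diagnosis`; the column name and the sample rows are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Inline stand-in for the CSV file; values are illustrative only.
sample_csv = """diagnosis,radius_mean,texture_mean
M,17.99,10.38
B,13.54,14.36
B,12.45,15.70
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
feature_columns = [name for name in rows[0] if name != "diagnosis"]
class_counts = Counter(row["diagnosis"] for row in rows)

summary = {
    "rows": len(rows),
    "features": len(feature_columns),
    "classes": dict(class_counts),
}
```

To run it on the real file, swap the `io.StringIO` object for an opened file handle.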

Install & run

1) Create a virtual environment (recommended)

python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate

2) Install dependencies

pip install -r requirements.txt

3) Run the Streamlit app

streamlit run app/app.py

CLI usage

Examples:

python -m src.cli prepare
python -m src.cli index

Use these when you want deterministic rebuilds of:

cleaned dataset
neighbor index artifacts

Project structure

Tumor-Doppelgänger-Studio/
├─ app/
│  └─ app.py                    # Streamlit UI (tabs, plots, controls)
├─ src/
│  ├─ cli.py                    # CLI entrypoints (prepare/index/etc.)
│  ├─ config.py                 # Paths + constants
│  ├─ data_prep.py              # Load/clean/standardize features
│  ├─ similarity.py             # kNN index + neighbor queries
│  ├─ explain.py                # Driver explanations + deltas
│  └─ utils.py                  # helpers
├─ data/
│  ├─ raw/
│  │  └─ data.csv               # Kaggle dataset copy
│  └─ processed/
│     └─ clean.csv              # processed dataset used by app
├─ models/
│  └─ twin_index.joblib         # saved neighbor index (rebuildable)
├─ requirements.txt
├─ LICENSE
└─ README.md

How to interpret outputs

“Neighborhood mix” is an uncertainty hint

All malignant neighbors: strong malignant-like region (in this dataset’s geometry)
All benign neighbors: strong benign-like region
Mixed neighbors: boundary behavior (the most interesting cases for interpretability)

Distance curve tells whether k is meaningful

If distance rises sharply after rank ~3, the “real neighborhood” is small.
If distance grows gradually, k=10–20 remains locally meaningful.

Drivers are why the geometry looks this way

Drivers are not “cause.” They’re “what feature dimensions separate the query and its neighbors.”

Minimal-change is educational counterfactual geometry

It answers:

“What shift would move this vector toward a different neighborhood?”

It does not imply real-world action.

Dataset & copyright

Dataset link (Kaggle): https://www.kaggle.com/datasets/neurocipher/breast-cancer-dataset

Copyright / licensing note

The dataset belongs to its original authors/uploaders.
Kaggle datasets can have specific licenses/terms-of-use on the dataset page.
This repository uses the dataset for educational/demo purposes.
If you publish or redistribute the data, review and comply with the dataset’s Kaggle license and terms.

Limitations & responsible use

Not clinical: no medical decisions.
Dataset-bound: similarity is only meaningful relative to the dataset’s feature distributions.
Distance is a proxy: closeness depends on chosen features + scaling choice.
Not causal: deltas and minimal-change are geometry-based explanations.
