Follows the Croissant 1.1 / RAI schema. Image bytes are not redistributed; this card describes derived metadata only (file paths, labels, perceptual hashes, NCC values).
| Field | Value |
|---|---|
| Name | CV2024 public-validation audit dataset (canonical split: le6; sensitivity split: le6_plus_internal) |
| Contents | Splits, per-image perceptual hashes (pHash, dHash, PDQ), pixel NCC values, audit summaries, and reproduction code. The primary public-pool fixed-list retraining contrast is |
| License | CC-BY 4.0 (metadata only) |
| Format | CSV (UTF-8) + JSON |
| Hosting | Anonymous mirror for review at https://anonymous.4open.science/r/NeurIPS2026ED-CV2024-Audit/; on acceptance, mirrored to Zenodo with DOI (OSF as secondary mirror). |
| Original datasets | Kvasir-Capsule (Smedsrud et al. 2021), SEE-AI, KID, AIIMS CE24 (aggregated by Capsule Vision 2024 Challenge, Handa et al. 2024). |
| Not redistributed | Image bytes. Users must download the originals under their licenses. |
- Reproduce the audit claims in the paper (leakage rates, fixed-list
retraining
$\Delta$ , protocol-gap numbers) by joining our CSVs to the original image files. - Train and evaluate on the
le6orle6_plus_internalsensitivity splits alongside the original CV2024training/andvalidation/CSVs to report Kvasir-origin-removed public-pool accuracy beside the original public-pool accuracy. - Independently replicate the perceptual-hash and pixel-NCC analyses: we release the 64-bit pHash and dHash and the 256-bit PDQ bit-strings for every file and for every nearest-Kvasir match.
Not intended for: clinical decision-making; replacement of the CV2024 official AIIMS-test leaderboard. We include a direct evaluation on the organizer-released AIIMS test for our DINOv2 arms only; this is a scope check, not a benchmark replacement.
| Question | Recommended split | Scope statement to report |
|---|---|---|
| Kvasir-origin-removed CV2024 public-pool evaluation | canonical le6 train/val CSVs |
Public-pool, fixed-list evaluation with CV2024-KVASIR rows removed under pHash+dHash ≤6. Not an official-test or source-held-out claim. |
| Sensitivity to stricter same-source removal | le6_plus_internal validation CSV |
Same as le6, additionally excluding measured same-source train→val duplicates. A sensitivity check, not the primary split. |
| Cross-source low-hash-collision sensitivity | le6_strict CSVs |
Four-row cross-source sensitivity check (rows enumerated below). Not the recommended primary split. |
| Re-scoring CV2024 challenge submissions | results/cv2024_rescored_*.json, results/cv2024_m7_inference.json, results/cv2024_public_official_bridge.json |
Re-score public-validation prediction sheets from CV2024 challenge participants. Do not reinterpret as an official Capsule Vision ranking. |
| AIIMS-test direct evaluation for our DINOv2 arms | results/official_test/ |
AIIMS-only hidden-test direct-evaluation result for these trained arms, not a benchmark replacement. |
| Split | Files | Description |
|---|---|---|
cv2024_training_dedup_le0.csv |
21,241 | original CV2024 training minus 23,444 pHash+dHash-exact matches to Kvasir-Capsule labeled frames |
cv2024_training_dedup_le2.csv |
10,825 | strict near-duplicate removal (both pHash and dHash Hamming ≤ 2) |
cv2024_training_dedup_le6.csv (canonical) |
10,596 | full KVASIR-source removal (both pHash and dHash Hamming ≤ 6); the SEE-AI/KID/AIIMS-only training pool |
cv2024_training_dedup_le6_strict.csv |
10,592 |
le6 minus 4 enumerated cross-source pHash+dHash ≤6 rows (SEE-AI/Erosion image17118.jpg, SEE-AI/Polyp image00664.jpg, KID/Normal normaleso220.jpg, AIIMS/Worms worm1_2555.jpg; DINOv2 cosine 0.4–0.8 — incidental matches, not content duplicates). One additional SEE-AI/Erosion low-hash collision (image07362.jpg) remains in both le6 and le6_strict and is exposed in the annotation CSVs. Included for reproducibility. Paired contrast against le6 (n=10 paired): |Δ| ≤0.022 per-source, all non-significant at α=0.05 (KVASIR 0.004, SEE-AI 0.002, KID 0.022, AIIMS 0.000). |
cv2024_validation_dedup_le0.csv |
9,054 | matching validation set at le0 (CV2024 original val 16,132 minus 7,078 pHash-exact KVASIR-source duplicates) |
cv2024_validation_dedup_le2.csv |
4,655 | matching validation set at le2 (CV2024 original val 16,132 minus 11,477 pHash$\wedge$dHash$\leq 2$ KVASIR-source matches) |
cv2024_validation_dedup_le6.csv (canonical) |
4,551 | matching validation set at le6
|
cv2024_validation_le6_plus_internal.csv |
4,326 | canonical le6 val, additionally excluding 171 SEE-AI filename-collision pairs and intra-source pHash-exact duplicates |
cv2024_training_random10596_s{0,42}.csv |
10,596 | size-matched random draws from the full original CV2024 pool (sample-size control) |
cv2024_training_le6_kvfree_s1.csv |
10,880 | training list for the same-source/domain re-exposure probe; an identical, path-normalized copy is mirrored under results/cv2024_training_le6_kvfree_s1.csv for cross-reference |
Schema (training/validation dedup CSVs in artifacts/csvs/) — matches
the actual released files (verify with head -1 on any CSV):
image_path, Dataset, Angioectasia, Bleeding, Erosion, Erythema,
Foreign Body, Lymphangiectasia, Normal, Polyp, Ulcer, Worms
image_path(str): relative path under CV2024 root, e.g.training\\Normal\\KVASIR\\<video_id>_<frame>.jpg.Dataset(str): one of{KVASIR, SEE-AI, KID, AIIMS}.- Remaining 10 columns: one-hot class indicators (exactly one is 1).
Per-file leakage metadata (pHash/dHash/PDQ/NCC/nearest_kvasir_file) lives in
artifacts/annotations/ and artifacts/ncc/ — see the next two schema
blocks below.
The exact KVASIR validation-to-training pHash-exact pair list is released as
artifacts/annotations/cv2024_KVASIR_internal_train_val_phash_exact_pairs.csv
(
Hash annotations (artifacts/annotations/*.csv):
source, partition, cv_dataset, cv_split, filename, path,
phash (hex, 64-bit), dhash (hex, 64-bit),
min_phash_dist_to_kvasir, min_dhash_dist_to_kvasir, nearest_kvasir_file
NCC annotations (artifacts/ncc/cv2024_KVASIR_ncc_full.csv) — one row
per flagged pair, all 38,592 pairs:
cv_file, cv_path, kvasir_file, kvasir_path,
phash_dist, dhash_dist, ncc (float in [-1, 1])
-
Hash extraction. For every file in CV2024 (
training+validation, 53,739 images) and every labeled Kvasir-Capsule frame (47,238 images), we compute 64-bit pHash and 64-bit dHash via ImageHash 4.3.1 athash_size=8, and 256-bit PDQ viapdqhash0.2.6. Full hash dumps (artifacts/hashes/{hashes,pdq_hashes}_{cv2024,kvasir}.json, ~105 MB) are not included in this release to keep the dataset small; regenerate withbash scripts/00_run_full_audit.sh(~25 min CPU after$CV2024_ROOTand$KVASIR_ROOTare set). -
Nearest-Kvasir annotation. For each CV2024 file, we brute-force search its nearest Kvasir-Capsule labeled-frame hash (minimum Hamming distance per hash family). Per-row pHash annotations (
cv2024_*_phash_annotated.csv) are included and read by the consistency check; the larger per-row PDQ annotations for KVASIR (25 MB) and SEE-AI (9 MB) are regenerated by Stage 0 alongside the hash dumps. The compressed summary (artifacts/summaries/pdq_audit.json) is included and is the source for every paper-cited PDQ number. -
Pixel-level verification. For every flagged KVASIR-source pair (all 38,592), we compute Normalized Cross-Correlation (NCC) on grayscale-converted, resized-to-256×256 images. Stored in
artifacts/ncc/cv2024_KVASIR_ncc_full.csv. -
Label inheritance. We map Kvasir-Capsule's 14-class vocabulary to CV2024's 10-class schema using the official organizer mapping (
artifacts/summaries/cv_to_kvasir.json) and verify one-hot consistency across the 38,584 overlapping entries. -
Internal audit. Within CV2024 itself, we check for train→validation near-duplicates by the same three-hash pipeline, attribute every pHash-exact KVASIR pair to its Kvasir-Capsule labeled video, and record per-source, per-video, and video-prefix leakage statistics.
-
Dedup split generation. For each threshold
$t \in {0, 2, 6}$ , we remove every CV2024 file whose pHash and dHash distances to a Kvasir-Capsule labeled frame are both$\leq t$ . The canonicalle6_plus_internaladditionally removes within-source exact duplicates.
| Class | SEE-AI train | KID train | AIIMS train | Total train | SEE-AI val | KID val | AIIMS val | Total val |
|---|---|---|---|---|---|---|---|---|
| Angioectasia | 530 | 18 | 0 | 548 | 228 | 9 | 0 | 237 |
| Bleeding | 519 | 3 | 0 | 522 | 223 | 2 | 0 | 225 |
| Erosion | 2,340 | 0 | 0 | 2,340 | 1,003 | 0 | 0 | 1,003 |
| Erythema | 580 | 0 | 0 | 580 | 249 | 0 | 0 | 249 |
| Foreign Body | 249 | 0 | 0 | 249 | 107 | 0 | 0 | 107 |
| Lymphangiectasia | 376 | 6 | 0 | 382 | 162 | 3 | 0 | 165 |
| Normal | 4,312 | 315 | 0 | 4,627 | 1,849 | 136 | 0 | 1,985 |
| Polyp | 1,090 | 34 | 0 | 1,124 | 468 | 15 | 0 | 483 |
| Ulcer | 0 | 0 | 66 | 66 | 0 | 0 | 29 | 29 |
| Worms | 0 | 0 | 158 | 158 | 0 | 0 | 68 | 68 |
| Total | 9,996 | 376 | 224 | 10,596 | 4,289 | 165 | 97 | 4,551 |
Note: le6 removes all KVASIR-source examples. The remaining pool is
source-skewed (SEE-AI dominates most classes; Ulcer and Worms are
AIIMS-exclusive). This is not a source-held-out evaluation — SEE-AI,
KID, and AIIMS each appear on both the training and the validation
side. Leave-one-source-out evaluation is out of scope for this dataset.
- Sensitive information. The underlying corpora are de-identified medical images and should be treated as sensitive health data even without direct identifiers. This metadata release adds no patient identifiers. Kvasir-Capsule filenames expose video-prefix-level grouping prefixes; patient-level disjointness is unavailable from the public metadata.
-
Use cases we flag as potentially harmful: using
le6to claim cross-source or leave-one-source-out generalization (it is neither). - Social impact. Correcting the CV2024 public validation score interpretability reduces the risk that downstream medical-AI papers stack inflated numbers.
-
Known biases.
- Source skew:
le6is dominated by SEE-AI (94 % of training, 94 % of validation); Ulcer and Worms classes exist only in AIIMS. - The pHash and dHash Hamming thresholds (≤ 6) miss most rotations
≥ 5° and center-crops ≥ 10 % (see
artifacts/summaries/phash_geometric_robustness.json); the reported near-duplicate flag rate is conservative for crop and large-rotation variants. PDQ at a negative-control-calibrated threshold (≤ 50) corroborates at 99.4 %.
- Source skew:
-
Limitations in the fixed-list retraining contrast. Removing
le6from training simultaneously drops (i) Kvasir-source re-exposure, (ii) domain-matched KVASIR imagery, and (iii) source/class priors. An exploratory matched-arm decomposition (Comp-A Ulcer source swap$n{=}10$ , Comp-B Ulcer doubled$n{=}10$ ) accounts for$\sim 59%$ of$\Delta_{\texttt{le6}}$ as a fixed-split residual KVASIR-content re-exposure contrast (fixed-list residual$-0.126$ on shared seeds; paired-$\Delta$$\Delta_{\text{size}}{=}{-}0.046$ ,$\Delta_{\text{class}}{=}{-}0.041$ ), with Comp-B TOST \emph{EQUIVALENT} inside a documented$\pm 0.010$ sensitivity bound before the extension seeds after an$n{=}4$ pilot; Comp-C shows that AIIMS-only Ulcer doubling does not recover the Comp-B gain. A same-source/domain re-exposure probe (Exp.~1,$n{=}10$ paired seeds) recovers$+0.091 \pm 0.003$ ~SE by adding$284$ non-overlapping Kvasir-Capsule frames tole6, consistent with public-validation inflation reflecting same-source familiarity in addition to direct frame re-exposure. Cross-model (DINOv2-L, DINOv2-B, DINOv2-S, ResNet-50, and ConvNeXt-Tiny, all$n{=}10$ ) residual is$-0.167$ /$-0.156$/$-0.148$/$-0.156$/$-0.208$. -
Per-source decomposition. The
$-0.213$ pooled drop is highly asymmetric: KVASIR subset$-0.473$ , SEE-AI$+0.013$ , KID$-0.044$ , AIIMS$+0.021$ ($n{=}10$ paired). The magnitude therefore reflects in-distribution KVASIR-content scoring, not generic cross-source generalization (for which see LOSO in the paper appendix).
| Source | Role in this dataset | Public access / license evidence | Consent / de-identification status | Bytes redistributed here? |
|---|---|---|---|---|
| CV2024 public training + validation | Original file paths, labels, and public split audited here | Figshare Version 3, DOI 10.6084/m9.figshare.26403469.v3, CC BY 4.0 |
Governed by CV2024 release terms; this dataset adds no patient identifiers | No |
| CV2024 official AIIMS test | Direct scope check for our trained DINOv2 arms only | Figshare Version 4, DOI 10.6084/m9.figshare.27200664.v4, CC BY 4.0 |
Hidden during challenge, later organizer-released; used only through released labels/paths | No |
| Kvasir-Capsule labeled corpus | Named public source used for hash/NCC matching | OSF DOI 10.17605/OSF.IO/DV2AG; dataset page at https://datasets.simula.no/kvasir-capsule/ |
Public medical-image corpus; patient IDs unavailable in the metadata used here | No |
| SEE-AI, KID, AIIMS-derived public CV2024 files | Non-KVASIR controls and remaining le6 public-pool rows |
Accessed through the CV2024 public release | Sub-source consent/licensing is not independently re-verified beyond the CV2024 release; users must follow original dataset and challenge terms | No |
- CV2024 public training + validation corpus: downloadable from the
Capsule Vision 2024 Challenge organizers (Handa et al. 2024); our
CSVs reference file paths as they appear in the official distribution
(
training/<class>/<source>/<file>.jpgandvalidation/<class>/<source>/<file>.jpg). Official Figshare article: "Training and Validation Dataset of Capsule Vision 2024 Challenge", DOI10.6084/m9.figshare.26403469.v3, CC BY 4.0. - CV2024 official AIIMS test corpus: organizer-released,
class-separated test archive; official Figshare article "Testing Dataset
of Capsule Vision 2024 Challenge", DOI
10.6084/m9.figshare.27200664.v4, CC BY 4.0. We use it only for the direct scope check in Stage 7, not for model selection or leaderboard replacement. - Kvasir-Capsule labeled corpus: downloadable from
https://osf.io/dv2ag/ (authors: Smedsrud et al. 2021; DOI
10.17605/OSF.IO/DV2AG; mirror at
https://datasets.simula.no/kvasir-capsule/); our CSVs
reference file paths as they appear in the official
labelled_images/<class>/<file>.jpglayout. - AIIMS official test set: hidden during the challenge and later
organizer-released; used only for the DINOv2 scope check in
results/official_test/, not for participant re-ranking. - SEE-AI / KID / AIIMS-derived public CV2024 files: accessed only through the CV2024 public release. We do not redistribute image bytes and make no independent relicensing claim for those sources; users must follow the original dataset and challenge terms.
A user wishing to reproduce the audit must download the two public
corpora and set CV2024_ROOT and KVASIR_ROOT as environment variables
before running the scripts.