321 changes: 122 additions & 199 deletions scientific-skills/imaging-data-commons/SKILL.md

Large diffs are not rendered by default.

@@ -1,6 +1,6 @@
# BigQuery Guide for IDC

**Tested with:** IDC data version v23
**Tested with:** idc-index 0.12.1 (IDC data version v24)

For most queries and downloads, use `idc-index` (see main SKILL.md). This guide covers BigQuery for advanced use cases requiring full DICOM metadata or complex joins.

@@ -1,6 +1,6 @@
# Clinical Data Guide for IDC

**Tested with:** idc-index 0.11.7 (IDC data version v23)
**Tested with:** idc-index 0.12.1 (IDC data version v24)

Clinical data (demographics, diagnoses, therapies, lab tests, staging) accompanies many IDC imaging collections. This guide covers how to discover, access, and integrate clinical data with imaging data using `idc-index`.

@@ -205,7 +205,7 @@ IDC releases new data versions every 2-4 months. The versioning system ensures r

### How Versioning Works

1. **Snapshots**: Each IDC version (v1, v2, ..., v23, etc.) represents a complete snapshot of all data at release time
1. **Snapshots**: Each IDC version (v1, v2, ..., v24, etc.) represents a complete snapshot of all data at release time
2. **UUID-based**: When data changes, new CRDC UUIDs are assigned; old UUIDs remain accessible
3. **Cumulative buckets**: All versions coexist in the same buckets—old series folders

@@ -223,7 +223,7 @@ IDC releases new data versions every 2-4 months. The versioning system ensures r

For querying version-specific metadata, BigQuery provides versioned tables. See `bigquery_guide.md` for details.
- `bigquery-public-data.idc_current` — alias to latest version
- `bigquery-public-data.idc_v23` — specific version (replace 23 with desired version)
- `bigquery-public-data.idc_v24` — specific version (replace 24 with desired version)
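
A minimal sketch of how the two dataset names above could be built in code, so that an analysis can be pinned to one release. The helper name `idc_dataset` is hypothetical (it is not part of idc-index or the BigQuery client), and `dicom_all` is used only as an example table name.

```python
def idc_dataset(version=None):
    """Return the BigQuery dataset name for an IDC release (hypothetical helper).

    version=None selects the `idc_current` alias; an integer selects a pinned release.
    """
    if version is None:
        return "bigquery-public-data.idc_current"
    version = int(version)
    if version < 1:
        raise ValueError("IDC versions start at 1")
    return f"bigquery-public-data.idc_v{version}"


# Pin a query to one release so results stay reproducible.
pinned_table = f"`{idc_dataset(24)}.dicom_all`"
print(pinned_table)
```

Pinning to a numbered dataset keeps query results stable after the `idc_current` alias moves to a newer release.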

### Reproducing a Previous Analysis

@@ -39,7 +39,7 @@ Replace `{VERSION}` with the IDC release number. To find the current version:
```python
from idc_index import IDCClient
client = IDCClient()
print(client.get_idc_version()) # e.g., "23" for v23
print(client.get_idc_version()) # e.g., "v24" for current version
```
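
The example comment changed from `"23"` to `"v24"` between revisions, so code that needs the bare release number for `{VERSION}` may want to accept both forms. The sketch below is one way to do that; `idc_version_number` is a hypothetical helper, not an idc-index function.

```python
def idc_version_number(version):
    """Convert "v24" or "24" to the integer 24 (hypothetical helper)."""
    text = str(version).strip().lower()
    if text.startswith("v"):
        text = text[1:]
    return int(text)  # raises ValueError on anything that is not a number


for raw in ("v24", "23"):
    print(raw, "->", idc_version_number(raw))
```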

- **~96% data coverage** - Only replicates data from `idc-open-data` bucket (missing ~4% from other buckets)
@@ -334,7 +334,7 @@ credentials, project = default()
credentials.refresh(Request())

# Build authenticated request
base_url = "https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v23/dicomWeb"
base_url = "https://healthcare.googleapis.com/v1/projects/nci-idc-data/locations/us-central1/datasets/idc/dicomStores/idc-store-v24/dicomWeb"

response = requests.get(
f"{base_url}/studies",
@@ -1,6 +1,6 @@
# Digital Pathology Guide for IDC

**Tested with:** IDC data version v23, idc-index 0.11.10
**Tested with:** idc-index 0.12.1 (IDC data version v24)

For general IDC queries and downloads, use `idc-index` (see main SKILL.md). This guide covers slide microscopy (SM) imaging, microscopy bulk simple annotations (ANN), and segmentations (SEG) in the context of digital pathology in IDC.

@@ -251,12 +251,12 @@ client.sql_query("""
SELECT
ar.analysis_result_id,
ar.analysis_result_title,
ar.Modalities,
ar.Subjects,
ar.Collections
ar.modalities,
ar.subjects,
ar.collections
FROM analysis_results_index ar
WHERE ar.Modalities LIKE '%ANN%' OR ar.Modalities LIKE '%SEG%'
ORDER BY ar.Subjects DESC
WHERE ar.modalities LIKE '%ANN%' OR ar.modalities LIKE '%SEG%'
ORDER BY ar.subjects DESC
""")
```
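
To see what the filter above selects without fetching the real index, the same `WHERE`/`ORDER BY` clause can be run against a toy table. This is only a sketch: the standard-library `sqlite3` module stands in for `client.sql_query`, the rows are made up, and it assumes `modalities` is stored as a delimited string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis_results_index ("
    "analysis_result_id TEXT, analysis_result_title TEXT, "
    "modalities TEXT, subjects INTEGER, collections TEXT)"
)
# Made-up rows for illustration only; not real IDC analysis results.
conn.executemany(
    "INSERT INTO analysis_results_index VALUES (?, ?, ?, ?, ?)",
    [
        ("demo-a", "Demo nuclei annotations", "ANN", 120, "demo_collection_1"),
        ("demo-b", "Demo organ segmentations", "SEG,SR", 450, "demo_collection_2"),
        ("demo-c", "Demo measurements", "SR", 900, "demo_collection_3"),
    ],
)

rows = conn.execute(
    """
    SELECT ar.analysis_result_id, ar.modalities, ar.subjects
    FROM analysis_results_index ar
    WHERE ar.modalities LIKE '%ANN%' OR ar.modalities LIKE '%SEG%'
    ORDER BY ar.subjects DESC
    """
).fetchall()
print(rows)
```

Only the rows whose `modalities` value contains `ANN` or `SEG` are returned, largest subject count first; the `SR`-only row is dropped.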

@@ -1,6 +1,6 @@
# Index Tables Guide for IDC

**Tested with:** idc-index 0.11.14 (IDC data version v23)
**Tested with:** idc-index 0.12.3 (IDC data version v24)

This guide covers the structure and access patterns for IDC index tables: programmatic schema discovery, DataFrame access, and join column references. For the overview of available tables and their purposes, see the "Index Tables" section in the main SKILL.md.

@@ -34,7 +34,7 @@ results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")

# Fetch and query additional indices
client.fetch_index("collections_index")
collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")
collections = client.sql_query("SELECT collection_id, cancer_types, tumor_locations FROM collections_index")

client.fetch_index("analysis_results_index")
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
@@ -130,6 +130,9 @@ Use this table to identify join columns between index tables. Always call `clien
| `index` | `volume_geometry_index` | `index.SeriesInstanceUID = volume_geometry_index.SeriesInstanceUID` |
| `index` | `rtstruct_index` | `index.SeriesInstanceUID = rtstruct_index.SeriesInstanceUID` |
| `rtstruct_index` | `index` (source images) | `rtstruct_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID` |
| `index` | `ct_index` | `index.SeriesInstanceUID = ct_index.SeriesInstanceUID` |
| `index` | `mr_index` | `index.SeriesInstanceUID = mr_index.SeriesInstanceUID` |
| `index` | `pt_index` | `index.SeriesInstanceUID = pt_index.SeriesInstanceUID` |

For complete query examples using these joins, see `references/sql_patterns.md`.
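
As a rough illustration of the `SeriesInstanceUID` join conditions in the table, the sketch below joins two toy tables. It uses the standard-library `sqlite3` module in place of `client.sql_query`, and every UID and value is a placeholder rather than real IDC data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "index" is a reserved word in SQLite, hence the double quotes in this sketch.
conn.execute(
    'CREATE TABLE "index" (SeriesInstanceUID TEXT, collection_id TEXT, Modality TEXT)'
)
conn.execute("CREATE TABLE ct_index (SeriesInstanceUID TEXT, SliceThickness REAL)")
conn.executemany(
    'INSERT INTO "index" VALUES (?, ?, ?)',
    [
        ("1.2.3.1", "demo_collection", "CT"),
        ("1.2.3.2", "demo_collection", "CT"),
        ("1.2.3.3", "demo_collection", "MR"),
    ],
)
conn.executemany(
    "INSERT INTO ct_index VALUES (?, ?)",
    [("1.2.3.1", 1.25), ("1.2.3.2", 5.0)],
)

# Inner join: only series present in both tables survive, then filter on a CT column.
rows = conn.execute(
    """
    SELECT i.SeriesInstanceUID, i.collection_id, c.SliceThickness
    FROM "index" i
    JOIN ct_index c ON i.SeriesInstanceUID = c.SeriesInstanceUID
    WHERE c.SliceThickness <= 2.0
    """
).fetchall()
print(rows)
```

The MR series has no row in `ct_index`, so the inner join drops it before the slice-thickness filter is applied.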

@@ -38,6 +38,9 @@ https://storage.googleapis.com/idc-index-data-artifacts/current/release_artifact
| `collections_index.parquet` | — | Collection-level metadata |
| `analysis_results_index.parquet` | — | Derived dataset metadata |
| `clinical_index.parquet` | ~0.2 MB | Clinical data column dictionary |
| `ct_index.parquet` | — | CT acquisition/reconstruction parameters |
| `mr_index.parquet` | — | MR sequence/acquisition parameters |
| `pt_index.parquet` | — | PET acquisition/radiopharmaceutical parameters |
| `prior_versions_index.parquet` | — | Series from previous IDC releases |

**Note:** the main index file is named `idc_index.parquet`, not `index.parquet`. Reference it with an alias in SQL queries (e.g., `FROM read_parquet(...) AS index`).
83 changes: 81 additions & 2 deletions scientific-skills/imaging-data-commons/references/sql_patterns.md
@@ -1,6 +1,6 @@
# SQL Query Patterns for IDC

**Tested with:** idc-index 0.11.14 (IDC data version v23)
**Tested with:** idc-index 0.12.3 (IDC data version v24)

Quick reference for common SQL query patterns when working with IDC data. For detailed examples with context, see the "Core Capabilities" section in the main SKILL.md.

@@ -14,6 +14,7 @@ Load this guide when you need quick-reference SQL patterns for:
- Linking imaging data to clinical data
- Filtering by 3D volume geometry validity (volume_geometry_index)
- Finding RT Structure Set series and ROI metadata (rtstruct_index)
- Filtering by CT/MR/PET acquisition parameters (ct_index, mr_index, pt_index)

For table schemas, DataFrame access, and join column references, see `references/index_tables_guide.md`.

@@ -74,7 +75,7 @@ client.sql_query("""
# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
SELECT analysis_result_id, analysis_result_title, Collections, Modalities
SELECT analysis_result_id, analysis_result_title, collections, modalities
FROM analysis_results_index
""")

@@ -277,6 +278,84 @@ client.sql_query("""
""")
```

## Modality Acquisition Parameters

`ct_index`, `mr_index`, and `pt_index` (added in idc-index 0.12.3) expose acquisition and reconstruction parameters for CT, MR, and PET series. All three join to `index` on `SeriesInstanceUID`. In `ct_index`, tube current, exposure, and exposure time are stored as `_min`/`_max` column pairs; the two values differ for dose-modulated acquisitions.

```python
client.fetch_index("ct_index")
client.fetch_index("mr_index")
client.fetch_index("pt_index")

# CT: thin-slice series (≤2 mm) with a recorded reconstruction kernel
client.sql_query("""
SELECT i.collection_id, i.SeriesInstanceUID, i.BodyPartExamined,
c.SliceThickness, c.ConvolutionKernel, c.KVP
FROM index i
JOIN ct_index c ON i.SeriesInstanceUID = c.SeriesInstanceUID
WHERE c.SliceThickness <= 2.0
AND c.ConvolutionKernel IS NOT NULL
LIMIT 10
""")

# CT: dose-modulated acquisitions (tube current varies across slices)
client.sql_query("""
SELECT i.collection_id, c.SeriesInstanceUID,
c.XRayTubeCurrent_min, c.XRayTubeCurrent_max, c.SliceThickness
FROM ct_index c
JOIN index i ON c.SeriesInstanceUID = i.SeriesInstanceUID
WHERE c.XRayTubeCurrent_min != c.XRayTubeCurrent_max
LIMIT 10
""")

# MR: DWI series (non-null DiffusionBValue) at roughly 3T or higher field strength
client.sql_query("""
SELECT i.collection_id, i.SeriesInstanceUID, i.SeriesDescription,
m.MagneticFieldStrength, m.DiffusionBValue
FROM index i
JOIN mr_index m ON i.SeriesInstanceUID = m.SeriesInstanceUID
WHERE m.DiffusionBValue IS NOT NULL
AND m.MagneticFieldStrength >= 2.9
LIMIT 10
""")

# MR: series acquired with an echo train (EchoTrainLength > 1); EchoTime is stored as an array
client.sql_query("""
SELECT i.collection_id, i.SeriesInstanceUID,
m.EchoTime, m.EchoTrainLength, m.ScanningSequence
FROM index i
JOIN mr_index m ON i.SeriesInstanceUID = m.SeriesInstanceUID
WHERE m.EchoTrainLength > 1
LIMIT 10
""")

# PET: FDG series, listing reconstruction method, units, and decay correction
client.sql_query("""
SELECT i.collection_id, i.SeriesInstanceUID,
p.RadionuclideCodeMeaning, p.ReconstructionMethod,
p.Units, p.DecayCorrection
FROM index i
JOIN pt_index p ON i.SeriesInstanceUID = p.SeriesInstanceUID
WHERE p.RadionuclideCodeMeaning LIKE '%fluorodeoxyglucose%'
LIMIT 10
""")

# PET: dynamic acquisitions (ActualFrameDuration is array with multiple values)
client.sql_query("""
SELECT i.collection_id, i.SeriesInstanceUID,
p.NumberOfTimeSlices, p.ActualFrameDuration
FROM index i
JOIN pt_index p ON i.SeriesInstanceUID = p.SeriesInstanceUID
WHERE p.NumberOfTimeSlices > 1
LIMIT 10
""")
```

Key columns by table (use `client.indices_overview["ct_index"]["schema"]` for the full list):
- **ct_index**: `SliceThickness`, `KVP`, `ConvolutionKernel`, `SpiralPitchFactor`, `XRayTubeCurrent_min/max`, `Exposure_min/max`, `PixelSpacing_row_mm/col_mm`, `Rows`, `Columns`
- **mr_index**: `MagneticFieldStrength`, `ScanningSequence`, `SequenceVariant`, `MRAcquisitionType`, `EchoTime` (array), `RepetitionTime`, `FlipAngle`, `DiffusionBValue` (array), `NumberOfTemporalPositions`, `ReceiveCoilName`
- **pt_index**: `RadionuclideCodeMeaning`, `Radiopharmaceutical`, `RadionuclideTotalDose`, `ReconstructionMethod`, `DecayCorrection`, `AttenuationCorrectionMethod`, `ActualFrameDuration` (array), `NumberOfTimeSlices`
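
Array-valued columns such as `EchoTime` and the `_min`/`_max` pairs can also be checked after the query returns. The sketch below assumes array cells arrive in Python as sequences (or `None`); the helper names are hypothetical and the values are made up.

```python
def distinct_count(values):
    """Count distinct non-null entries in an array-valued cell (hypothetical helper)."""
    if values is None:
        return 0
    return len({v for v in values if v is not None})


def is_multi_echo(echo_times):
    """True when a series reports more than one distinct EchoTime."""
    return distinct_count(echo_times) > 1


def is_dose_modulated(current_min, current_max):
    """True when tube current differs between the _min and _max columns."""
    if current_min is None or current_max is None:
        return False
    return current_min != current_max


# Made-up rows standing in for query results.
series = [
    {"uid": "demo-1", "EchoTime": [4.9, 9.8, 14.7]},
    {"uid": "demo-2", "EchoTime": [30.0, 30.0]},
    {"uid": "demo-3", "EchoTime": None},
]
multi_echo = [s["uid"] for s in series if is_multi_echo(s["EchoTime"])]
print(multi_echo)
print(is_dose_modulated(120, 380), is_dose_modulated(200, 200))
```

Counting distinct values rather than array length avoids treating a series that repeats one echo time as multi-echo.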

## Resources

- `references/index_tables_guide.md` for table schemas, DataFrame access, and join column references
94 changes: 93 additions & 1 deletion scientific-skills/imaging-data-commons/references/use_cases.md
@@ -1,6 +1,6 @@
# Common Use Cases for IDC

**Tested with:** idc-index 0.11.9 (IDC data version v23)
**Tested with:** idc-index 0.12.1 (IDC data version v24)

This guide provides complete end-to-end workflow examples for common IDC use cases. Each use case demonstrates the full workflow from query to download with best practices.

@@ -178,6 +178,98 @@ client.download_from_selection(
cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
```

## Use Case 5: Batch Download with Filtering

**Objective:** Download a large filtered dataset in batches to avoid timeouts

**Steps:**
```python
from idc_index import IDCClient
import os
import pandas as pd

client = IDCClient()

# Find chest CT scans from GE scanners with a permissive license
query = """
SELECT
    SeriesInstanceUID,
    PatientID,
    collection_id,
    ManufacturerModelName
FROM index
WHERE Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
  AND Manufacturer = 'GE MEDICAL SYSTEMS'
  AND license_short_name = 'CC BY 4.0'
LIMIT 100
"""

results = client.sql_query(query)

# Save manifest for reproducibility
results.to_csv('lung_ct_manifest.csv', index=False)

# Download in batches to avoid timeout
batch_size = 10
for i in range(0, len(results), batch_size):
    batch = results.iloc[i:i+batch_size]
    batch_dir = f"./data/batch_{i//batch_size}"
    # Create the target directory up front, in case the client expects it to exist
    os.makedirs(batch_dir, exist_ok=True)
    client.download_from_selection(
        seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
        downloadDir=batch_dir
    )
```
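
The batching arithmetic in the loop above can be checked on its own, without any download. This is a small sketch: `batches` is a hypothetical helper and the UIDs are placeholders.

```python
def batches(items, batch_size):
    """Yield (batch_number, sublist) pairs that cover items in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = list(items)
    for start in range(0, len(items), batch_size):
        yield start // batch_size, items[start:start + batch_size]


series_uids = [f"1.2.3.{n}" for n in range(25)]  # placeholder UIDs
plan = [(f"./data/batch_{n}", len(chunk)) for n, chunk in batches(series_uids, 10)]
print(plan)
```

With 25 series and a batch size of 10, the last batch holds the remaining 5, and the directory numbering matches the `i//batch_size` expression used in the loop.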

## Use Case 6: Integration with Analysis Pipelines

**Objective:** Load downloaded DICOM files into Python for processing

**Read individual DICOM files with pydicom:**
```python
import pydicom
import os

series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."

dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
               if f.endswith('.dcm')]

ds = pydicom.dcmread(dicom_files[0])
print(f"Patient ID: {ds.PatientID}")
print(f"Modality: {ds.Modality}")
print(f"Image shape: {ds.pixel_array.shape}")
```

**Build 3D volume from CT series:**
```python
import pydicom
import numpy as np
from pathlib import Path

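def load_ct_series(series_path):
    # Read every slice in the series directory
    files = sorted(Path(series_path).glob('*.dcm'))
    slices = [pydicom.dcmread(str(f)) for f in files]
    # Order slices by z position (assumes an axial acquisition)
    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))
    # Stack stored pixel values; rescale slope/intercept are not applied here
    volume = np.stack([s.pixel_array for s in slices])
    return volume, slices[0]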

volume, metadata = load_ct_series("./data/lung_ct/series_dir")
print(f"Volume shape: {volume.shape}") # (z, y, x)
```
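
Sorting on `ImagePositionPatient[2]` works for axial series; for oblique or non-axial acquisitions, a common alternative is to project each slice position onto the slice normal derived from `ImageOrientationPatient`. The sketch below shows that calculation with plain Python lists; the function names are hypothetical and the positions are made up.

```python
def slice_normal(orientation):
    """Cross product of the row and column direction cosines (6 values)."""
    rx, ry, rz, cx, cy, cz = (float(v) for v in orientation)
    return (ry * cz - rz * cy, rz * cx - rx * cz, rx * cy - ry * cx)


def position_along_normal(position, orientation):
    """Distance of a slice origin along the slice normal, usable as a sort key."""
    nx, ny, nz = slice_normal(orientation)
    px, py, pz = (float(v) for v in position)
    return px * nx + py * ny + pz * nz


def spacing_between(sorted_positions):
    """Gaps between consecutive slices; unequal gaps hint at missing slices."""
    return [b - a for a, b in zip(sorted_positions, sorted_positions[1:])]


axial = [1, 0, 0, 0, 1, 0]  # row along x, column along y
origins = [(0, 0, 10.0), (0, 0, 0.0), (0, 0, 5.0)]  # made-up slice origins
ordered = sorted(position_along_normal(p, axial) for p in origins)
print(ordered, spacing_between(ordered))
```

In `load_ct_series`, the sort key could then become `lambda s: position_along_normal(s.ImagePositionPatient, s.ImageOrientationPatient)`.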

**Load DICOM series with SimpleITK (recommended for correct geometry):**
```python
import SimpleITK as sitk

series_path = "./data/ct_series"
reader = sitk.ImageSeriesReader()
dicom_names = reader.GetGDCMSeriesFileNames(series_path)
reader.SetFileNames(dicom_names)
image = reader.Execute()

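# CurvatureFlow needs a floating-point image; CT DICOM usually loads as 16-bit integers
image_float = sitk.Cast(image, sitk.sitkFloat32)
smoothed = sitk.CurvatureFlow(image1=image_float, timeStep=0.125, numberOfIterations=5)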
sitk.WriteImage(smoothed, "processed_volume.nii.gz")
```

## Resources

- Main SKILL.md for core API patterns (query, download, visualize)