Skip to content

feat: add DICOM and NIfTI handlers #91

Open
varuni7 wants to merge 1 commit intomainfrom
feat/dicom-nifti-handlers
Open

feat: add DICOM and NIfTI handlers #91
varuni7 wants to merge 1 commit intomainfrom
feat/dicom-nifti-handlers

Conversation

@varuni7
Copy link
Copy Markdown
Collaborator

@varuni7 varuni7 commented Apr 19, 2026

Add two new file handlers to support medical and neuroimaging formats:

  • DICOMHandler (.dcm, .dicom): reads image geometry, modality, pixel encoding, and acquisition parameters via pydicom (header-only, stop_before_pixels=True). Also detects extension-less DICOM files via magic bytes (DICM at offset 128).

  • NIfTIHandler (.nii, .nii.gz): reads spatial dimensions, voxel spacing, data type, NIfTI version, and TR for 4D fMRI volumes via nibabel (header-only, data array never loaded).

Both handlers implement build_croissant() to produce a FileSet and summary RecordSet. The tr_seconds field is only added when at least one 4D volume is present in the dataset.

Also adds:

  • pydicom>=2.4 and nibabel>=5.0 to dependencies
  • 48 tests (24 per handler) using tmp_path + synthetic files
  • DICOM and NIfTI sections in supported-formats.md
  • Regenerated docs/_generated/formats-table.md
  • Updated E2E snapshot outputs from upstream mlcroissant changes

Add two new file handlers to support medical and neuroimaging formats:

- DICOMHandler (.dcm, .dicom): reads image geometry, modality, pixel
  encoding, and acquisition parameters via pydicom (header-only,
  stop_before_pixels=True). Also detects extension-less DICOM files
  via magic bytes (DICM at offset 128).

- NIfTIHandler (.nii, .nii.gz): reads spatial dimensions, voxel
  spacing, data type, NIfTI version, and TR for 4D fMRI volumes via
  nibabel (header-only, data array never loaded).

Both handlers implement build_croissant() to produce a FileSet and
summary RecordSet. The tr_seconds field is only added when at least
one 4D volume is present in the dataset.

Also adds:
- pydicom>=2.4 and nibabel>=5.0 to dependencies
- 48 tests (24 per handler) using tmp_path + synthetic files
- DICOM and NIfTI sections in supported-formats.md
- Regenerated docs/_generated/formats-table.md
- Updated E2E snapshot outputs from upstream mlcroissant changes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@tompollard tompollard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @varuni7, this is great. I have added a few small notes inline, and will do some testing.



def _read_dicom_properties(file_path: Path) -> Dict:
import pydicom
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please could you move this import to the top of the module?

Extracts spatial dimensions, voxel spacing, data type, and TR for 4D volumes.
Works for both NIfTI-1 (.nii, .nii.gz) and NIfTI-2 formats.
"""
import nibabel as nib
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move import to top of module

}
],
"datePublished": "2021-05-06T00:00:00",
"datePublished": "2021-05-06",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has this changed? ("datePublished": "2021-05-06T00:00:00" -> "datePublished": "2021-05-06")

"name": "GitHub Archive"
},
"datePublished": "2023-01-01T00:00:00",
"datePublished": "2023-01-01",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has this changed? ("datePublished": "2023-01-01T00:00:00" -> "datePublished": "2023-01-01")

"name": "Hillel Yaffe Medical Center"
},
"datePublished": "2024-11-14T00:00:00",
"datePublished": "2024-11-14",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, why has this timestamp changed?

}
],
"datePublished": "2023-01-06T00:00:00",
"datePublished": "2023-01-06",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, why has this timestamp changed?



def test_nifti_handler_registered() -> None:
from croissant_baker.handlers.registry import find_handler, register_all_handlers
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there is a good reason to import within a function, please move imports to the top of the module.

@tompollard
Copy link
Copy Markdown
Member

tompollard commented Apr 20, 2026

I tested on several DICOMs downloaded from PhysioNet and overall it worked well. One minor issue is that a couple of the DICOMs had empty modality (0008,0060) fields in their DICOM header, so the handler's _safe_get(ds, "Modality") call returns None.

As a result, the fileset description shows 6 DICOM file(s) (DX (2), US (2)) (i.e. the modalities sum to 4 but there are 6 files).

    {
      "@type": "cr:FileSet",
      "@id": "dicom-files",
      "name": "DICOM files",
      "description": "6 DICOM file(s) (DX (2), US (2))",
      "encodingFormat": "application/dicom",
      "includes": [
        "**/*.dcm",
        "**/*.dicom"
      ]
    }

Perhaps instead we should acknowledge the missing modalities here (e.g. 6 DICOM file(s) (DX (2), US (2), unknown (2)))?

Another minor question is whether we should capture DICOM tags in the metadata to demonstrate where values were extracted from, e.g. instead of:

  {                                                                                                                                               
    "@type": "cr:Field",
    "@id": "dicom/modality",                                                                                                                      
    "name": "modality",                                                                                                                           
    "description": "DICOM modality (e.g. CT, MR, PT)",                                                                                            
    "dataType": "sc:Text",
    "source": {
      "fileSet": { "@id": "dicom-files" },
      "extract": { "fileProperty": "content" }
    }
  }

...we could do:

  {
    "@type": "cr:Field",
    "@id": "dicom/modality",
    "name": "modality",
    "description": "DICOM Modality (0008,0060)",
    "dataType": "sc:Text",
    "source": {
      "fileSet": { "@id": "dicom-files" },
      "extract": { "fileProperty": "content" }
    }
  }

These points don't need to be addressed in this PR. We could raise an issue and add in a future improvement.

@tompollard
Copy link
Copy Markdown
Member

Also tested on several .nii files and it worked well. As with DICOM, it would be nice if we could capture which values were extracted from which NIfTI header property. This could be addressed in a later PR.

Copy link
Copy Markdown
Collaborator

@rafiattrach rafiattrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@varuni7 Thanks for the great addition! Could you please take a look at some of the comments down below?

def _read_dicom_properties(file_path: Path) -> Dict:
import pydicom

ds = pydicom.dcmread(str(file_path), stop_before_pixels=True)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this on pydicom's own test corpus (pydicom/pydicom-data, 85 files). 5 files fail with 'DICM' prefix is missing from the header. Use force=True to force reading. These are raw pixel stream fragments referenced from a DICOMDIR, not broken DICOMs. Per DICOM Part 10 a file without the DICM preamble at offset 128 is not a standalone DICOM file, so the correct behaviour is to skip, not raise.

Suggested fix: run .dcm and .dicom through the existing _has_dicom_magic guard (currently only used on the extensionless branch) and skip cleanly when the preamble is missing. A single summary line like Skipped N DICOM file(s) without DICM preamble keeps the output honest without drowning the user in per file tracebacks. Pairs with the can_handle comment below.


def can_handle(self, file_path: Path) -> bool:
suffix = file_path.suffix.lower()
if suffix in SUPPORTED_EXTENSIONS:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can_handle defines the handler's contract to the registry, so when it says True and then extract_metadata raises on the file's content, the contract is broken. Right now a .dcm file without the DICM preamble returns True here but then raises inside extract_metadata, while an extensionless file with identical content is correctly rejected. The asymmetry has no principled basis.

I had a look at the other handlers to see whether this is a broader concern. Text format handlers (CSV, TSV, JSON, FHIR, WFDB .hea) are fine as they are because content validation is essentially parsing. The handlers that would benefit from the same uniform magic byte check are the other binary formats:

  • ImageHandler: PNG (\x89PNG), JPEG (\xff\xd8\xff), TIFF (II*\0 / MM\0*) all have well known magic bytes
  • ParquetHandler: PAR1 at the start and end of every Parquet file
  • NIfTIHandler: n+1\0 at offset 344 (NIfTI-1) or n+2\0 at offset 4 (NIfTI-2)

Not suggesting we fix those in this PR, that would blow the scope. But it would be good to fix the DICOM asymmetry here, since it is already half implemented, and open a follow up issue to extend the same pattern to the other binary handlers. Suggested fix for this PR: apply _has_dicom_magic to .dcm and .dicom too, not only the extensionless branch, and skip cleanly when the preamble is missing (pairs with the _read_dicom_properties comment above).

if manufacturer is not None:
props["manufacturer"] = str(manufacturer).strip()

sop_class = _safe_get(ds, "SOPClassUID")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This captures SOPClassUID but not (0020,000D) StudyInstanceUID or (0020,000E) SeriesInstanceUID.

From what I researched, DICOM organises data as a 4 level tree: Patient, Study, Series, Instance. Each level has its own UID. SeriesInstanceUID is the one that groups images into scans (one CT series is one scan, one MR series is one pulse sequence). SOPClassUID is a different thing: it is a class reference that tells you what kind of object the file is ("CT Image Storage", "MR Image Storage", "Segmentation Storage"), not which scan it belongs to.

Without SeriesInstanceUID in the emitted metadata, a user with 500 CT slices on disk cannot regroup them into the 20 scans they belong to, which is the most basic operation anyone performs on a DICOM dataset. This is a functional gap rather than a nice to have.

Let's add both if you agree, exposed as cr:Fields in build_croissant. Something like this?

study_uid = _safe_get(ds, "StudyInstanceUID")
if study_uid is not None:
    props["study_instance_uid"] = str(study_uid)

series_uid = _safe_get(ds, "SeriesInstanceUID")
if series_uid is not None:
    props["series_instance_uid"] = str(series_uid)

dtypes_str = ", ".join(dtype_counts.keys()) if dtype_counts else "unknown dtype"

mime_types = set()
includes = []
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

includes = [] is overwritten four lines later and does not depend on per file metadata

could be removed/tidied up?

t_range = summary.get("dim_t_range")
ndim_max = summary.get("ndim_max", 3)
if ndim_max >= 4 and t_range:
dims_note += f", {t_range[0]}-{t_range[1]} volumes"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spatial dim loop above already collapses N-N to N. Same rule here keeps the description consistent so a homogeneous 4D set reads (64x64x30, 100 volumes) instead of (64x64x30, 100-100 volumes), potentially something like this:

if ndim_max >= 4 and t_range:
    if t_range[0] == t_range[1]:
        dims_note += f", {t_range[0]} volumes"
    else:
        dims_note += f", {t_range[0]}-{t_range[1]} volumes"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants