feat: add DICOM and NIfTI handlers by varuni7 · Pull Request #91 · MIT-LCP/croissant-baker

varuni7 · 2026-04-19T15:03:30Z

Add two new file handlers to support medical and neuroimaging formats:

DICOMHandler (.dcm, .dicom): reads image geometry, modality, pixel encoding, and acquisition parameters via pydicom (header-only, stop_before_pixels=True). Also detects extension-less DICOM files via magic bytes (DICM at offset 128).
NIfTIHandler (.nii, .nii.gz): reads spatial dimensions, voxel spacing, data type, NIfTI version, and TR for 4D fMRI volumes via nibabel (header-only, data array never loaded).

Both handlers implement build_croissant() to produce a FileSet and summary RecordSet. The tr_seconds field is only added when at least one 4D volume is present in the dataset.

Also adds:

pydicom>=2.4 and nibabel>=5.0 to dependencies
48 tests (24 per handler) using tmp_path + synthetic files
DICOM and NIfTI sections in supported-formats.md
Regenerated docs/_generated/formats-table.md
Updated E2E snapshot outputs from upstream mlcroissant changes

Add two new file handlers to support medical and neuroimaging formats: - DICOMHandler (.dcm, .dicom): reads image geometry, modality, pixel encoding, and acquisition parameters via pydicom (header-only, stop_before_pixels=True). Also detects extension-less DICOM files via magic bytes (DICM at offset 128). - NIfTIHandler (.nii, .nii.gz): reads spatial dimensions, voxel spacing, data type, NIfTI version, and TR for 4D fMRI volumes via nibabel (header-only, data array never loaded). Both handlers implement build_croissant() to produce a FileSet and summary RecordSet. The tr_seconds field is only added when at least one 4D volume is present in the dataset. Also adds: - pydicom>=2.4 and nibabel>=5.0 to dependencies - 48 tests (24 per handler) using tmp_path + synthetic files - DICOM and NIfTI sections in supported-formats.md - Regenerated docs/_generated/formats-table.md - Updated E2E snapshot outputs from upstream mlcroissant changes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tompollard

Thanks @varuni7, this is great. I have added a few small notes inline, and will do some testing.

tompollard · 2026-04-19T23:33:40Z

+
+
+def _read_dicom_properties(file_path: Path) -> Dict:
+    import pydicom


Please could you move this import to the top of the module?

tompollard · 2026-04-19T23:36:09Z

+    Extracts spatial dimensions, voxel spacing, data type, and TR for 4D volumes.
+    Works for both NIfTI-1 (.nii, .nii.gz) and NIfTI-2 formats.
+    """
+    import nibabel as nib


Please move import to top of module

tompollard · 2026-04-19T23:38:15Z

    }
  ],
-  "datePublished": "2021-05-06T00:00:00",
+  "datePublished": "2021-05-06",


Why has this changed? ("datePublished": "2021-05-06T00:00:00" -> "datePublished": "2021-05-06")

tompollard · 2026-04-19T23:38:53Z

    "name": "GitHub Archive"
  },
-  "datePublished": "2023-01-01T00:00:00",
+  "datePublished": "2023-01-01",


Why has this changed? ("datePublished": "2023-01-01T00:00:00" -> "datePublished": "2023-01-01")

tompollard · 2026-04-19T23:39:48Z

    "name": "Hillel Yaffe Medical Center"
  },
-  "datePublished": "2024-11-14T00:00:00",
+  "datePublished": "2024-11-14",


As above, why has this timestamp changed?

tompollard · 2026-04-19T23:40:05Z

    }
  ],
-  "datePublished": "2023-01-06T00:00:00",
+  "datePublished": "2023-01-06",


As above, why has this timestamp changed?

tompollard · 2026-04-19T23:41:54Z

+
+
+def test_nifti_handler_registered() -> None:
+    from croissant_baker.handlers.registry import find_handler, register_all_handlers


Unless there is a good reason to import within a function, please move imports to the top of the module.

tompollard · 2026-04-20T01:09:12Z

I tested on several DICOMs downloaded from PhysioNet and overall it worked well. One minor issue is that a couple of the DICOMs had empty modality (0008,0060) fields in their DICOM header, so the handler's _safe_get(ds, "Modality") call returns None.

As a result, the fileset description shows 6 DICOM file(s) (DX (2), US (2)) (i.e. the modalities sum to 4 but there are 6 files).

    {
      "@type": "cr:FileSet",
      "@id": "dicom-files",
      "name": "DICOM files",
      "description": "6 DICOM file(s) (DX (2), US (2))",
      "encodingFormat": "application/dicom",
      "includes": [
        "**/*.dcm",
        "**/*.dicom"
      ]
    }

Perhaps instead we should acknowledge the missing modalities here (e.g. 6 DICOM file(s) (DX (2), US (2), unknown (2)))?

Another minor question is whether we should capture DICOM tags in the metadata to demonstrate where values were extracted from, e.g. instead of:

  {                                                                                                                                               
    "@type": "cr:Field",
    "@id": "dicom/modality",                                                                                                                      
    "name": "modality",                                                                                                                           
    "description": "DICOM modality (e.g. CT, MR, PT)",                                                                                            
    "dataType": "sc:Text",
    "source": {
      "fileSet": { "@id": "dicom-files" },
      "extract": { "fileProperty": "content" }
    }
  }

...we could do:

  {
    "@type": "cr:Field",
    "@id": "dicom/modality",
    "name": "modality",
    "description": "DICOM Modality (0008,0060)",
    "dataType": "sc:Text",
    "source": {
      "fileSet": { "@id": "dicom-files" },
      "extract": { "fileProperty": "content" }
    }
  }

These points don't need to be addressed in this PR. We could raise an issue and add in a future improvement.

tompollard · 2026-04-20T01:20:06Z

Also tested on several .nii files and it worked well. As with DICOM, it would be nice if we could capture which values were extracted from which NIfTI header property. This could be addressed in a later PR.

rafiattrach

@varuni7 Thanks for the great addition! Could you please take a look at some of the comments down below?

rafiattrach · 2026-04-24T23:33:06Z

+def _read_dicom_properties(file_path: Path) -> Dict:
+    import pydicom
+
+    ds = pydicom.dcmread(str(file_path), stop_before_pixels=True)


Tested this on pydicom's own test corpus (pydicom/pydicom-data, 85 files). 5 files fail with 'DICM' prefix is missing from the header. Use force=True to force reading. These are raw pixel stream fragments referenced from a DICOMDIR, not broken DICOMs. Per DICOM Part 10 a file without the DICM preamble at offset 128 is not a standalone DICOM file, so the correct behaviour is to skip, not raise.

Suggested fix: run .dcm and .dicom through the existing _has_dicom_magic guard (currently only used on the extensionless branch) and skip cleanly when the preamble is missing. A single summary line like Skipped N DICOM file(s) without DICM preamble keeps the output honest without drowning the user in per file tracebacks. Pairs with the can_handle comment below.

rafiattrach · 2026-04-25T00:48:34Z

+
+    def can_handle(self, file_path: Path) -> bool:
+        suffix = file_path.suffix.lower()
+        if suffix in SUPPORTED_EXTENSIONS:


can_handle defines the handler's contract to the registry, so when it says True and then extract_metadata raises on the file's content, the contract is broken. Right now a .dcm file without the DICM preamble returns True here but then raises inside extract_metadata, while an extensionless file with identical content is correctly rejected. The asymmetry has no principled basis.

I had a look at the other handlers to see whether this is a broader concern. Text format handlers (CSV, TSV, JSON, FHIR, WFDB .hea) are fine as they are because content validation is essentially parsing. The handlers that would benefit from the same uniform magic byte check are the other binary formats:

ImageHandler: PNG (\x89PNG), JPEG (\xff\xd8\xff), TIFF (II*\0 / MM\0*) all have well known magic bytes

ParquetHandler: PAR1 at the start and end of every Parquet file

NIfTIHandler: n+1\0 at offset 344 (NIfTI-1) or n+2\0 at offset 4 (NIfTI-2)

Not suggesting we fix those in this PR, that would blow the scope. But it would be good to fix the DICOM asymmetry here, since it is already half implemented, and open a follow up issue to extend the same pattern to the other binary handlers. Suggested fix for this PR: apply _has_dicom_magic to .dcm and .dicom too, not only the extensionless branch, and skip cleanly when the preamble is missing (pairs with the _read_dicom_properties comment above).

rafiattrach · 2026-04-25T00:55:20Z

+    if manufacturer is not None:
+        props["manufacturer"] = str(manufacturer).strip()
+
+    sop_class = _safe_get(ds, "SOPClassUID")


This captures SOPClassUID but not (0020,000D) StudyInstanceUID or (0020,000E) SeriesInstanceUID.

From what I researched, DICOM organises data as a 4 level tree: Patient, Study, Series, Instance. Each level has its own UID. SeriesInstanceUID is the one that groups images into scans (one CT series is one scan, one MR series is one pulse sequence). SOPClassUID is a different thing: it is a class reference that tells you what kind of object the file is ("CT Image Storage", "MR Image Storage", "Segmentation Storage"), not which scan it belongs to.

Without SeriesInstanceUID in the emitted metadata, a user with 500 CT slices on disk cannot regroup them into the 20 scans they belong to, which is the most basic operation anyone performs on a DICOM dataset. This is a functional gap rather than a nice to have.

Let's add both if you agree, exposed as cr:Fields in build_croissant. Something like this?

study_uid = _safe_get(ds, "StudyInstanceUID") if study_uid is not None: props["study_instance_uid"] = str(study_uid) series_uid = _safe_get(ds, "SeriesInstanceUID") if series_uid is not None: props["series_instance_uid"] = str(series_uid)

rafiattrach · 2026-04-25T00:56:13Z

+        dtypes_str = ", ".join(dtype_counts.keys()) if dtype_counts else "unknown dtype"
+
+        mime_types = set()
+        includes = []


includes = [] is overwritten four lines later and does not depend on per file metadata

could be removed/tidied up?

rafiattrach · 2026-04-25T00:57:22Z

+        t_range = summary.get("dim_t_range")
+        ndim_max = summary.get("ndim_max", 3)
+        if ndim_max >= 4 and t_range:
+            dims_note += f", {t_range[0]}-{t_range[1]} volumes"


The spatial dim loop above already collapses N-N to N. Same rule here keeps the description consistent so a homogeneous 4D set reads (64x64x30, 100 volumes) instead of (64x64x30, 100-100 volumes), potentially something like this:

if ndim_max >= 4 and t_range: if t_range[0] == t_range[1]: dims_note += f", {t_range[0]} volumes" else: dims_note += f", {t_range[0]}-{t_range[1]} volumes"

tompollard reviewed Apr 19, 2026

View reviewed changes

rafiattrach requested changes Apr 25, 2026

View reviewed changes

This was referenced Apr 25, 2026

CLI "Files" count in summary includes FileSets, not just FileObjects #92

Open

Apply magic byte validation in can_handle for binary format handlers #93

Open



		def _read_dicom_properties(file_path: Path) -> Dict:
		import pydicom



		def test_nifti_handler_registered() -> None:
		from croissant_baker.handlers.registry import find_handler, register_all_handlers

Conversation

varuni7 commented Apr 19, 2026

Uh oh!

tompollard left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tompollard commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tompollard commented Apr 20, 2026

Uh oh!

rafiattrach left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tompollard commented Apr 20, 2026 •

edited

Loading