@akoumpa
Contributor

@akoumpa akoumpa commented Dec 3, 2025

This PR adds Delta Lake / Databricks support for streaming instruction datasets in NeMo Automodel.

Key changes:

New DeltaLakeDataset delta_lake_dataset.py:694-732 - A streaming dataset that reads from Delta Lake tables (local, S3, Azure, GCS, or Databricks Unity Catalog). Automatically selects the best backend:

  • deltalake library for simple tables
  • Spark for tables with deletion vectors
  • Databricks SQL Connector for Unity Catalog
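For readers skimming the thread, the backend choice listed above can be sketched roughly as follows. This is only an illustration of the decision order, not the code in delta_lake_dataset.py: the `Backend` enum, the `choose_backend` helper, the `has_deletion_vectors` flag, and the rule for recognizing a Unity Catalog name are all assumptions made for this example.

```python
from enum import Enum


class Backend(Enum):
    """Hypothetical labels for the three readers named in the PR description."""

    DELTALAKE = "deltalake"
    SPARK = "spark"
    DATABRICKS_SQL = "databricks_sql"


def choose_backend(path: str, has_deletion_vectors: bool = False) -> Backend:
    """Pick a reader for a Delta table (illustrative helper, not the PR's API)."""
    table = path.removeprefix("delta://")
    # Assumption: a slash-free, three-part dotted name (catalog.schema.table)
    # refers to a Unity Catalog table rather than a storage path.
    is_unity_catalog = "/" not in table and table.count(".") == 2
    if is_unity_catalog:
        return Backend.DATABRICKS_SQL
    # Assumption: the caller already knows whether the table uses deletion vectors.
    if has_deletion_vectors:
        return Backend.SPARK
    return Backend.DELTALAKE
```

In this sketch, `choose_backend("delta://catalog.schema.training_data")` resolves to the Databricks SQL backend, while a storage path such as `s3://bucket/table` falls through to Spark or deltalake depending on the flag.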

New ColumnMappedTextInstructionIterableDataset column_mapped_text_instruction_iterable_dataset.py:108-119 - A streaming variant that accepts the delta_storage_options and delta_sql_query parameters and routes Delta paths to the new backend.

Map-style dataset rejects Delta paths column_mapped_text_instruction_dataset.py:239-240 - Forces users to use the streaming variant for Delta Lake sources.
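A guard of this kind could look roughly like the sketch below. It assumes Delta sources are identified by the `delta://` prefix shown in the usage example; the function name `ensure_not_delta` and the error message are hypothetical and are not taken from column_mapped_text_instruction_dataset.py.

```python
def ensure_not_delta(path_or_dataset_id):
    """Hypothetical guard: refuse Delta sources in a map-style dataset."""
    # Assumption: Delta sources are recognized by the "delta://" prefix only.
    if isinstance(path_or_dataset_id, str) and path_or_dataset_id.startswith("delta://"):
        raise ValueError(
            "Delta Lake sources are only supported by the streaming (iterable) "
            f"dataset, got: {path_or_dataset_id!r}"
        )
    return path_or_dataset_id
```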

New ReservoirSampler reservoir_sampler.py:21-34 - Bounded-memory shuffle for streaming datasets.
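As background on what a bounded-memory shuffle means for a stream, here is a minimal shuffle-buffer sketch. It is not the ReservoirSampler from reservoir_sampler.py; the function name `buffered_shuffle` and its parameters are assumptions chosen for illustration. The idea is to hold at most `buffer_size` items in memory and, for each new item, emit a randomly chosen buffered item and put the new one in its place.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    stream: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size items."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Steady state: emit a random buffered item and store the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain phase: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer
```

Every input item is emitted exactly once, and memory use stays bounded by `buffer_size`. The trade-off is that the shuffle is only local: an item cannot be emitted before it has been read, so a larger buffer gives better mixing.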

Usage example (YAML):

dataset:
  _target_: ...IterableDataset
  path_or_dataset_id: delta://catalog.schema.training_data
  column_mapping:
    question: user_message
    answer: assistant_message
  delta_storage_options:
    DATABRICKS_TOKEN: ${oc.env:...}
    DATABRICKS_HOST: ${oc.env:...}

@copy-pr-bot

copy-pr-bot bot commented Dec 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa changed the title from "deltalake support" to "feat: deltalake dataset support" Dec 3, 2025
@akoumpa akoumpa linked an issue Dec 3, 2025 that may be closed by this pull request
@floraxhuang

floraxhuang commented Dec 10, 2025

Thanks for the PR! I tested this PR on a Databricks cluster and encountered a DeltaProtocolError. It appears the deltalake reader used here isn't compatible with tables that use Deletion Vectors or Column Mapping (which are now enabled by default on Databricks). Relevant discussions: delta-io/delta-rs#1094

  • Databricks runtime: 15.4 LTS

  • deltalake version: 1.2.1

  • Error Trace:

DeltaProtocolError: The table has set these reader features: {'deletionVectors'} but these are not yet supported by the deltalake reader.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from 3df865b to 0936f94 January 15, 2026 13:43
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from 99fb5e1 to b990139 January 15, 2026 15:28
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from 1c75572 to b8efd47 January 15, 2026 15:58
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from dd66497 to caa8a6c January 15, 2026 16:42
akoumpa and others added 2 commits January 15, 2026 16:43
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa changed the title from "feat: deltalake dataset support" to "feat: databricks deltalake dataset support" Jan 21, 2026
@akoumpa
Contributor Author

akoumpa commented Jan 21, 2026

/ok to test 937cf25

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa
Contributor Author

akoumpa commented Jan 21, 2026

/ok to test 7ca3800

@akoumpa
Contributor Author

akoumpa commented Jan 27, 2026

/ok to test a460cb3

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa
Contributor Author

akoumpa commented Jan 27, 2026

/ok to test 072146b

Contributor

@HuiyingLi HuiyingLi left a comment

lgtm thanks!


Development

Successfully merging this pull request may close these issues.

Support Databricks deltatable steaming data

6 participants