feat: databricks deltalake dataset support #920
base: main
|
Thanks for the PR! I tested this PR on a Databricks cluster and encountered a DeltaProtocolError. It appears the deltalake reader used here isn't compatible with tables that use Deletion Vectors or Column Mapping (which are now enabled by default on Databricks). Relevant discussions: delta-io/delta-rs#1094
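One way to confirm whether a given table is affected before pointing the reader at it is to look at the protocol action in the table's transaction log. The sketch below is only an illustration and is not part of this PR: it assumes a table on a local filesystem whose `_delta_log` still holds plain JSON commit files (Parquet checkpoints are ignored), and the helper names `latest_protocol` and `blocking_features` are hypothetical.

```python
import json
from pathlib import Path

# Feature names as they appear in the Delta protocol's readerFeatures list.
FEATURES_OF_INTEREST = {"deletionVectors", "columnMapping"}


def latest_protocol(table_path):
    """Return the most recent `protocol` action found in the JSON commit files.

    Assumption: local filesystem, JSON commits present; checkpoints are not read.
    """
    log_dir = Path(table_path) / "_delta_log"
    protocol = None
    # Commit files are named with zero-padded versions, so a name sort is a version sort.
    for commit in sorted(log_dir.glob("*.json")):
        if not commit.stem.isdigit():
            continue  # skip anything that is not a plain commit file
        with commit.open() as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)
                if "protocol" in action:
                    protocol = action["protocol"]  # later commits override earlier ones
    return protocol


def blocking_features(protocol):
    """List the reader features of interest that the table declares as required."""
    if protocol is None:
        return []
    declared = set(protocol.get("readerFeatures") or [])
    return sorted(declared & FEATURES_OF_INTEREST)
```

If `blocking_features(latest_protocol(path))` returns a non-empty list, a reader without support for those features would be expected to reject the table. Tables on older protocol versions do not list features by name, so an empty result here is not proof of compatibility.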
|
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
3df865b to 0936f94
99fb5e1 to b990139
1c75572 to b8efd47
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
dd66497 to caa8a6c
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
|
/ok to test 937cf25 |
|
/ok to test 7ca3800 |
|
/ok to test a460cb3 |
|
/ok to test 072146b |
HuiyingLi
left a comment
lgtm thanks!
This PR adds Delta Lake / Databricks support for streaming instruction datasets in NeMo Automodel.
Key changes:
- New `DeltaLakeDataset` `delta_lake_dataset.py:694-732` - A streaming dataset that reads from Delta Lake tables (local, S3, Azure, GCS, or Databricks Unity Catalog). Automatically selects the best backend.
- New `ColumnMappedTextInstructionIterableDataset` `column_mapped_text_instruction_iterable_dataset.py:108-119` - A streaming variant that accepts the `delta_storage_options` and `delta_sql_query` parameters and routes Delta paths to the new backend.
- Map-style dataset rejects Delta paths `column_mapped_text_instruction_dataset.py:239-240` - Forces users to use the streaming variant for Delta Lake sources.
- New `ReservoirSampler` `reservoir_sampler.py:21-34` - Bounded-memory shuffle for streaming datasets.
Usage example (YAML):