Make pull and read_dataset from Studio atomic #1573

Open

shcheklein wants to merge 22 commits into main from make-pull-atomic

Conversation


@shcheklein shcheklein commented Jan 31, 2026

Downloading a remote dataset (via datachain pull or read_dataset with a remote source) wrote directly to the final table. A crash mid-download left corrupt partial data with no cleanup path.

Fix: downloads now stage into a temporary table, then atomically swap it into place. Failures drop the temp table — the final table is never left in a partial state.
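
A minimal sketch of that stage-and-swap pattern, with hypothetical helper and table names (this is not the exact code in the diff):

```python
import sqlalchemy as sa


def pull_into_table(engine: sa.engine.Engine, final_name: str, fetch_batches) -> None:
    """Stage remote rows into a temp table, then swap it into place with one rename."""
    tmp_name = f"tmp_{final_name}"
    tmp_table = sa.Table(
        tmp_name, sa.MetaData(), sa.Column("sys__id", sa.Integer, primary_key=True)
    )
    tmp_table.create(engine)
    try:
        with engine.begin() as conn:
            for batch in fetch_batches():  # list[dict] batches; may raise mid-download
                conn.execute(tmp_table.insert(), batch)
        with engine.begin() as conn:
            # readers either see the previous state or the fully downloaded table
            conn.execute(sa.text(f'ALTER TABLE "{tmp_name}" RENAME TO "{final_name}"'))
    except Exception:
        tmp_table.drop(engine, checkfirst=True)  # a failed pull leaves no partial final table
        raise
```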

  • File lock per dataset version prevents concurrent downloads from corrupting each other (a rough illustration follows this list)
  • GC now detects and cleans up orphaned partial downloads (with a 1h grace period for in-flight ones)
  • SQLite retry logic narrowed to only retry on actual lock contention
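
A rough illustration of the per-version lock using the filelock package; the PR's actual helper lives in src/datachain/utils.py and may be implemented differently:

```python
import hashlib

from filelock import FileLock, Timeout


def dataset_version_lock(lock_dir: str, name: str, version: str) -> FileLock:
    """One lock file per dataset version, so concurrent pulls serialize."""
    key = hashlib.sha256(f"{name}@{version}".encode()).hexdigest()[:16]
    return FileLock(f"{lock_dir}/pull-{key}.lock")


# Usage: the first process downloads and swaps the table; others wait for it.
lock = dataset_version_lock("/tmp/datachain-locks", "my-dataset", "1.0.0")
try:
    with lock.acquire(timeout=600):
        ...  # stage into the temp table and rename, as sketched above
except Timeout:
    raise RuntimeError("another process is still pulling this dataset version")
```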

@shcheklein shcheklein self-assigned this Jan 31, 2026
@shcheklein shcheklein marked this pull request as draft January 31, 2026 22:06

cloudflare-workers-and-pages bot commented Jan 31, 2026

Deploying datachain with Cloudflare Pages

Latest commit: 879b7d2
Status: ✅  Deploy successful!
Preview URL: https://abb77e07.datachain-2g6.pages.dev
Branch Preview URL: https://make-pull-atomic.datachain-2g6.pages.dev



codecov bot commented Jan 31, 2026

Codecov Report

❌ Patch coverage is 89.23077% with 21 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/datachain/data_storage/sqlite.py | 31.57% | 12 Missing and 1 partial ⚠️
src/datachain/utils.py | 87.87% | 6 Missing and 2 partials ⚠️


yield source, source.ls(fields)

def pull_dataset( # noqa: C901, PLR0915
def pull_dataset( # noqa: C901, PLR0915, PLR0912
Contributor Author

Definitely needs a refactoring; that comes as a follow-up PR to this one ... including some cleanups to the progress bars

) -> sa.Table:
"""Creates a dataset rows table for the given dataset name and columns"""

def create_temp_dataset_table(
Contributor Author

@ilongin will some of these clash with the checkpoints PR?

Contributor

I think it won't clash. I do think that maybe this wrapper is not even needed, since you just call two methods in it and I think it's used only in one place, but that's minor.

@shcheklein shcheklein force-pushed the make-pull-atomic branch 6 times, most recently from 1a73529 to e69e554 on February 5, 2026 22:01
@shcheklein shcheklein requested a review from Copilot February 8, 2026 22:07
Copilot AI left a comment

Pull request overview

This PR is a WIP to make Studio-backed dataset pull / read_dataset more failure-safe by staging remote rows into temporary tables and cleaning up incomplete local state so retries can succeed after mid-flight failures.

Changes:

  • Reworked Catalog.pull_dataset() to stage into a tmp_ table, then rename/commit and mark the dataset version COMPLETE only at the end (the ordering is sketched after this list).
  • Updated DatasetRowsFetcher to insert into an arbitrary staging table instead of inserting directly into the final dataset table.
  • Added/expanded functional tests covering cleanup and successful retry after export/download/parse/insert failures and simulated process kills.
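
A sketch of that commit ordering with hypothetical method names (`rename_table` follows the Table-based signature described in the merge notes further down; `update_dataset_status` is illustrative):

```python
import sqlalchemy as sa


def commit_pulled_version(warehouse, metastore, tmp_table: sa.Table,
                          final_name: str, dataset, version: str) -> None:
    # Swap the fully staged rows into place first...
    warehouse.rename_table(tmp_table, final_name)
    # ...and only then flip the dataset version to COMPLETE, so a crash at any
    # earlier point leaves the version in a retryable, non-COMPLETE state.
    metastore.update_dataset_status(dataset, "COMPLETE", version=version)
```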

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

File | Description
tests/func/test_read_dataset_remote.py | Adds atomicity/cleanup and SIGKILL-retry functional tests for remote read_dataset.
tests/func/test_pull.py | Adds atomicity/cleanup-and-retry tests for pull_dataset under multiple failure modes.
src/datachain/data_storage/warehouse.py | Introduces generic temp-table creation, dataframe insertion by table name, and table rename helpers for staging/commit.
src/datachain/data_storage/sqlite.py | Removes the now-obsolete insert_dataset_rows path in the SQLite warehouse implementation.
src/datachain/catalog/catalog.py | Implements the staged temp-table download flow and commit sequence for atomic-ish pull_dataset.
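
The warehouse.py helpers above, as a rough interface sketch; names and signatures are inferred from this summary and the merge notes further down, not copied from the diff:

```python
import pandas as pd
import sqlalchemy as sa


class AbstractWarehouse:
    def create_temp_dataset_table(self, columns: list[sa.Column]) -> sa.Table:
        """Create a tmp_-prefixed rows table that staged downloads insert into."""
        raise NotImplementedError

    def insert_dataframe(self, table_name: str, df: pd.DataFrame) -> int:
        """Insert a dataframe into a table referenced by name (staging or final)."""
        raise NotImplementedError

    def rename_table(self, old_table: sa.Table, new_name: str) -> sa.Table:
        """Rename a table and return an sa.Table bound to the new name."""
        raise NotImplementedError
```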


@shcheklein shcheklein force-pushed the make-pull-atomic branch 4 times, most recently from 5c90d36 to def059c on February 11, 2026 02:02
Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.



shcheklein and others added 3 commits February 12, 2026 11:13
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.




Copilot AI commented Feb 12, 2026

@shcheklein I've opened a new pull request, #1585, to work on those changes. Once the pull request is ready, I'll request review from you.

…aise (#1585)

* Initial plan

* Use `except Exception:` instead of bare `except:` for better interrupt handling

Co-authored-by: shcheklein <3659196+shcheklein@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: shcheklein <3659196+shcheklein@users.noreply.github.com>
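
For context on the `except Exception:` change above: a bare `except:` also catches `KeyboardInterrupt` and `SystemExit` (they derive from `BaseException`), so Ctrl-C could get swallowed during cleanup. A tiny illustration with a made-up cleanup helper:

```python
def cleanup_staging(drop_temp_table) -> None:
    try:
        drop_temp_table()
    except Exception:
        # ordinary failures (OSError, database errors, ...) are ignored here,
        # but KeyboardInterrupt / SystemExit still propagate to the caller
        pass
```
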
Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.



Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.



Resolved conflicts:
- metastore.py: merged imports (timedelta from our branch + nullcontext from main)
- test_utils.py: kept both our interprocess_file_lock tests and main's
  checkpoint tests

Reconciled duplicate warehouse.rename_table:
- Removed our string-based rename_table(old_name, new_name) from warehouse.py
- Adopted main's Table-based rename_table(old_table, new_name) -> sa.Table
- Updated catalog.py to get_table() before calling rename_table()
- Metadata cache eviction now handled by db_engine.rename_table (sqlite.py)

Related API changes from main absorbed cleanly:
- copy_table -> insert_into rename
- insert_dataset_rows removal (already removed in our branch)
- create_table_from_query (new staging pattern for UDF tables)
- create_pre_udf_table now takes name parameter
- TableRenameError / TableMissingError in error.py
- is_temp_table_name no longer matches UDF prefix
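
The rename_table reconciliation above roughly amounts to the following call pattern (table names here are made up for illustration):

```python
# Before (this branch): string-based rename
#   warehouse.rename_table("tmp_abc123", "ds_my_dataset_v1")

# After (main's API): resolve the sa.Table first, rename via the Table object,
# and keep using the returned table, which is bound to the new name.
tmp_table = warehouse.get_table("tmp_abc123")
final_table = warehouse.rename_table(tmp_table, "ds_my_dataset_v1")
```
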
Comment on lines +207 to +211
lock.release()
try:
os.remove(lock_path)
except OSError:
pass
Contributor

Lock file is deleted after release, which can break inter-process exclusion. Removing the lock file after release() creates a race where another process can acquire a lock on the old inode while a third process recreates the path and acquires a second lock, breaking mutual exclusion.
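
One common way to make unlink-and-release safe is to verify, after acquiring, that the path still refers to the inode that was locked; a sketch of that pattern (illustrative only, not the code in this PR):

```python
import fcntl
import os


def acquire_path_lock(lock_path: str) -> int:
    """Block until we hold an exclusive flock on the file currently at lock_path."""
    while True:
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            if os.fstat(fd).st_ino == os.stat(lock_path).st_ino:
                return fd  # locked the live file, not a stale unlinked inode
        except FileNotFoundError:
            pass  # file was unlinked while we waited; retry against the new one
        os.close(fd)  # releases the flock on the stale inode
```

With that check in place, the holder can unlink the path before closing the descriptor: any waiter that ends up locking the stale inode notices the mismatch and retries against the freshly created file.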

projects.
"""

def get_dataset_by_version_uuid(
Contributor

Do we need this to be an abstract method?

