Skip to content

feat: implement 2PC support in parquet writer#959

Open
siiddhantt wants to merge 18 commits into
datazip-inc:stagingfrom
siiddhantt:feat/parquet-2pc
Open

feat: implement 2PC support in parquet writer#959
siiddhantt wants to merge 18 commits into
datazip-inc:stagingfrom
siiddhantt:feat/parquet-2pc

Conversation

@siiddhantt

@siiddhantt siiddhantt commented May 20, 2026

Copy link
Copy Markdown
Contributor

Description

Fixes #856

Adds table-local 2PC metadata for the Parquet destination under:

<namespace>/<table>/_olake_2pc/

The goal is to make Parquet recover from destination-side committed progress instead of relying only on the external source state file. This is needed across full refresh, incremental, and CDC syncs:

  • full refresh needs destination-side tracking for chunk/thread IDs that were already committed, so retries do not rewrite completed chunks
  • incremental and CDC need committed cursor/LSN metadata to be recoverable if OLake fails before saving the external state file
  • no-op CDC/incremental syncs still need a destination-side progress record when the cursor/LSN advances without producing Parquet data files
  • interrupted Parquet writes should remain hidden under _olake_2pc instead of leaving final-path orphan files for later retries to reason about

This PR keeps the main change inside destination/parquet and uses hidden per-thread staging plus a completed marker:

_olake_2pc/<escaped-thread-id>.staging/_completed.json

Protocol:

  • data files are first written under _olake_2pc/<escaped-thread-id>.staging/
  • after staged files are durable, Close() writes _completed.json
  • staged Parquet files are then promoted to the final table paths
  • _completed.json is kept as the durable Parquet-local progress marker for that thread
  • Setup() calls load2PCState() to recover table-local metadata before a retry starts
  • if setup finds staging with _completed.json, it treats that thread as committed and promotes any remaining staged files
  • if setup finds staging without _completed.json, it treats that attempt as incomplete and removes the hidden staging prefix before retry

For full refresh, _completed.json can be {} because the committed thread/chunk ID is encoded in the staging folder name. For CDC, incremental, and no-op progress, _completed.json contains the optional types.MetadataState, tying cursor/LSN progress to the same completed marker.

Implementation details:

  • added parquetDataFile so close, upload, and staging promotion use one file identity instead of reconstructing paths in multiple places
  • added metadataState() to normalize destination final state into the same types.MetadataState shape used by destination recovery
  • added local and S3 helpers for completed marker reads/writes, staging discovery, staging promotion, and incomplete staging cleanup
  • removed the earlier Parquet-local state.json, commits/*.json, preparing/*.json, and prepare-marker file-list protocol
  • added metadata-only completed marker support so CDC/incremental no-op syncs can still persist destination-committed cursor/LSN state
  • added a narrow Postgres CDC recovery guard so a retry can accept Parquet’s destination-committed LSN when the external source state file is stale and the replication slot is already at that committed LSN

The completed marker is the single Parquet-local progress record. If the process fails before _completed.json is written, setup treats staging as incomplete and removes it. If the process fails after _completed.json is written but before promotion finishes, setup rolls that completed staging forward.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Parquet 2PC unit tests: go test ./destination/parquet -count=1
  • Destination package tests: go test ./destination/... -count=1
  • Core state/parser tests: go test ./types ./pkg/parser -count=1
  • Postgres non-container unit tests: go test ./drivers/postgres/internal -run TestConfig -count=1
  • Binary compile: go build -o /private/tmp/olake-build-check .
  • Parquet lint: golangci-lint run ./destination/parquet
  • Manual Postgres-backed demo:
    • verified interrupted full-refresh retry completes with no incomplete staging dirs and no staged Parquet files left behind
    • verified full-refresh completed markers are written for committed chunks
    • verified incremental retry recovers cursor state from Parquet completed markers
    • verified CDC retry accepts destination-committed LSN when external source state is stale
    • verified no-op/metadata-only completed marker behavior
    • verified completed staging is promoted on recovery
    • verified incomplete hidden staging is removed before retry

Screenshots or Recordings

WIP

Documentation

  • Documentation Link: [link to README, olake.io/docs, or olake-docs]
  • N/A (bug fix, refactor, or test changes only)

Related PR's (If Any):

@siiddhantt siiddhantt requested a deployment to integration_tests May 20, 2026 13:57 — with GitHub Actions Waiting
@siiddhantt siiddhantt requested a deployment to integration_tests May 20, 2026 13:57 — with GitHub Actions Waiting
@siiddhantt siiddhantt marked this pull request as ready for review May 20, 2026 14:25
@siiddhantt siiddhantt marked this pull request as draft May 21, 2026 10:11
@siiddhantt siiddhantt requested a deployment to integration_tests May 27, 2026 07:27 — with GitHub Actions Waiting
@siiddhantt siiddhantt requested a deployment to integration_tests May 27, 2026 07:27 — with GitHub Actions Waiting
@siiddhantt siiddhantt marked this pull request as ready for review May 27, 2026 09:51
@siiddhantt

Copy link
Copy Markdown
Contributor Author

Updated the PR with the _completed.json-inside-staging approach discussed in #856.

Current Parquet 2PC shape:

<namespace>/<table>/_olake_2pc/<escaped-thread-id>.staging/_completed.json

Summary of the refactor:

  • kept _completed.json as the durable completed marker inside each staging folder
  • full refresh uses {} markers and derives committed chunk/thread IDs from the staging folder name
  • CDC/incremental/no-op progress stores optional types.MetadataState in _completed.json
  • setup promotes completed staging and removes incomplete staging before retry

Also reran the focused/unit/build/lint checks and the Postgres-backed manual demo for full refresh, incremental, and CDC recovery.

@siiddhantt

Copy link
Copy Markdown
Contributor Author

@nayanj98 looping you in as @vaibhav-datazip suggested.

This PR is ready for review now and the description is updated with the final flow and test coverage.

@nayanj98

nayanj98 commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

@siiddhantt Assigning @vaibhav-datazip to review your PR.

@nayanj98 nayanj98 requested a review from vaibhav-datazip June 9, 2026 12:14
@siiddhantt

Copy link
Copy Markdown
Contributor Author

@vaibhav-datazip I found the CI failure. It looks test-harness related rather than a 2PC flow failure.

The integration cleanup helper deletes only final .parquet files between Parquet subtests, but now _olake_2pc is part of the table-local destination state. That leaves stale _completed.json markers behind, so the next subtest can skip already-committed chunks while the actual data files were manually removed by the test helper.

I’ll push a small fix to clear _olake_2pc/ in the Parquet integration cleanup as well, matching how DropStreams() clears the full table prefix.

@siiddhantt siiddhantt temporarily deployed to integration_tests June 10, 2026 14:18 — with GitHub Actions Inactive
@siiddhantt siiddhantt temporarily deployed to integration_tests June 10, 2026 14:19 — with GitHub Actions Inactive
Comment thread destination/parquet/parquet.go Outdated
}

func (p *Parquet) closePqFiles(ctx context.Context, _ any, closeOnError bool) error {
func (p *Parquet) dataFiles() []parquetDataFile {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rename this function to a more appropriate name, also add a comment for this function

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, renamed this to stagedDataFiles() and added a short comment. The helper returns the current thread’s staged parquet file identities used during close/upload/promotion, so the old dataFiles() name was too generic.

Comment on lines -149 to +152
if len(remainingStreams) > 0 {
// Destination 2PC metadata can be ahead of the external source state when
// the destination commit succeeded but saving the source state failed. If all
// streams are already committed and the slot is exactly at that committed LSN,
// there is no WAL left to replay; PostCDC will persist the recovered state.
skipValidation := recoveryLSN != nil && len(remainingStreams) == 0 && *recoveryLSN == slot.LSN
if !skipValidation {
// validate global state (might got invalid during full load)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the need of this change ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the stale external state recovery case.

Scenario:

  • Parquet commits destination progress at LSN B in _completed.json
  • the Postgres slot is already at B
  • OLake crashes before the external source state file is saved, so the next run still starts from old state A

Without this guard, Postgres fails on the A vs B LSN mismatch before destination metadata can repair the state. The skip is intentionally narrow: it only applies when destination metadata is ahead, all streams are already committed, and the replication slot is exactly at the destination-committed LSN. Other mismatches still fail as before.

Comment thread destination/parquet/parquet_2pc.go Outdated
return filepath.Join(p.local2PCPath(), p.stagingDirName(threadID))
}

func (p *Parquet) stagingDataDir(basePath string) string {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change the variable name of the argument, also can you add short comments for the functions in this file

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, renamed the argument to avoid confusion with p.basePath, and added short comments around the 2PC helpers.

}

func (p *Parquet) listS3CompletedMarkers(ctx context.Context) ([]parquet2PCCompletedMarker, error) {
prefix := p.s3ObjectPath(filepath.Join(p.basePath, parquet2PCDir)) + "/"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we saving staging files under _olake_2pc folder and not directly in the root folder

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping staging under _olake_2pc is intentional. Staged files are not committed table data yet, so placing them under the table root can make raw Parquet readers see partial/uncommitted files. _olake_2pc keeps staging hidden but still table-local, and DropStreams() clears it naturally with the table prefix.

Comment thread destination/parquet/parquet_2pc.go Outdated
type parquet2PCCompletedMarker struct {
ThreadID string
MetadataState *types.MetadataState
modTime time.Time

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need modTime ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need an ordering value because CDC/incremental/no-op syncs can leave multiple completed markers over time, and setup needs to restore the latest metadata state. I renamed modTime to completedAt and added a short comment to make that purpose clearer.

Comment thread destination/parquet/parquet_2pc.go Outdated
Comment on lines +459 to +476
err := p.s3Client.ListObjectsPagesWithContext(ctx, &s3.ListObjectsInput{
Bucket: aws.String(p.config.Bucket),
Prefix: aws.String(prefix),
}, func(page *s3.ListObjectsOutput, _ bool) bool {
keys := make([]string, 0, len(page.Contents))
for _, obj := range page.Contents {
if obj.Key != nil {
keys = append(keys, *obj.Key)
}
}
if len(keys) == 0 {
return true
}
concurrency := min(len(keys), 8)
pageErr = utils.Concurrent(ctx, keys, concurrency, func(_ context.Context, key string, _ int) error {
return p.deleteS3Object(ctx, key)
})
return pageErr == nil

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retries should be added in this as well as other s3 related functions as well like we have in clearS3files

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, added retries for the new S3 2PC operations using the same RetryWithSkip + isRateLimitError pattern used by existing Parquet S3 cleanup. This now covers staging upload, marker read/write, listing, copy, and delete paths.

Comment thread destination/parquet/parquet_2pc.go Outdated
Comment on lines +172 to +175
marker, err := p.readLocalCompletedMarker(entry.Name())
if os.IsNotExist(err) {
continue
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we create a map here only and not depend on list staging dirs function at all, we will mark the folders without completed marker as false otherwise true and data will be there

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, refactored this into a single staging-entry map. Each staging dir now maps to either a completed marker or nil, so setup can promote committed staging and delete incomplete staging without separately listing completed markers and staging dirs.

@vaibhav-datazip

Copy link
Copy Markdown
Collaborator

@siiddhantt once you are done with your changes after the review , please re-request the review from here so that I know the pr is ready again
Screenshot 2026-06-18 at 11 22 04 AM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants