fix: add primary keys to stats tables to prevent ctid-based update conflicts #1

Open
fuziontech wants to merge 1 commit into main from fix/add-primary-key-to-table-stats


fuziontech commented Dec 12, 2025

Summary

Adds PRIMARY KEY constraints and indexes to DuckLake metadata tables to prevent PostgreSQL serialization failures under concurrent writes and improve query performance.

Problem

When using PostgreSQL as the metadata catalog with multiple concurrent writers (e.g., 8 Kafka Connect tasks writing to the same table), we see frequent errors:

ERROR: could not serialize access due to concurrent update
Exceeded the maximum retry count of 10 set by the ducklake_max_retry_count setting.

Root Cause

Many DuckLake metadata tables lack primary keys. When postgres_scanner executes UPDATE/DELETE statements on a table without a primary key, it falls back to ctid (PostgreSQL's physical row identifier) to locate rows. Under PostgreSQL's MVCC, every UPDATE writes a new row version with a new ctid, so a ctid read by one writer no longer identifies the row once another writer has updated it, and concurrent updates fail.
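The ctid behavior described above can be observed directly in psql. The following is a minimal sketch, not taken from this PR: `ctid_demo` is a hypothetical scratch table, and the exact ctid values depend on the state of the table.

```sql
-- Hypothetical scratch table for illustration; not part of the DuckLake schema.
CREATE TEMP TABLE ctid_demo (id integer, n integer);
INSERT INTO ctid_demo VALUES (1, 0);

-- ctid is the physical location (block, offset) of the current row version.
SELECT ctid, id, n FROM ctid_demo;

-- MVCC writes a new row version rather than overwriting the old one in place.
UPDATE ctid_demo SET n = n + 1 WHERE id = 1;

-- The same logical row (id = 1) now reports a different ctid than before.
SELECT ctid, id, n FROM ctid_demo;

DROP TABLE ctid_demo;
```

A key column such as `id` stays stable across updates, which is why addressing rows by primary key avoids depending on where the row currently lives.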

Changes

Primary Keys Added

| Table | Primary Key |
|-------|-------------|
| ducklake_table_stats | table_id |
| ducklake_table_column_stats | (table_id, column_id) |
| ducklake_file_column_stats | (data_file_id, column_id) |
| ducklake_partition_info | partition_id |
| ducklake_partition_column | (partition_id, partition_key_index) |
| ducklake_file_partition_value | (data_file_id, partition_key_index) |
| ducklake_files_scheduled_for_deletion | data_file_id |
| ducklake_inlined_data_tables | (table_id, schema_version) |
| ducklake_column_mapping | mapping_id |
| ducklake_name_mapping | (mapping_id, column_id) |
| ducklake_macro_impl | (macro_id, impl_id) |
| ducklake_macro_parameters | (macro_id, impl_id, column_id) |

Indexes Added

For frequently queried columns (especially table_id lookups in data file queries):

| Index | Columns |
|-------|---------|
| idx_data_file_table_snapshot | (table_id, begin_snapshot, end_snapshot) |
| idx_delete_file_table_snapshot | (table_id, begin_snapshot, end_snapshot) |
| idx_column_table | (table_id, end_snapshot) |
| idx_file_column_stats_table | (table_id, column_id) |
| idx_partition_info_table | (table_id) |
| idx_partition_column_table | (table_id) |
| idx_file_partition_value_table | (table_id) |

Migration for Existing Catalogs

For existing PostgreSQL metadata catalogs, run these ALTER TABLE and CREATE INDEX statements:

-- Primary Keys
ALTER TABLE ducklake_table_stats ADD PRIMARY KEY (table_id);
ALTER TABLE ducklake_table_column_stats ADD PRIMARY KEY (table_id, column_id);
ALTER TABLE ducklake_file_column_stats ADD PRIMARY KEY (data_file_id, column_id);
ALTER TABLE ducklake_partition_info ADD PRIMARY KEY (partition_id);
ALTER TABLE ducklake_partition_column ADD PRIMARY KEY (partition_id, partition_key_index);
ALTER TABLE ducklake_file_partition_value ADD PRIMARY KEY (data_file_id, partition_key_index);
ALTER TABLE ducklake_files_scheduled_for_deletion ADD PRIMARY KEY (data_file_id);
ALTER TABLE ducklake_inlined_data_tables ADD PRIMARY KEY (table_id, schema_version);
ALTER TABLE ducklake_column_mapping ADD PRIMARY KEY (mapping_id);
ALTER TABLE ducklake_name_mapping ADD PRIMARY KEY (mapping_id, column_id);
ALTER TABLE ducklake_macro_impl ADD PRIMARY KEY (macro_id, impl_id);
ALTER TABLE ducklake_macro_parameters ADD PRIMARY KEY (macro_id, impl_id, column_id);

-- Indexes
CREATE INDEX idx_data_file_table_snapshot ON ducklake_data_file(table_id, begin_snapshot, end_snapshot);
CREATE INDEX idx_delete_file_table_snapshot ON ducklake_delete_file(table_id, begin_snapshot, end_snapshot);
CREATE INDEX idx_column_table ON ducklake_column(table_id, end_snapshot);
CREATE INDEX idx_file_column_stats_table ON ducklake_file_column_stats(table_id, column_id);
CREATE INDEX idx_partition_info_table ON ducklake_partition_info(table_id);
CREATE INDEX idx_partition_column_table ON ducklake_partition_column(table_id);
CREATE INDEX idx_file_partition_value_table ON ducklake_file_partition_value(table_id);
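ADD PRIMARY KEY fails if the table already contains duplicate or NULL values in the key columns, so it may be worth checking an existing catalog before running the migration. The sketch below is a suggestion rather than part of this PR; it checks one table from the list above and would need to be repeated for each table and key.

```sql
-- Any row returned here would make ADD PRIMARY KEY fail with a unique violation.
SELECT table_id, column_id, count(*) AS copies
FROM ducklake_table_column_stats
GROUP BY table_id, column_id
HAVING count(*) > 1;

-- Primary key columns must also be non-null; this count should be 0.
SELECT count(*) AS null_keys
FROM ducklake_table_column_stats
WHERE table_id IS NULL OR column_id IS NULL;

-- Optional variant for a live catalog: build the index without blocking writes.
-- CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_data_file_table_snapshot
    ON ducklake_data_file (table_id, begin_snapshot, end_snapshot);
```

Whether the concurrent variant is worth the slower build depends on how busy the catalog is; on an idle catalog the plain CREATE INDEX statements above are simpler.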

Test Plan

  • Existing tests pass
  • Deploy to prod with 8 concurrent Kafka Connect tasks and verify no serialization errors
  • Verify query performance improvement on large catalogs
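One possible way to confirm that the migration was applied is to list the primary keys and indexes from the PostgreSQL system catalogs. This is a sketch that assumes the catalog tables keep the `ducklake_` prefix and the new indexes keep the `idx_` prefix used in this PR.

```sql
-- List primary key constraints on DuckLake metadata tables.
SELECT t.relname AS table_name,
       pg_get_constraintdef(c.oid) AS definition
FROM pg_constraint AS c
JOIN pg_class AS t ON t.oid = c.conrelid
WHERE c.contype = 'p'
  AND t.relname LIKE 'ducklake\_%'
ORDER BY t.relname;

-- List the secondary indexes added by the migration.
SELECT tablename, indexname
FROM pg_indexes
WHERE tablename LIKE 'ducklake\_%'
  AND indexname LIKE 'idx\_%'
ORDER BY tablename, indexname;
```

The first query should return one row per table in the Primary Keys Added list, and the second one row per index in the Indexes Added list.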

🤖 Generated with Claude Code

…rite safety

When using PostgreSQL as the metadata catalog with multiple concurrent
writers (e.g., Kafka Connect tasks), UPDATE/DELETE statements on tables
without primary keys cause postgres_scanner to use ctid (physical row ID)
for row identification. This leads to serialization failures because ctid
changes when rows are updated due to PostgreSQL's MVCC.

## Primary Keys Added

Tables that previously had no primary key now have one:

| Table | Primary Key |
|-------|-------------|
| ducklake_table_stats | table_id |
| ducklake_table_column_stats | (table_id, column_id) |
| ducklake_file_column_stats | (data_file_id, column_id) |
| ducklake_partition_info | partition_id |
| ducklake_partition_column | (partition_id, partition_key_index) |
| ducklake_file_partition_value | (data_file_id, partition_key_index) |
| ducklake_files_scheduled_for_deletion | data_file_id |
| ducklake_inlined_data_tables | (table_id, schema_version) |
| ducklake_column_mapping | mapping_id |
| ducklake_name_mapping | (mapping_id, column_id) |
| ducklake_macro_impl | (macro_id, impl_id) |
| ducklake_macro_parameters | (macro_id, impl_id, column_id) |

## Indexes Added

For frequently queried columns (especially table_id lookups):

- idx_data_file_table_snapshot: (table_id, begin_snapshot, end_snapshot)
- idx_delete_file_table_snapshot: (table_id, begin_snapshot, end_snapshot)
- idx_column_table: (table_id, end_snapshot)
- idx_file_column_stats_table: (table_id, column_id)
- idx_partition_info_table: (table_id)
- idx_partition_column_table: (table_id)
- idx_file_partition_value_table: (table_id)

This eliminates "could not serialize access due to concurrent update"
errors and improves query performance for table-scoped operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>