fix: add primary keys to stats tables to prevent ctid-based update conflicts by fuziontech · Pull Request #1 · PostHog/ducklake

fuziontech · 2025-12-12T00:05:46Z

Summary

Adds PRIMARY KEY constraints and indexes to DuckLake metadata tables to prevent PostgreSQL serialization failures under concurrent writes and improve query performance.

Problem

When using PostgreSQL as the metadata catalog with multiple concurrent writers (e.g., 8 Kafka Connect tasks writing to the same table), we see frequent errors:

ERROR: could not serialize access due to concurrent update
Exceeded the maximum retry count of 10 set by the ducklake_max_retry_count setting.

Root Cause

Many DuckLake metadata tables lack primary keys. When postgres_scanner executes UPDATE/DELETE statements on tables without primary keys, it falls back to using ctid (PostgreSQL's physical row identifier) to locate rows. The ctid changes whenever a row is updated (due to PostgreSQL's MVCC), causing concurrent updates to fail.
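
The ctid behaviour described above can be observed directly in psql. This is a minimal sketch: `ctid_demo` and its columns are hypothetical scratch names, not part of the DuckLake schema, and the snippet shows only that an UPDATE moves a row to a new ctid, not the full concurrent failure.

```sql
-- Hypothetical scratch table for illustration; not a DuckLake metadata table.
CREATE TEMP TABLE ctid_demo (id BIGINT, n BIGINT);
INSERT INTO ctid_demo VALUES (1, 0);

-- Record the physical location of the row.
SELECT ctid, id, n FROM ctid_demo;

-- Under MVCC, an UPDATE writes a new row version instead of modifying in place.
UPDATE ctid_demo SET n = n + 1 WHERE id = 1;

-- The same logical row now reports a different ctid.
SELECT ctid, id, n FROM ctid_demo;
```

A writer that identified the row by the first ctid would no longer find it at that location after the update, whereas a lookup by a key column such as `id` is unaffected.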

Changes

Primary Keys Added

Table	Primary Key
ducklake_table_stats	table_id
ducklake_table_column_stats	(table_id, column_id)
ducklake_file_column_stats	(data_file_id, column_id)
ducklake_partition_info	partition_id
ducklake_partition_column	(partition_id, partition_key_index)
ducklake_file_partition_value	(data_file_id, partition_key_index)
ducklake_files_scheduled_for_deletion	data_file_id
ducklake_inlined_data_tables	(table_id, schema_version)
ducklake_column_mapping	mapping_id
ducklake_name_mapping	(mapping_id, column_id)
ducklake_macro_impl	(macro_id, impl_id)
ducklake_macro_parameters	(macro_id, impl_id, column_id)

Indexes Added

For frequently queried columns (especially table_id lookups in data file queries):

Index	Columns
idx_data_file_table_snapshot	(table_id, begin_snapshot, end_snapshot)
idx_delete_file_table_snapshot	(table_id, begin_snapshot, end_snapshot)
idx_column_table	(table_id, end_snapshot)
idx_file_column_stats_table	(table_id, column_id)
idx_partition_info_table	(table_id)
idx_partition_column_table	(table_id)
idx_file_partition_value_table	(table_id)

Migration for Existing Catalogs

For existing PostgreSQL metadata catalogs, run these ALTER statements:

-- Primary Keys
ALTER TABLE ducklake_table_stats ADD PRIMARY KEY (table_id);
ALTER TABLE ducklake_table_column_stats ADD PRIMARY KEY (table_id, column_id);
ALTER TABLE ducklake_file_column_stats ADD PRIMARY KEY (data_file_id, column_id);
ALTER TABLE ducklake_partition_info ADD PRIMARY KEY (partition_id);
ALTER TABLE ducklake_partition_column ADD PRIMARY KEY (partition_id, partition_key_index);
ALTER TABLE ducklake_file_partition_value ADD PRIMARY KEY (data_file_id, partition_key_index);
ALTER TABLE ducklake_files_scheduled_for_deletion ADD PRIMARY KEY (data_file_id);
ALTER TABLE ducklake_inlined_data_tables ADD PRIMARY KEY (table_id, schema_version);
ALTER TABLE ducklake_column_mapping ADD PRIMARY KEY (mapping_id);
ALTER TABLE ducklake_name_mapping ADD PRIMARY KEY (mapping_id, column_id);
ALTER TABLE ducklake_macro_impl ADD PRIMARY KEY (macro_id, impl_id);
ALTER TABLE ducklake_macro_parameters ADD PRIMARY KEY (macro_id, impl_id, column_id);

-- Indexes
CREATE INDEX idx_data_file_table_snapshot ON ducklake_data_file(table_id, begin_snapshot, end_snapshot);
CREATE INDEX idx_delete_file_table_snapshot ON ducklake_delete_file(table_id, begin_snapshot, end_snapshot);
CREATE INDEX idx_column_table ON ducklake_column(table_id, end_snapshot);
CREATE INDEX idx_file_column_stats_table ON ducklake_file_column_stats(table_id, column_id);
CREATE INDEX idx_partition_info_table ON ducklake_partition_info(table_id);
CREATE INDEX idx_partition_column_table ON ducklake_partition_column(table_id);
CREATE INDEX idx_file_partition_value_table ON ducklake_file_partition_value(table_id);
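
Before running the ALTER statements on a live catalog, it may be worth checking that the existing rows already satisfy the new constraints, since `ADD PRIMARY KEY` fails if the key columns contain duplicates or NULLs. The following is a sketch for one table, using the key columns listed above; the same pattern would need to be repeated per table, and the result aliases are arbitrary.

```sql
-- Any row returned here would cause
-- ALTER TABLE ducklake_table_column_stats ADD PRIMARY KEY (table_id, column_id) to fail.
SELECT table_id, column_id, COUNT(*) AS copies
FROM ducklake_table_column_stats
GROUP BY table_id, column_id
HAVING COUNT(*) > 1;

-- A non-zero count here would also block the primary key,
-- because primary key columns must be NOT NULL.
SELECT COUNT(*) AS null_keys
FROM ducklake_table_column_stats
WHERE table_id IS NULL OR column_id IS NULL;
```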

Test Plan

Existing tests pass
Deploy to prod with 8 concurrent Kafka Connect tasks and verify no serialization errors
Verify query performance improvement on large catalogs
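
One possible way to confirm the migration took effect on a catalog is to list the constraints and indexes from the PostgreSQL system views. This is a sketch, assuming the metadata tables live in a schema on the current search path and keep their default `ducklake` name prefix.

```sql
-- List primary key columns for the DuckLake metadata tables.
SELECT tc.table_name, kcu.column_name, kcu.ordinal_position
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu
  ON kcu.constraint_schema = tc.constraint_schema
 AND kcu.constraint_name = tc.constraint_name
 AND kcu.table_name = tc.table_name
WHERE tc.constraint_type = 'PRIMARY KEY'
  AND tc.table_name LIKE 'ducklake%'
ORDER BY tc.table_name, kcu.ordinal_position;

-- List the secondary indexes added by this change.
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename LIKE 'ducklake%'
  AND indexname LIKE 'idx%'
ORDER BY indexname;
```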

🤖 Generated with Claude Code

…rite safety

When using PostgreSQL as the metadata catalog with multiple concurrent writers (e.g., Kafka Connect tasks), UPDATE/DELETE statements on tables without primary keys cause postgres_scanner to use ctid (physical row ID) for row identification. This leads to serialization failures because ctid changes when rows are updated due to PostgreSQL's MVCC.

## Primary Keys Added

Tables that previously had no primary key now have one:

| Table | Primary Key |
|-------|-------------|
| ducklake_table_stats | table_id |
| ducklake_table_column_stats | (table_id, column_id) |
| ducklake_file_column_stats | (data_file_id, column_id) |
| ducklake_partition_info | partition_id |
| ducklake_partition_column | (partition_id, partition_key_index) |
| ducklake_file_partition_value | (data_file_id, partition_key_index) |
| ducklake_files_scheduled_for_deletion | data_file_id |
| ducklake_inlined_data_tables | (table_id, schema_version) |
| ducklake_column_mapping | mapping_id |
| ducklake_name_mapping | (mapping_id, column_id) |
| ducklake_macro_impl | (macro_id, impl_id) |
| ducklake_macro_parameters | (macro_id, impl_id, column_id) |

## Indexes Added

For frequently queried columns (especially table_id lookups):

- idx_data_file_table_snapshot: (table_id, begin_snapshot, end_snapshot)
- idx_delete_file_table_snapshot: (table_id, begin_snapshot, end_snapshot)
- idx_column_table: (table_id, end_snapshot)
- idx_file_column_stats_table: (table_id, column_id)
- idx_partition_info_table: (table_id)
- idx_partition_column_table: (table_id)
- idx_file_partition_value_table: (table_id)

This eliminates "could not serialize access due to concurrent update" errors and improves query performance for table-scoped operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fuziontech force-pushed the fix/add-primary-key-to-table-stats branch from d1b7597 to a61a2d2 Compare December 12, 2025 00:16

fuziontech mentioned this pull request May 18, 2026

[v1.5] fix: add primary keys and indexes to metadata tables for concurrent write safety #14

Open

fuziontech wants to merge 1 commit into main from fix/add-primary-key-to-table-stats
