Feature Request: Support for add_files Functionality
Problem Statement
Currently, pg_lake does not provide a way to register existing Parquet/ORC/Avro files into an Iceberg table without rewriting the data. This is a common use case when:
- Migrating existing data lakes to Iceberg
- Integrating external data that's already in optimal file formats
- Building incremental pipelines where data is written by other systems
- Avoiding unnecessary data rewrites for cost and performance reasons
While PostgreSQL's COPY command can import data, it rewrites the underlying files, which defeats the purpose of registering existing data in place.
Proposed Solution
Iceberg's Java and Python APIs provide an add_files method that allows programmatic registration of existing data files into the catalog without rewriting them. I'd like to request similar functionality in pg_lake, preferably through one of these approaches:
Option 1: SQL Procedure/Function
-- Register existing files into an Iceberg table
CALL iceberg.add_files(
    table_name := 'my_table',
    file_paths := ARRAY[
        's3://bucket/data/file1.parquet',
        's3://bucket/data/file2.parquet'
    ]
);

Option 2: Catalog Interoperability
Expose pg_lake's catalog in a way that's compatible with PyIceberg's SqlCatalog, or provide a REST catalog interface, so that users can perform metadata operations like add_files through PyIceberg while querying the data through pg_lake.
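For illustration, here is a rough sketch of what Option 2 could look like from the client side, assuming pg_lake's catalog were reachable as a PyIceberg SqlCatalog; the catalog name, connection URI, warehouse location, and table identifier below are placeholders, not confirmed pg_lake configuration:

# Hypothetical sketch: registering existing files through PyIceberg,
# assuming pg_lake's catalog could be loaded as a PyIceberg SqlCatalog.
# The catalog name, URI, warehouse, and table identifier are placeholders.
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "pg_lake",
    uri="postgresql+psycopg2://user:pass@localhost:5432/mydb",
    warehouse="s3://bucket/warehouse",
)

table = catalog.load_table("public.my_table")

# add_files registers the files in table metadata without rewriting the data
table.add_files(file_paths=[
    "s3://bucket/data/file1.parquet",
    "s3://bucket/data/file2.parquet",
])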
Use Case Example
I have a data lake with thousands of existing Parquet files organized by date. I want to:
- Create an Iceberg table in pg_lake with the appropriate schema
- Register the existing Parquet files without copying/rewriting them
- Query the data through PostgreSQL using pg_lake
- Continue adding new files as they arrive (see the sketch below)
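To make that workflow concrete, here is a rough sketch of what the incremental registration loop could look like if Option 1 existed; the DDL, the iceberg.add_files procedure, and the connection details are placeholders rather than actual pg_lake syntax:

# Hypothetical sketch of the workflow above, written against the proposed
# iceberg.add_files procedure from Option 1. The DDL, procedure name, and
# connection string are placeholders, not confirmed pg_lake syntax.
import psycopg

new_files = [
    "s3://bucket/data/dt=2024-06-01/part-000.parquet",
    "s3://bucket/data/dt=2024-06-01/part-001.parquet",
]

with psycopg.connect("dbname=mydb user=me") as conn:
    with conn.cursor() as cur:
        # 1. Create the Iceberg table with a schema matching the files
        #    (placeholder DDL; actual pg_lake syntax may differ)
        cur.execute("""
            CREATE TABLE IF NOT EXISTS my_table (
                id bigint,
                event_time timestamptz,
                value double precision
            ) USING iceberg
        """)
        # 2. Register newly arrived files in place; no data is copied or rewritten
        cur.execute(
            "CALL iceberg.add_files(table_name := %s, file_paths := %s)",
            ("my_table", new_files),
        )
    # psycopg's connection context manager commits on successful exit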
Related Issues
This is related to #41 regarding catalog interoperability. Having either native add_files support or a way to use PyIceberg with pg_lake's catalog would solve this use case.
Benefits
- Enables zero-copy data lake migrations to Iceberg
- Reduces storage costs and migration time
- Allows pg_lake to integrate with existing data pipelines
- Provides feature parity with standard Iceberg tooling
Would love to hear the maintainers' thoughts on this! Happy to provide more details or help test if this feature is being considered.