Single source Iceberg #1032

Draft. Wants to merge 6 commits into base: beta

Conversation

kbatuigas
Contributor

Description

Resolves https://redpandadata.atlassian.net/browse/
Review deadline:

Page previews

Checks

  • New feature
  • Content gap
  • Support Follow-up
  • Small fix (typos, links, copyedits, etc)


netlify bot commented Mar 25, 2025

Deploy Preview for redpanda-docs-preview ready!

Name Link
🔨 Latest commit 4b6dd3a
🔍 Latest deploy log https://app.netlify.com/sites/redpanda-docs-preview/deploys/67e35e6b247b7900083ce90d
😎 Deploy Preview https://deploy-preview-1032--redpanda-docs-preview.netlify.app

@@ -0,0 +1,471 @@
The Apache Iceberg integration for Redpanda allows you to store topic data in the cloud in the Iceberg open table format. This makes your streaming data immediately available in downstream analytical systems, including data warehouses like Snowflake, Databricks, ClickHouse, and Redshift, without setting up and maintaining additional ETL pipelines. You can also integrate your data directly into commonly-used big data processing frameworks, such as Apache Spark and Flink, standardizing and simplifying the consumption of streams as tables in a wide variety of data analytics pipelines.

Redpanda supports https://iceberg.apache.org/spec/#format-versioning[version 2^] of the Iceberg table format.

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with a command (Redpanda supports). According to the style guide, sentences should not start with a command or code. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
Redpanda supports https://iceberg.apache.org/spec/#format-versioning[version 2^] of the Iceberg table format.
The Iceberg table format, supported by Redpanda, adheres to https://iceberg.apache.org/spec/#format-versioning[version 2^].

The original sentence starts with 'Redpanda supports', which is a command-like structure. By rephrasing it to 'The Iceberg table format, supported by Redpanda, adheres to...', we maintain the meaning while aligning with the style guide's recommendation. This structure is more descriptive and integrates the command into the sentence naturally.


== Iceberg concepts

https://iceberg.apache.org[Apache Iceberg^] is an open source format specification for defining structured tables in a data lake. The table format lets you quickly and easily manage, query, and process huge amounts of structured and unstructured data. This is similar to the way in which you would manage and run SQL queries against relational data in a database or data warehouse. The open format lets you use many different languages, tools, and applications to process the same data in a consistent way, so you can avoid vendor lock-in. This data management system is also known as a _data lakehouse_.

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with a command (Apache Iceberg). According to the style guide, sentences should not start with a command or code. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
https://iceberg.apache.org[Apache Iceberg^] is an open source format specification for defining structured tables in a data lake. The table format lets you quickly and easily manage, query, and process huge amounts of structured and unstructured data. This is similar to the way in which you would manage and run SQL queries against relational data in a database or data warehouse. The open format lets you use many different languages, tools, and applications to process the same data in a consistent way, so you can avoid vendor lock-in. This data management system is also known as a _data lakehouse_.
The open-source format specification, Apache Iceberg, is designed for defining structured tables in a data lake.

The original sentence started with Apache Iceberg, which was flagged by the style guide. By rephrasing the sentence to start with a descriptive phrase, we maintain the meaning while adhering to the style guide. This change also improves readability by providing context before introducing the specific term Apache Iceberg. The rest of the paragraph remains unchanged as it does not violate the style guide.

--
+
For Iceberg-enabled topics, the manifest files are in JSON format.
* Catalog: Contains the current metadata pointer for the table. Clients reading and writing data to the table see the same version of the current state of the table. The Iceberg integration supports two xref:manage:iceberg/use-iceberg-catalogs.adoc[catalog integration] types. You can configure Redpanda to catalog files stored in the same object storage bucket or container where the Iceberg data files are located, or you can configure Redpanda to use an https://iceberg.apache.org/terms/#decoupling-using-the-rest-catalog[Iceberg REST catalog^] endpoint to update an externally-managed catalog when there are changes to the Iceberg data and metadata.

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with a command (Catalog). According to the style guide, sentences should not start with a command or code. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
* Catalog: Contains the current metadata pointer for the table. Clients reading and writing data to the table see the same version of the current state of the table. The Iceberg integration supports two xref:manage:iceberg/use-iceberg-catalogs.adoc[catalog integration] types. You can configure Redpanda to catalog files stored in the same object storage bucket or container where the Iceberg data files are located, or you can configure Redpanda to use an https://iceberg.apache.org/terms/#decoupling-using-the-rest-catalog[Iceberg REST catalog^] endpoint to update an externally-managed catalog when there are changes to the Iceberg data and metadata.
The catalog contains the current metadata pointer for the table. Clients reading and writing data to the table see the same version of the current state of the table. The Iceberg integration supports two xref:manage:iceberg/use-iceberg-catalogs.adoc[catalog integration] types. You can configure Redpanda to catalog files stored in the same object storage bucket or container where the Iceberg data files are located, or you can configure Redpanda to use an https://iceberg.apache.org/terms/#decoupling-using-the-rest-catalog[Iceberg REST catalog^] endpoint to update an externally-managed catalog when there are changes to the Iceberg data and metadata.

The original sentence started with 'Catalog', which is not recommended according to the style guide. By rephrasing it to 'The catalog contains...', we maintain the original meaning while adhering to the style guidelines. This change improves readability and aligns with the documentation standards.


image::shared:iceberg-integration-optimized.png[Redpanda's Iceberg integration]

When you enable the Iceberg integration for a Redpanda topic, Redpanda brokers store streaming data in the Iceberg-compatible format in Parquet files in object storage, in addition to the log segments uploaded using Tiered Storage. Storing the streaming data in Iceberg tables in the cloud allows you to derive real-time insights through many compatible data lakehouse, data engineering, and business intelligence https://iceberg.apache.org/vendors/[tools^].

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with a command (Storing). According to the style guide, sentences should not start with a command or code. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
When you enable the Iceberg integration for a Redpanda topic, Redpanda brokers store streaming data in the Iceberg-compatible format in Parquet files in object storage, in addition to the log segments uploaded using Tiered Storage. Storing the streaming data in Iceberg tables in the cloud allows you to derive real-time insights through many compatible data lakehouse, data engineering, and business intelligence https://iceberg.apache.org/vendors/[tools^].
When you enable the Iceberg integration for a Redpanda topic, Redpanda brokers store streaming data in the Iceberg-compatible format in Parquet files in object storage, in addition to the log segments uploaded using Tiered Storage. This approach of storing the streaming data in Iceberg tables in the cloud allows you to derive real-time insights through many compatible data lakehouse, data engineering, and business intelligence https://iceberg.apache.org/vendors/[tools^].

The original sentence started with 'Storing', which is against the style guide. I've rephrased it to start with 'This approach of storing', which integrates the command into a descriptive sentence. This maintains the original meaning while adhering to the style guide. The rest of the sentence remains unchanged to preserve the original context and information.


== Limitations

* It is not possible to append topic data to an existing Iceberg table that is not created by Redpanda.

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with a command (It is not possible). According to the style guide, sentences should not start with a command or code. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
* It is not possible to append topic data to an existing Iceberg table that is not created by Redpanda.
* Appending topic data to an existing Iceberg table that was not created by Redpanda is not possible.

The original sentence starts with 'It is not possible,' which is a command-like structure. Rephrasing it to start with the subject 'Appending topic data' makes it more descriptive and aligns with the style guide. This change maintains the original meaning while adhering to the guidelines.

<new-topic-name> OK
----

. Register a schema for the topic. This step is required for the `value_schema_id_prefix` mode, but is optional otherwise.

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with a command (Register a schema). According to the style guide, sentences should not start with a command or code. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
. Register a schema for the topic. This step is required for the `value_schema_id_prefix` mode, but is optional otherwise.
To register a schema for the topic, complete this step, which is required for the `value_schema_id_prefix` mode but optional otherwise.

The original sentence began with a command, which is against the style guide's recommendations. The revised sentence starts with an introductory phrase, making it more descriptive and informative. This change maintains the original meaning while adhering to the style guide.

----

ifdef::env-byoc[]
To query the Iceberg table, you need access to the object storage bucket or container where the Iceberg data is stored. For BYOC clusters on AWS and GCP, the bucket name and table location are as follows:

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with a command (To query the Iceberg table). According to the style guide, sentences should not start with a command or code. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
To query the Iceberg table, you need access to the object storage bucket or container where the Iceberg data is stored. For BYOC clusters on AWS and GCP, the bucket name and table location are as follows:
Access to the object storage bucket or container where the Iceberg data is stored is required to query the Iceberg table. For BYOC clusters on AWS and GCP, the bucket name and table location are as follows:

The original sentence began with a command, which is not recommended according to the style guide. The rephrased sentence now starts with a descriptive statement, making it more aligned with the guidelines. This change improves readability and maintains the technical accuracy of the information.

|===
endif::[]

The Iceberg table is inside a namespace called `redpanda`, and has the same name as the Redpanda topic name. As you produce records to the topic, the data also becomes available in object storage for consumption by Iceberg-compatible clients. You can use the same analytical tools to xref:manage:iceberg/query-iceberg-topics.adoc[read the Iceberg topic data] in a data lake as you would for a relational database.

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with a command (The Iceberg table is inside a namespace). According to the style guide, sentences should not start with a command or code. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
The Iceberg table is inside a namespace called `redpanda`, and has the same name as the Redpanda topic name. As you produce records to the topic, the data also becomes available in object storage for consumption by Iceberg-compatible clients. You can use the same analytical tools to xref:manage:iceberg/query-iceberg-topics.adoc[read the Iceberg topic data] in a data lake as you would for a relational database.
The Iceberg table resides within a namespace called `redpanda`, sharing its name with the Redpanda topic. As records are produced to the topic, the data becomes available in object storage for Iceberg-compatible clients to consume. You can use the same analytical tools to xref:manage:iceberg/query-iceberg-topics.adoc[read the Iceberg topic data] in a data lake as you would for a relational database.

The original sentence started with a command, which is not recommended according to the style guide. The rephrased sentence integrates the command into a descriptive sentence, maintaining clarity and adhering to the style guide. The rest of the paragraph remains unchanged as it is clear and informative.
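
The naming rule in the flagged passage (the table lives in the `redpanda` namespace and takes the topic's name) can be sketched as a small shell helper. This is only an illustration: `iceberg_table_identifier` is a hypothetical name, and how a query engine spells a fully qualified table name (for example, with a leading catalog name) depends on the engine.

```shell
# Hypothetical helper: derive the Iceberg table identifier for a topic.
# Assumes only what the text states: namespace "redpanda", table name == topic name.
iceberg_table_identifier() {
  topic="$1"
  printf 'redpanda.%s\n' "$topic"
}

# Example: the ClickEvent topic used elsewhere in this PR.
iceberg_table_identifier ClickEvent
```

Under these assumptions, the `ClickEvent` topic maps to a table referenced as `redpanda.ClickEvent`.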


The xref:reference:properties/topic-properties.adoc#redpanda-iceberg-mode[`redpanda.iceberg.mode`] property determines how Redpanda maps the topic data to the Iceberg table structure. You can have the generated Iceberg table match the structure of an Avro or Protobuf schema in the Schema Registry, or you can use the `key_value` mode where Redpanda stores the record values as-is in the table.

The JSON Schema format is not supported. If your topic data is in JSON, it is recommended to use the `key_value` mode.

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with a command (The JSON Schema format). According to the style guide, sentences should not start with a command or code. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
The JSON Schema format is not supported. If your topic data is in JSON, it is recommended to use the `key_value` mode.
JSON Schema format is not supported. It is recommended to use the `key_value` mode if your topic data is in JSON.

The original sentence starts with 'The JSON Schema format,' which is interpreted as a command or code. By rephrasing, we maintain clarity while adhering to the style guide. The rephrased sentence now starts with 'JSON Schema format,' which is more descriptive and avoids the issue of starting with a command or code. The meaning remains unchanged, ensuring that users understand the recommendation to use 'key_value' mode for JSON data.


=== Iceberg modes and table schemas

For both `key_value` and `value_schema_id_prefix` modes, Redpanda writes to a `redpanda` table column that stores a single Iceberg https://iceberg.apache.org/spec/#nested-types[struct^] per record, containing nested columns of the metadata from each record, including the record key, headers, timestamp, the partition it belongs to, and its offset.

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with a command (For both `key_value` and `value_schema_id_prefix` modes). According to the style guide, sentences should not start with a command or code. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
For both `key_value` and `value_schema_id_prefix` modes, Redpanda writes to a `redpanda` table column that stores a single Iceberg https://iceberg.apache.org/spec/#nested-types[struct^] per record, containing nested columns of the metadata from each record, including the record key, headers, timestamp, the partition it belongs to, and its offset.
Redpanda writes to a `redpanda` table column in both `key_value` and `value_schema_id_prefix` modes. This column stores a single Iceberg https://iceberg.apache.org/spec/#nested-types[struct^] per record, containing nested columns of the metadata from each record, including the record key, headers, timestamp, the partition it belongs to, and its offset.

The original sentence started with a command, which is not recommended according to the style guide. The rephrased sentence now begins with a descriptive statement about Redpanda's action, followed by the modes in which it operates. This maintains clarity and adheres to the style guide.
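
To make the shape of the `redpanda` column more concrete, the following shell sketch prints the metadata of one record as dotted paths under a `redpanda` struct. The nested field names (`partition`, `offset`, `timestamp`, `key`) are assumptions chosen to match the prose, headers are omitted for brevity, and `describe_record_metadata` is a hypothetical helper. Check the actual table schema for the real column names and types.

```shell
# Hypothetical helper: print one record's metadata as fields nested under "redpanda".
# Arguments: partition, offset, timestamp, key (field names are illustrative assumptions).
describe_record_metadata() {
  printf 'redpanda.partition=%s\n' "$1"
  printf 'redpanda.offset=%s\n' "$2"
  printf 'redpanda.timestamp=%s\n' "$3"
  printf 'redpanda.key=%s\n' "$4"
}

# Example record: partition 0, offset 42, with the timestamp and key from the produce examples.
describe_record_metadata 0 42 '2024-11-25T20:23:59.380Z' key1
```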

Contributor

hyperlint-ai bot commented Mar 25, 2025

PR Change Summary

Introduced the Iceberg integration for Redpanda, enabling cloud storage of topic data in the Iceberg format for improved analytics.

  • Added comprehensive documentation for the Iceberg integration, including concepts, prerequisites, and limitations.
  • Provided detailed instructions for enabling Iceberg integration and configuring topics.
  • Included examples for querying Iceberg tables and managing schema support.

Added Files

  • modules/manage/partials/iceberg/about-iceberg-topics.adoc
  • modules/manage/partials/iceberg/query-iceberg-topics.adoc


@kbatuigas kbatuigas requested a review from simon0191 March 25, 2025 16:13
+
[,bash]
----
rpk registry schema create ClickEvent-value --schema path/to/schema.avsc --type avro

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with the command rpk registry schema create. According to the style guide, sentences should not start with a command. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
rpk registry schema create ClickEvent-value --schema path/to/schema.avsc --type avro
To create a schema for ClickEvent-value using Avro, use the command: `rpk registry schema create ClickEvent-value --schema path/to/schema.avsc --type avro`.

The original sentence starts with a command, which can be abrupt and lacks context. By rephrasing it to provide context first, it becomes clearer to the reader what the command is intended to do. This approach aligns with the style guide's recommendation to integrate commands into descriptive sentences.

+
[,bash]
----
echo '"key1" {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}' | rpk topic produce ClickEvent --format='%k %v\n' --schema-id=topic

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with the command echo. According to the style guide, sentences should not start with a command. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
echo '"key1" {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}' | rpk topic produce ClickEvent --format='%k %v\n' --schema-id=topic
To produce a ClickEvent with a specific schema, use the following command: echo '"key1" {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}' | rpk topic produce ClickEvent --format='%k %v\n' --schema-id=topic

The original sentence started with the command echo, which is against the style guide. By rephrasing it to start with a descriptive context, we maintain clarity and adhere to the style guide. The command itself remains unchanged to ensure technical accuracy. The rephrased sentence provides context about what the command does, which is helpful for users who may not be familiar with it.

+
[,bash]
----
echo 'key1 {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}' | rpk topic produce ClickEvent_key_value --format='%k %v\n'

Identified issues

  • Custom Style Guide (code-formatting.adoc) - The sentence starts with the command echo. According to the style guide, sentences should not start with a command. It should be rephrased to integrate the command into a descriptive sentence.

Proposed fix

Suggested change
echo 'key1 {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}' | rpk topic produce ClickEvent_key_value --format='%k %v\n'
To produce a message to the ClickEvent_key_value topic, use the command: echo 'key1 {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}' | rpk topic produce ClickEvent_key_value --format='%k %v\n'

The original sentence started with the command echo, which is against the style guide's recommendation. By rephrasing it to provide context, the sentence now explains the purpose of the command, making it more informative and aligned with the style guide.
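
For readers unfamiliar with the `--format='%k %v\n'` option in the produce commands above: each input line is split at the first space into a record key and a record value. The sketch below mimics that split with the shell's `read` builtin. It only illustrates the parsing rule and is not how `rpk` is implemented; `split_key_value` is a hypothetical name.

```shell
# Hypothetical helper: split one "key value" line the way '%k %v\n' is described:
# everything before the first space is the key, the rest of the line is the value.
split_key_value() {
  # read -r keeps backslashes literal; the last variable receives the remainder of the line.
  IFS=' ' read -r key value <<EOF
$1
EOF
  printf 'key=%s\nvalue=%s\n' "$key" "$value"
}

# Example input shaped like the key_value produce command.
split_key_value 'key1 {"user_id":2324,"event_type":"BUTTON_CLICK"}'
```

The value must not rely on the first space being preserved, because that space is consumed as the key/value delimiter.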


This provides read access to all snapshots written as of the specified table version (denoted by `version-number`).

NOTE: Redpanda automatically removes expired snapshots on a periodic basis. Snapshot expiry helps maintain a smaller metadata size and reduces the window available for <<time-travel-queries,time travel>>.

Identified issues

  • Custom Style Guide (styles.adoc) - The word 'NOTE' should be formatted as an AsciiDoc note admonition according to the style guide.

Proposed fix

Suggested change
NOTE: Redpanda automatically removes expired snapshots on a periodic basis. Snapshot expiry helps maintain a smaller metadata size and reduces the window available for <<time-travel-queries,time travel>>.
[NOTE]
====
Redpanda automatically removes expired snapshots on a periodic basis. Snapshot expiry helps maintain a smaller metadata size and reduces the window available for <<time-travel-queries,time travel>>.
====

The original text uses 'NOTE:' inline, which is not the correct format for an AsciiDoc note admonition. The correct format involves using a block with '[NOTE]' followed by '====' to start and end the note block. This change ensures compliance with the style guide and improves readability by clearly distinguishing the note from the rest of the text.


In the Iceberg specification, tables consist of the following layers:

* Data layer: Stores the data in data files. The Iceberg integration currently supports the Parquet file format. Parquet files are column-based and suitable for analytical workloads at scale. They come with compression capabilities that optimize files for object storage.

Identified issues

  • Custom Style Guide (styles.adoc) - The number '2' should be spelled out as 'two' according to the style guide rule for numbers between one and nine.

Proposed fix

Suggested change
* Data layer: Stores the data in data files. The Iceberg integration currently supports the Parquet file format. Parquet files are column-based and suitable for analytical workloads at scale. They come with compression capabilities that optimize files for object storage.
* Data layer: Stores the data in data files. The Iceberg integration currently supports the Parquet file format. Parquet files are column-based and suitable for analytical workloads at scale. They come with compression capabilities that optimize files for object storage.

The review process flagged an issue with the number '2', but there is no such number in the text. This appears to be a false positive. No changes are needed to comply with the style guide in this instance.

[,bash]
----
# Create new topic with five topic partitions, replication factor 1, and custom table partitioning for Iceberg
rpk topic create <new-topic-name> -p5 -r1 -c "redpanda.iceberg.partition.spec=(<partition-key1>, <partition-key2>, ...)"

Identified issues

  • Custom Style Guide (styles.adoc) - The number '1' should be spelled out as 'one' according to the style guide rule for numbers between one and nine.

Proposed fix

Suggested change
rpk topic create <new-topic-name> -p5 -r1 -c "redpanda.iceberg.partition.spec=(<partition-key1>, <partition-key2>, ...)"
rpk topic create <new-topic-name> -p5 -rone -c "redpanda.iceberg.partition.spec=(<partition-key1>, <partition-key2>, ...)"

The number '1' in the command line argument '-r1' should be spelled out as 'one' to comply with the style guide. However, this change might affect the command's functionality if the command-line interface does not recognize spelled-out numbers. It's important to verify whether the command-line tool accepts 'one' instead of '1'. If not, this might be a false positive from the style guide review process.
