
Commit b0ec36d

docs: create table update syntax (part 2)
1 parent 7fe4869 commit b0ec36d

4 files changed: +147 −1 lines changed


doc/user/config.toml

+5

@@ -198,6 +198,11 @@ weight = 30

  # allow <a name="link-target">, the old syntax no longer works
  unsafe = true

+ [markup]
+ [markup.highlight]
+ noClasses = false
+ style = "monokai"
+
  [[deployment.targets]]
  name = "production"
  url = "s3://materialize-website?region=us-east-1"

doc/user/content/sql/create-table.md

+47 −1

@@ -1,6 +1,6 @@

  ---
  title: "CREATE TABLE"
- description: "`CREATE TABLE` creates a table that is persisted in durable storage."
+ description: "Reference page for `CREATE TABLE`. `CREATE TABLE` creates a table that is persisted in durable storage."
  pagerank: 40
  menu:
  # This should also have a "non-content entry" under Reference, which is

@@ -186,6 +186,52 @@ See also [Materialize SQL data types](/sql/types/).

  {{</ tab >}}

+ {{< tab "Source-populated tables (Kafka/Redpanda source)" >}}
+
+ To create a table from a source, where the source maps to an external
+ Kafka/Redpanda system:
+
+ {{< note >}}
+
+ Users cannot write to source-populated tables; i.e., users cannot perform
+ [`INSERT`](/sql/insert/)/[`UPDATE`](/sql/update/)/[`DELETE`](/sql/delete/)
+ operations on source-populated tables.
+
+ {{</ note >}}
+
+ ```mzsql
+ CREATE TABLE <table_name> FROM SOURCE <source_name> [(REFERENCE <ref_object>)]
+ [FORMAT <format> | KEY FORMAT <format> VALUE FORMAT <format>
+   -- <format> can be:
+   --   AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION <conn_name>
+   --     [KEY STRATEGY
+   --       INLINE <schema> | ID <schema_registry_id> | LATEST ]
+   --     [VALUE STRATEGY
+   --       INLINE <schema> | ID <schema_registry_id> | LATEST ]
+   --   | PROTOBUF USING CONFLUENT SCHEMA REGISTRY CONNECTION <conn_name>
+   --   | PROTOBUF MESSAGE <msg_name> USING SCHEMA <encoded_schema>
+   --   | CSV WITH HEADER ( <col_name>[, ...]) [DELIMITED BY <char>]
+   --   | CSV WITH <num> COLUMNS DELIMITED BY <char>
+   --   | JSON | TEXT | BYTES
+ ]
+ [INCLUDE
+   KEY [AS <name>] | PARTITION [AS <name>] | OFFSET [AS <name>]
+   | TIMESTAMP [AS <name>] | HEADERS [AS <name>] | HEADER <key_name> AS <name> [BYTES]
+   [, ...]
+ ]
+ [ENVELOPE
+   NONE -- Default. Uses the append-only envelope.
+   | DEBEZIUM
+   | UPSERT [(VALUE DECODING ERRORS = INLINE [AS name])]
+ ]
+ ;
+ ```
+
+ {{% yaml-table data="syntax_options/create_table_options_source_populated_kafka"
+ %}}
+
+ {{</ tab >}}

  {{</ tabs >}}
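To make the new tab's syntax concrete, here is a minimal sketch of a statement it permits; the source name (`kafka_src`), reference (`purchases`), and table name are hypothetical and not part of this commit:

```mzsql
-- Hypothetical names: kafka_src is an existing Kafka/Redpanda source and
-- purchases is one of the topics (reference objects) it exposes.
CREATE TABLE purchases_tbl
FROM SOURCE kafka_src (REFERENCE purchases)
FORMAT JSON
ENVELOPE UPSERT;
```

Per the note above, such a table is read-only from SQL; its contents change only as the source ingests data.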

syntax_options/create_table_options_source_populated_kafka (new data file, referenced by the yaml-table shortcode above)

@@ -0,0 +1,87 @@
columns:
  - column: "Parameter"
  - column: "Description"
rows:
  - "Parameter": "`<table_name>`"
    "Description": |

      The name of the table to create. Names for tables must follow the [naming
      guidelines](/sql/identifiers/#naming-restrictions).

  - "Parameter": "`<source_name>`"
    "Description": |

      The name of the [source](/sql/create-source/kafka/) created for the Kafka topic.

  - "Parameter": "**(REFERENCE <ref_object>)**"
    "Description": |

      *Optional.* If specified, the topic (which should match the topic
      specified in the source) from which to create the table. You can create
      multiple tables from the same reference object.

      To find the reference objects available in your
      [source](/sql/create-source/), you can use the following query,
      substituting your source name for `<source_name>`:

      <br>

      ```mzsql
      SELECT refs.*
      FROM mz_internal.mz_source_references refs, mz_sources s
      WHERE s.name = '<source_name>' -- substitute with your source name
        AND refs.source_id = s.id;
      ```

  - "Parameter": |
      **FORMAT \<format\> |
      KEY FORMAT \<format\> VALUE FORMAT \<format\>**
    "Description": |

      *Optional.* If specified, use the specified format to decode the data. The following `<format>`s are supported:

      | Format | Description |
      |--------|-------------|
      | `AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION <csr_connection> [KEY STRATEGY <strategy> VALUE STRATEGY <strategy>]` | Decode the data as Avro, specifying the [Confluent Schema Registry connection](/sql/create-connection/#confluent-schema-registry) to use. You can also specify the `KEY STRATEGY` and `VALUE STRATEGY` to use: <table> <thead> <tr> <th>Strategy</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code>LATEST</code></td> <td>(Default) Use the latest writer schema from the schema registry as the reader schema.</td> </tr> <tr> <td><code>ID</code></td> <td>Use a specific schema from the registry.</td> </tr> <tr> <td><code>INLINE</code></td> <td>Use the inline schema.</td> </tr> </tbody> </table>|
      | `PROTOBUF USING CONFLUENT SCHEMA REGISTRY CONNECTION <csr_connection>` | Decode the data as Protocol Buffers, specifying the [Confluent Schema Registry connection](/sql/create-connection/#confluent-schema-registry) to use. |
      | `PROTOBUF MESSAGE <msg_name> USING SCHEMA <encoded_schema>` | Decode the data as Protocol Buffers, specifying the `<msg_name>` and the inline `<encoded_schema>` descriptor to use. |
      | `JSON` | Decode the data as JSON. |
      | `TEXT` | Decode the data as TEXT. |
      | `BYTES` | Decode the data as BYTES. |
      | `CSV WITH HEADER ( <col_name>[, ...]) [DELIMITED BY <char>]` | Parse the data as CSV with a header row. Materialize uses this header to infer both the number of columns and their names. The header is **not** ingested as data. The optional `DELIMITED BY <char>` clause specifies the delimiter character. <br><br>The data is decoded as [`text`](/sql/types/text). You can convert the data to other types using explicit [casts](/sql/functions/cast/) when creating views.|
      | `CSV WITH <num> COLUMNS DELIMITED BY <char>` | Parse the data as CSV with a specified number of columns and a specified delimiter. The columns are named `column1`, `column2`...`columnN`. <br><br>The data is decoded as [`text`](/sql/types/text). You can convert the data to other types using explicit [casts](/sql/functions/cast/) when creating views.|

      {{< include-md file="shared-content/kafka-format-envelope-compat-table.md"
      >}}

      For more information, see [Creating a source](/sql/create-source/kafka/#creating-a-source).

  - "Parameter": |
      **INCLUDE \<include_option\>**
    "Description": |

      *Optional.* If specified, include the additional information as column(s) in the table. The following `<include_option>`s are supported:

      | Option | Description |
      |--------|-------------|
      | **KEY [AS \<name\>]** | Include a column containing the Kafka message key. If the key is encoded using a format that includes schemas, the column will take its name from the schema. For unnamed formats (e.g., `TEXT`), the column will be named `key`. The column can be renamed with the optional **AS** *name* clause.
      | **PARTITION [AS \<name\>]** | Include a `partition` column containing the Kafka message partition. The column can be renamed with the optional **AS** *name* clause.
      | **OFFSET [AS \<name\>]** | Include an `offset` column containing the Kafka message offset. The column can be renamed with the optional **AS** *name* clause.
      | **TIMESTAMP [AS \<name\>]** | Include a `timestamp` column containing the Kafka message timestamp. The column can be renamed with the optional **AS** *name* clause. <br><br>Note that the timestamp of a Kafka message depends on how the topic and its producers are configured. See the [Confluent documentation](https://docs.confluent.io/3.0.0/streams/concepts.html?#time) for details.
      | **HEADERS [AS \<name\>]** | Include a `headers` column containing the Kafka message headers as a list of records of type `(key text, value bytea)`. The column can be renamed with the optional **AS** *name* clause.
      | **HEADER \<key\> AS \<name\> [BYTES]** | Include a *name* column containing the Kafka message header *key* parsed as a UTF-8 string. To expose the header value as `bytea`, use the `BYTES` option.

  - "Parameter": |
      **ENVELOPE \<envelope\>**
    "Description": |

      *Optional.* If specified, use the specified envelope. The following `<envelope>`s are supported:

      | Envelope | Description |
      |----------|-------------|
      | **ENVELOPE NONE** | *Default*. Use an append-only envelope. This means that records will only be appended and cannot be updated or deleted.
      | **ENVELOPE DEBEZIUM** | Use the Debezium envelope, which uses a diff envelope to handle CRUD operations. This envelope can lead to **high memory utilization** in the cluster maintaining the source. Materialize can automatically offload processing to disk as needed. See [spilling to disk](/sql/create-source/kafka/#spilling-to-disk) for details. For more information, see [Using Debezium](/sql/create-source/kafka/#using-debezium).
      | **ENVELOPE UPSERT** [**(VALUE DECODING ERRORS = INLINE)**] | Use the upsert envelope, which uses message keys to handle CRUD operations. To handle value decoding errors, use the `(VALUE DECODING ERRORS = INLINE)` option. For more information, see [Handling upserts](/sql/create-source/kafka/#handling-upserts) and [Value decoding errors](/sql/create-source/kafka/#value-decoding-errors).

      {{< include-md file="shared-content/kafka-format-envelope-compat-table.md" >}}
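
Read together, the FORMAT, INCLUDE, and ENVELOPE options compose into a single statement. As a hedged sketch only (the connection `csr_conn`, source `kafka_src`, and topic `orders` are hypothetical names, not part of this commit):

```mzsql
-- Hypothetical names throughout: csr_conn is a Confluent Schema Registry
-- connection, kafka_src is a Kafka source, orders is a topic it references.
CREATE TABLE orders_tbl
FROM SOURCE kafka_src (REFERENCE orders)
KEY FORMAT TEXT
VALUE FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION csr_conn
INCLUDE KEY AS order_key, TIMESTAMP AS kafka_ts
ENVELOPE UPSERT (VALUE DECODING ERRORS = INLINE);
```

And because CSV-decoded columns arrive as `text`, the usual follow-up is a view with explicit casts; again a sketch with hypothetical names:

```mzsql
-- Assumes a table created with CSV WITH 2 COLUMNS, yielding column1, column2 as text.
CREATE VIEW orders_typed AS
SELECT column1::int     AS order_id,
       column2::numeric AS amount
FROM orders_csv_tbl;
```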

shared-content/kafka-format-envelope-compat-table.md (new file)

@@ -0,0 +1,8 @@
The following table specifies the format and envelope compatibility:

| Format          | Append-only envelope | Upsert envelope | Debezium envelope |
|-----------------|:--------------------:|:---------------:|:-----------------:|
| Avro            | ✓                    | ✓               | ✓                 |
| Protobuf        | ✓                    | ✓               |                   |
| JSON/Text/Bytes | ✓                    | ✓               |                   |
| CSV             | ✓                    |                 |                   |
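
As an illustration of the Avro row (with hypothetical connection, source, and topic names), Avro-formatted data can pair with the Debezium envelope like so:

```mzsql
-- Hypothetical names: csr_conn, cdc_kafka_src, and customers are placeholders.
CREATE TABLE customers_tbl
FROM SOURCE cdc_kafka_src (REFERENCE customers)
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION csr_conn
ENVELOPE DEBEZIUM;
```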
