Commit dce02ab

Merge branch 'beta' into Doc-936
2 parents e3f177c + 5af796c commit dce02ab

8 files changed

+377
-23
lines changed

modules/get-started/pages/whats-new.adoc

+22
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,22 @@ Redpanda now supports normalization of Protobuf schemas in the Schema Registry.
3939

4040
You can now configure Kafka clients to authenticate using xref:manage:security/authentication.adoc#enable-sasl[SASL/PLAIN] with a single account using the same username and password. Unlike SASL/SCRAM, which uses a challenge-response exchange with hashed credentials, SASL/PLAIN transmits plaintext passwords. You enable SASL/PLAIN by appending `PLAIN` to the list of SASL mechanisms.
4141

42+
== New metrics
43+
44+
The following metrics are new in this version:
45+
46+
=== Consumer lag gauges
47+
48+
Redpanda can now expose dedicated consumer lag gauges that eliminate the need to calculate lag manually. These metrics provide real-time insights into consumer group performance and help identify issues. The following metrics are available:
49+
50+
- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_max[`redpanda_kafka_consumer_group_lag_max`]:
51+
Reports the maximum lag observed among all partitions for a consumer group. This metric helps pinpoint the partition with the greatest delay, indicating potential performance or configuration issues.
52+
53+
- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_sum[`redpanda_kafka_consumer_group_lag_sum`]:
54+
Aggregates the lag across all partitions, providing an overall view of data consumption delay for the consumer group.
55+
56+
See xref:manage:monitoring.adoc#consumers[Monitor consumer group lag] for more information.
57+
4258
== New cluster properties
4359

4460
The following cluster properties are new in this version:
@@ -67,6 +83,12 @@ The following cluster properties are new in this version:
6783
- xref:reference:properties/cluster-properties.adoc#raft_max_buffered_follower_append_entries_bytes_per_shard[`raft_max_buffered_follower_append_entries_bytes_per_shard`]: Limits the maximum bytes buffered for follower append entries per shard.
6884
- xref:reference:properties/cluster-properties.adoc#raft_max_inflight_follower_append_entries_requests_per_shard[`raft_max_inflight_follower_append_entries_requests_per_shard`]: Replaces the deprecated `raft_max_concurrent_append_requests_per_follower` to limit in-flight follower append requests per shard.
6985

86+
=== Tiered Storage
87+
88+
- xref:reference:properties/object-storage-properties.adoc#cloud_storage_enable_remote_allow_gaps[`cloud_storage_enable_remote_allow_gaps`]: Controls the eviction of locally stored log segments when Tiered Storage uploads are paused.
89+
90+
- xref:reference:properties/object-storage-properties.adoc#cloud_storage_enable_segment_uploads[`cloud_storage_enable_segment_uploads`]: Controls the upload of log segments to Tiered Storage. If set to `false`, this property temporarily pauses all log segment uploads from the Redpanda cluster.
91+
7092
=== TLS configuration
7193

7294
- xref:reference:properties/cluster-properties.adoc#tls_enable_renegotiation[`tls_enable_renegotiation`]: Enables support for TLS renegotiation.

modules/manage/partials/monitor-health.adoc

+123-4
@@ -209,13 +209,132 @@ Leaderless partitions can be caused by unresponsive brokers. When an alert on `r
209209

210210
Redpanda's Raft implementation exchanges periodic status RPCs between a broker and its peers. The xref:reference:public-metrics-reference.adoc#redpanda_node_status_rpcs_timed_out[`redpanda_node_status_rpcs_timed_out`] gauge increases when a status RPC times out for a peer, which indicates that a peer may be unresponsive and may lead to problems with partition replication that Raft manages. Monitor for non-zero values of this gauge, and correlate it with any logged errors or changes in partition replication.
211211

212-
=== Consumers
212+
[[consumers]]
213+
=== Consumer group lag
213214

214-
==== Consumer group lag
215+
Consumer group lag is an important performance indicator that measures the difference between the broker's latest (max) offset and the consumer group's last committed offset. The lag indicates how current the consumed data is relative to real-time production. A high or increasing lag means that consumers are processing messages slower than producers are generating them. A decreasing or stable lag implies that consumers are keeping pace with producers, ensuring real-time or near-real-time data consumption.
215216

216-
When working with Kafka consumer groups, the consumer group lag—the difference between the broker's latest (max) offset and the group's last committed offset—is a performance indicator of how fresh the data being consumed is. While higher lag for archival consumers is expected, high lag for real-time consumers could indicate that the consumers are overloaded and thus may need their topics to be partitioned more, or to spread the load to more consumers.
217+
By monitoring consumer lag, you can identify performance bottlenecks and make informed decisions about scaling consumers, tuning configurations, and improving processing efficiency.
217218

218-
To monitor consumer group lag, create a query with the xref:reference:public-metrics-reference.adoc#redpanda_kafka_max_offset[`redpanda_kafka_max_offset`] and xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_committed_offset[`redpanda_kafka_consumer_group_committed_offset`] gauges:
219+
A high maximum lag may indicate that a consumer is experiencing connectivity problems or cannot keep up with the incoming workload.
220+
221+
A high or increasing total lag (lag sum) suggests that the consumer group lacks sufficient resources to process messages at the rate they are produced. In such cases, scaling the number of consumers within the group can help, but only up to the number of partitions available in the topic. If lag persists despite increasing consumers, repartitioning the topic may be necessary to distribute the workload more effectively and improve processing efficiency.
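The relationship between per-partition lag, maximum lag, and total lag can be illustrated with a minimal sketch. The topic name, partition numbers, and offsets below are hypothetical sample data, and clamping negative differences to zero is an assumption made for the sketch, not documented Redpanda behavior.

```python
# Hypothetical sample data: (topic, partition) -> offset.
max_offsets = {("orders", 0): 1200, ("orders", 1): 950, ("orders", 2): 400}
committed_offsets = {("orders", 0): 1150, ("orders", 1): 950, ("orders", 2): 100}


def partition_lag(latest, committed):
    """Return per-partition lag: latest (max) offset minus last committed offset."""
    return {
        tp: max(latest[tp] - offset, 0)  # assumption: clamp negative lag to zero
        for tp, offset in committed.items()
        if tp in latest
    }


lags = partition_lag(max_offsets, committed_offsets)
lag_max = max(lags.values())  # the single partition that is furthest behind
lag_sum = sum(lags.values())  # total backlog across the consumer group
print(lags, lag_max, lag_sum)
```

In this sample, one partition accounts for most of the total lag, which points to a problem with a single consumer rather than a group that is under-provisioned overall.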
222+
223+
Redpanda provides the following methods for monitoring consumer group lag:
224+
225+
- <<dedicated-gauges, Dedicated gauges>>: Redpanda brokers can internally calculate consumer group lag and expose two dedicated gauges. This method is recommended when your observability platform does not support the complex queries required to calculate lag from offset metrics.
226+
+
227+
Enabling these gauges may add a small amount of additional processing overhead to the brokers.
228+
- <<offset-based-calculation, Offset-based calculation>>: You can use your observability platform to calculate consumer group lag from offset metrics. Use this method if your observability platform supports functions, such as `max()`, and you prefer to avoid additional processing overhead on the broker.
229+
230+
==== Dedicated gauges
231+
232+
Redpanda can internally calculate consumer group lag and expose it as two dedicated gauges.
233+
234+
- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_max[`redpanda_kafka_consumer_group_lag_max`]:
235+
Reports the maximum lag observed among all partitions for a consumer group. This metric helps pinpoint the partition with the greatest delay, indicating potential performance or configuration issues.
236+
237+
- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_sum[`redpanda_kafka_consumer_group_lag_sum`]:
238+
Aggregates the lag across all partitions, providing an overall view of data consumption delay for the consumer group.
239+
240+
To enable these dedicated gauges, you must enable consumer group metrics in your cluster properties. Add the following settings to your Redpanda configuration:
241+
242+
- xref:reference:properties/cluster-properties.adoc#enable_consumer_group_metrics[`enable_consumer_group_metrics`]: A list of consumer group metrics to enable. You must add the `consumer_lag` value to this list to enable consumer group lag metrics.
243+
- xref:reference:properties/cluster-properties.adoc#consumer_group_lag_collection_interval_sec[`consumer_group_lag_collection_interval_sec`] (optional): The interval in seconds for collecting consumer group lag metrics. The default is 60 seconds.
244+
+
245+
Set this value equal to the scrape interval of your metrics collection system. Aligning these intervals ensures synchronized data collection, reducing the likelihood of missing or misaligned lag measurements.
246+
247+
For example:
248+
249+
ifndef::env-kubernetes[]
250+
[,bash]
251+
----
252+
rpk cluster config set enable_consumer_group_metrics '["group", "partition", "consumer_lag"]'
253+
----
254+
endif::[]
255+
256+
ifdef::env-kubernetes[]
257+
[tabs]
258+
======
259+
Helm + Operator::
260+
+
261+
--
262+
.`redpanda-cluster.yaml`
263+
[,yaml]
264+
----
265+
apiVersion: cluster.redpanda.com/v1alpha2
266+
kind: Redpanda
267+
metadata:
268+
  name: redpanda
269+
spec:
270+
  chartRef: {}
271+
  clusterSpec:
272+
    config:
273+
      cluster:
274+
        enable_consumer_group_metrics:
275+
          - group
276+
          - partition
277+
          - consumer_lag
278+
----
279+
280+
```bash
281+
kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
282+
```
283+
284+
--
285+
Helm::
286+
+
287+
--
288+
[tabs]
289+
====
290+
--values::
291+
+
292+
.`enable-consumer-metrics.yaml`
293+
[,yaml]
294+
----
295+
config:
296+
  cluster:
297+
    enable_consumer_group_metrics:
298+
      - group
299+
      - partition
300+
      - consumer_lag
301+
----
302+
+
303+
```bash
304+
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
305+
  --values enable-consumer-metrics.yaml --reuse-values
306+
```
307+
308+
--set::
309+
+
310+
[,bash]
311+
----
312+
helm upgrade --install redpanda redpanda/redpanda \
313+
  --namespace <namespace> \
314+
  --create-namespace \
315+
  --set "config.cluster.enable_consumer_group_metrics[0]=group" \
316+
  --set "config.cluster.enable_consumer_group_metrics[1]=partition" \
317+
  --set "config.cluster.enable_consumer_group_metrics[2]=consumer_lag"
318+
----
319+
320+
====
321+
--
322+
======
323+
endif::[]
324+
325+
326+
When these properties are enabled, Redpanda computes the `redpanda_kafka_consumer_group_lag_max` and `redpanda_kafka_consumer_group_lag_sum` gauges and exposes them on the `/public_metrics` endpoint.
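As a rough illustration of consuming these gauges, the following sketch parses Prometheus text-format output and flags groups whose maximum lag exceeds a threshold. The sample payload, the `redpanda_group` label name, the group names, and the helper functions are all assumptions for the sketch. Check the actual output of your `/public_metrics` endpoint before relying on any of them.

```python
import re

# Hypothetical scrape payload; label names and values are illustrative only.
SAMPLE = '''\
# HELP redpanda_kafka_consumer_group_lag_max (illustrative help text)
redpanda_kafka_consumer_group_lag_max{redpanda_group="billing"} 300
redpanda_kafka_consumer_group_lag_sum{redpanda_group="billing"} 350
redpanda_kafka_consumer_group_lag_max{redpanda_group="audit"} 0
redpanda_kafka_consumer_group_lag_sum{redpanda_group="audit"} 0
'''

# Matches only the two lag gauges; comment lines and other metrics are skipped.
LINE = re.compile(
    r'^(redpanda_kafka_consumer_group_lag_(?:max|sum))'
    r'\{([^}]*)\}\s+([0-9.eE+-]+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_lag_gauges(text):
    """Return {group: {metric_name: value}} for the lag gauges found in text."""
    result = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        name, labels, value = match.groups()
        group = dict(LABEL.findall(labels)).get("redpanda_group", "")
        result.setdefault(group, {})[name] = float(value)
    return result


def groups_over(gauges, threshold):
    """Return sorted group names whose maximum lag exceeds threshold."""
    key = "redpanda_kafka_consumer_group_lag_max"
    return sorted(g for g, metrics in gauges.items() if metrics.get(key, 0) > threshold)


gauges = parse_lag_gauges(SAMPLE)
print(groups_over(gauges, 100))
```

In practice, a metrics collection system such as Prometheus would scrape the endpoint and evaluate an alert rule instead. The sketch only shows the shape of the data.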
327+
328+
==== Offset-based calculation
329+
330+
If your environment is sensitive to the performance overhead of the <<dedicated-gauges, dedicated gauges>>, use the offset-based calculation method to calculate consumer group lag. This method requires your observability platform to support functions like `max()`.
331+
332+
Redpanda provides two metrics to calculate consumer group lag:
333+
334+
- xref:reference:public-metrics-reference.adoc#redpanda_kafka_max_offset[`redpanda_kafka_max_offset`]: The broker's latest offset for a partition.
335+
- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_committed_offset[`redpanda_kafka_consumer_group_committed_offset`]: The last committed offset for a consumer group on that partition.
336+
337+
For example, here's a typical query to compute consumer lag:
219338

220339
[,promql]
221340
----

modules/reference/pages/properties/broker-properties.adoc

+20-6
@@ -38,7 +38,7 @@ Specifies the TLS configuration for the HTTP Admin API.
3838

3939
*Visibility:* `user`
4040

41-
*Default:* `null`
41+
*Default:* `{}`
4242

4343
---
4444

@@ -177,7 +177,7 @@ Transport Layer Security (TLS) configuration for the Kafka API endpoint.
177177

178178
*Visibility:* `user`
179179

180-
*Default:* `null`
180+
*Default:* `{}`
181181

182182
---
183183

@@ -221,6 +221,18 @@ Broker IDs are immutable. After a broker joins the cluster, its `node_id` *canno
221221

222222
---
223223

224+
=== node_id_overrides
225+
226+
List of broker ID and UUID overrides to apply at broker startup. Each entry includes the current UUID and the desired ID and UUID. An entry applies to a given broker only if 'current' matches that broker's current UUID.
227+
228+
*Visibility:* `user`
229+
230+
*Type:* array
231+
232+
*Default:* `{}`
233+
234+
---
235+
224236
=== openssl_config_file
225237

226238
Path to the configuration file used by OpenSSL to properly load the FIPS-compliant module.
@@ -307,7 +319,7 @@ The `seed_servers` list must be consistent across all seed brokers to prevent cl
307319

308320
*Type:* array
309321

310-
*Default:* `null`
322+
*Default:* `{}`
311323

312324
---
313325

@@ -373,7 +385,9 @@ For information on how to edit broker properties for the Schema Registry, see xr
373385

374386
=== api_doc_dir
375387

376-
API doc directory.
388+
Path to the API specifications for the HTTP Proxy API.
389+
390+
*Requires restart:* Yes
377391

378392
*Visibility:* `user`
379393

@@ -411,7 +425,7 @@ TLS configuration for Schema Registry API.
411425

412426
*Visibility:* `user`
413427

414-
*Default:* `null`
428+
*Default:* `{}`
415429

416430
---
417431

@@ -510,7 +524,7 @@ TLS configuration for Pandaproxy api.
510524

511525
*Visibility:* `user`
512526

513-
*Default:* `null`
527+
*Default:* `{}`
514528

515529
---
516530

modules/reference/pages/properties/cluster-properties.adoc

+94-1
@@ -414,6 +414,26 @@ This is an internal-only configuration and should be enabled only after consulti
414414

415415
---
416416

417+
=== consumer_group_lag_collection_interval_sec
418+
419+
How often to run the collection loop when <<enable_consumer_group_metrics,`enable_consumer_group_metrics`>> contains `consumer_lag`.
420+
421+
Reducing the value of `consumer_group_lag_collection_interval_sec` increases the metric collection frequency, which may raise resource utilization. In most environments, this impact is minimal, but it's best practice to monitor broker resource usage in high-scale settings.
422+
423+
*Unit:* seconds
424+
425+
*Requires restart:* No
426+
427+
*Visibility:* `tunable`
428+
429+
*Type:* integer
430+
431+
*Accepted values:* [`-17179869184`, `17179869183`]
432+
433+
*Default:* `60`
434+
435+
---
436+
417437
=== controller_backend_housekeeping_interval_ms
418438

419439
Interval between iterations of controller backend housekeeping loop.
@@ -812,6 +832,54 @@ Maximum amount of time the coordinator waits to snapshot after a command appears
812832

813833
---
814834

835+
=== datalake_scheduler_block_size_bytes
836+
837+
Size, in bytes, of each memory block reserved for record translation, as tracked by the datalake scheduler.
838+
839+
*Unit:* bytes
840+
841+
*Requires restart:* Yes
842+
843+
*Visibility:* `tunable`
844+
845+
*Type:* integer
846+
847+
*Default:* `4_mib`
848+
849+
---
850+
851+
=== datalake_scheduler_max_concurrent_translations
852+
853+
The maximum number of translations that the datalake scheduler allows to run at a given time. If a translation is requested while the number of running translations has reached this value, the request sleeps temporarily and polls until capacity becomes available.
854+
855+
*Requires restart:* Yes
856+
857+
*Visibility:* `tunable`
858+
859+
*Type:* integer
860+
861+
*Default:* `4`
862+
863+
---
864+
865+
=== datalake_scheduler_time_slice_ms
866+
867+
Time slice, in milliseconds, allotted to a datalake translation by the datalake scheduler. After a translation is scheduled, it runs until either the specified time has elapsed or all pending records on its source partition have been translated.
868+
869+
*Unit:* milliseconds
870+
871+
*Requires restart:* Yes
872+
873+
*Visibility:* `tunable`
874+
875+
*Type:* integer
876+
877+
*Accepted values:* [`-17592186044416`, `17592186044415`]
878+
879+
*Default:* `30000`
880+
881+
---
882+
815883
=== debug_bundle_auto_removal_seconds
816884

817885
If set, how long debug bundles are kept in the debug bundle storage directory after they are created. If not set, debug bundles are kept indefinitely.
@@ -1061,7 +1129,15 @@ Enables cluster metadata uploads. Required for xref:manage:whole-cluster-restore
10611129

10621130
=== enable_consumer_group_metrics
10631131

1064-
List of enabled consumer group metrics. Accepted Values: `group`, `partition`, `consumer_lag`.
1132+
List of enabled consumer group metrics. Accepted values include:
1133+
1134+
- `group`: Enables the xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_consumers[`redpanda_kafka_consumer_group_consumers`] and xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_topics[`redpanda_kafka_consumer_group_topics`] metrics.
1135+
- `partition`: Enables the xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_committed_offset[`redpanda_kafka_consumer_group_committed_offset`] metric.
1136+
- `consumer_lag`: Enables the xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_max[`redpanda_kafka_consumer_group_lag_max`] and xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_sum[`redpanda_kafka_consumer_group_lag_sum`] metrics.
1137+
+
1138+
Enabling `consumer_lag` may add a small amount of additional processing overhead to the brokers, especially in environments with a high number of consumer groups or partitions.
1139+
+
1140+
Use the xref:reference:properties/cluster-properties.adoc#consumer_group_lag_collection_interval_sec[`consumer_group_lag_collection_interval_sec`] property to control the frequency of consumer lag metric collection.
10651141

10661142
*Requires restart:* No
10671143

@@ -1071,6 +1147,9 @@ List of enabled consumer group metrics. Accepted Values: `group`, `partition`, `
10711147

10721148
*Default:* `["group", "partition"]`
10731149

1150+
*Related topics*:
1151+
1152+
- xref:manage:monitoring.adoc#consumers[Monitor consumer group lag]
10741153
---
10751154

10761155
=== enable_controller_log_rate_limiting
@@ -1712,6 +1791,20 @@ Default value for the `redpanda.iceberg.delete` topic property that determines i
17121791

17131792
---
17141793

1794+
=== iceberg_disable_automatic_snapshot_expiry
1795+
1796+
Whether to disable automatic Iceberg snapshot expiry. This property may be useful if the Iceberg catalog expects to perform snapshot expiry on its own.
1797+
1798+
*Requires restart:* No
1799+
1800+
*Visibility:* `user`
1801+
1802+
*Type:* boolean
1803+
1804+
*Default:* `false`
1805+
1806+
---
1807+
17151808
=== iceberg_disable_snapshot_tagging
17161809

17171810
Whether to disable tagging of Iceberg snapshots. These tags are used to ensure that the snapshots that Redpanda writes are retained during snapshot removal, which in turn, helps Redpanda ensure exactly-once delivery of records. Disabling tags is therefore not recommended, but may be useful if the Iceberg catalog does not support tags.
