KafkaTopic replica increase stuck partially complete? #11256
-
It would be useful to have the topic operator log as well as the cruise control log to find out what's happening there. FYI @fvaleri
-
Hi, apologies, this was right at the end of my day and I didn't have time to dig out the full details! We have fixed the immediate issue by manually using kafka-reassign-partitions.sh to remove the extra replicas and force a reconciliation of the KafkaTopic resource, which then kicked off a Cruise Control job. However, I still have a couple that are broken to investigate, so...
Current state of the KafkaTopic resource
Relevant Operator logs
I've removed references to other Kafka clusters in the same Kubernetes cluster and some repeated podset reconciliation messages.
Topic Operator Logs
The Cruise Control logs are very noisy because of the constant output from metric collecting.
And I think this is the relevant task, though.
Cruise Control logs
At some point I also had to re-deploy Cruise Control with custom goals, as some of the proposals were being rejected due to CPU on a broker being at about 15%... So I re-deployed with just RackAwareGoal, ReplicaCapacityGoal and DiskCapacityGoal.
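The exact reassignment used for the manual fix was not shared, so the following is only a minimal sketch of how such a plan could be generated. It builds the JSON document that kafka-reassign-partitions.sh accepts, dropping one broker from every partition that lists it. The function name build_reassignment, the broker to drop, and the sample replica lists are all assumptions for illustration, not values confirmed in this thread.

```python
import json


def build_reassignment(topic, assignments, drop_broker):
    """Build a reassignment plan that removes drop_broker from each partition listing it.

    assignments maps partition number -> current replica list (hypothetical input,
    e.g. copied by hand from a kafka-topics.sh --describe run).
    """
    partitions = []
    for partition, replicas in sorted(assignments.items()):
        if drop_broker in replicas:
            partitions.append({
                "topic": topic,
                "partition": partition,
                # Keep the remaining replicas in their original order.
                "replicas": [b for b in replicas if b != drop_broker],
            })
    return {"version": 1, "partitions": partitions}


# Illustrative replica lists only; substitute the real ones for your topic.
current = {0: [3, 4], 1: [4, 3, 5], 2: [4, 3], 4: [4, 3, 5]}
plan = build_reassignment("search.core_folders", current, drop_broker=5)
print(json.dumps(plan, indent=2))
```

The printed JSON can be saved to a file (plan.json here is a placeholder name) and passed to kafka-reassign-partitions.sh with --bootstrap-server, --reassignment-json-file plan.json and --execute, followed by a run with --verify to confirm completion.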
-
@hamishforbes CC logs are incomplete, so it's hard to correlate. Checking the TO logs, we can analyze the messages related to the KafkaTopic named
In short, it is the expected behavior. I think you had multiple CC restarts that made it difficult to complete some replica changes. It would be interesting to check the full CC logs to see if there is any unexpected error on that side, or if the corresponding task was marked as completed when it was not. Additionally, if you are able to reproduce the issue, it would be good to enable CC client trace logging and capture the full TO and CC logs, so that we can easily correlate using
spec:
  entityOperator:
    topicOperator:
      logging:
        type: inline
        loggers:
          logger.ccc.name: io.strimzi.operator.topic.cruisecontrol.CruiseControlClientImpl
          logger.ccc.level: TRACE
Hope it helps.
-
Thanks for sharing CC logs.
Looking at the topic named
# CRUISE CONTROL
2025-03-12 01:22:55 INFO UserTaskManager:260 - Create a new UserTask 3dd86c42-804d-4903-ba13-cf8da85df586 with SessionKey SessionKey{_session=com.linkedin.kafka.cruisecontrol.servlet.ServletSession@64b82a10,_requestUrl=POST /kafkacruisecontrol/topic_configuration,_queryParams={dryrun=[false], json=[true], skip_rack_awareness_check=[true]}}
...
2025-03-12 01:25:47 INFO Executor:1384 - User task 3dd86c42-804d-4903-ba13-cf8da85df586: Execution succeeded: removed brokers: null; demoted brokers: null; total time used: 171263ms.
# TOPIC OPERATOR
2025-03-12 01:22:55 INFO CruiseControlHandler:160 - Replicas change pending, Topics: [search.core_organisations, search.core_groups, debezium-offsets, blink.flows.schema-changes, blink.core.schema-changes, blink.surveys.db-update-events, search.core_messages, debezium.heartbeat.blink.core, search.core_attachments, search.core_folders, search.core_analytics_feed_events, groups-membership-processing-events, search.documents.processing, m365-events-topic, search.core_users, blink.surveys.schema-changes, blink.noop, search.core_feed_events, search.core_analytics_feed_event_likes]
...
2025-03-12 01:26:53 INFO CruiseControlHandler:190 - Replicas change completed, Topics: [search.core_users, blink.surveys.schema-changes, debezium-offsets, groups-membership-processing-events, search.core_analytics_feed_events, blink.surveys.db-update-events, search.core_messages, search.documents.processing, debezium.heartbeat.blink.core, search.core_folders, m365-events-topic, blink.core.schema-changes, search.core_attachments, search.core_groups, search.core_organisations, blink.flows.schema-changes]
If we take
# KAFKA
$ kafka-topics.sh $KAFKA_ARGS --describe --topic search.core_folders
Topic: search.core_folders TopicId: 7gMTZZsRScCUBePMCkKb0Q PartitionCount: 9 ReplicationFactor: 2 Configs: min.insync.replicas=1,flush.messages=1000,retention.bytes=1073741824
Topic: search.core_folders Partition: 0 Leader: 3 Replicas: 3,4 Isr: 4,3
Topic: search.core_folders Partition: 1 Leader: 4 Replicas: 4,3,5 Isr: 4,5,3
Topic: search.core_folders Partition: 2 Leader: 4 Replicas: 4,3 Isr: 4,3
Topic: search.core_folders Partition: 3 Leader: 3 Replicas: 3,4 Isr: 4,3
Topic: search.core_folders Partition: 4 Leader: 4 Replicas: 4,3,5 Isr: 4,5,3
Topic: search.core_folders Partition: 5 Leader: 3 Replicas: 3,4 Isr: 4,3
Topic: search.core_folders Partition: 6 Leader: 3 Replicas: 3,4 Isr: 4,3
Topic: search.core_folders Partition: 7 Leader: 4 Replicas: 4,3 Isr: 4,3
Topic: search.core_folders Partition: 8 Leader: 4 Replicas: 4,3 Isr: 4,3
# CRUISE CONTROL
2025-03-12 01:23:41 INFO Executor:1892 - User task 3dd86c42-804d-4903-ba13-cf8da85df586: Finished tasks: [{EXE_ID: 80, INTER_BROKER_REPLICA_ACTION, {__strimzi-topic-operator-kstreams-topic-store-changelog-0, oldLeader: 4, [4, 3] -> [3, 4]}, COMPLETED}, {EXE_ID: 25, INTER_BROKER_REPLICA_ACTION, {search.core_folders-2, oldLeader: 4, [4, 3] -> [4, 3, 5]}, COMPLETED}, {EXE_ID: 26, INTER_BROKER_REPLICA_ACTION, {search.documents.processing-1, oldLeader: 4, [4, 3] -> [4, 3, 5]}, COMPLETED}, {EXE_ID: 27, INTER_BROKER_REPLICA_ACTION, {kafka-monitoring-8, oldLeader: 5, [5, 3] -> [3, 5]}, COMPLETED}, {EXE_ID: 28, INTER_BROKER_REPLICA_ACTION, {search.core_messages-8, oldLeader: 4, [4, 3] -> [3, 4, 5]}, COMPLETED}, {EXE_ID: 29, INTER_BROKER_REPLICA_ACTION, {search.documents.processing-0, oldLeader: 4, [3, 4] -> [3, 4, 5]}, COMPLETED}].
Can you send the updated KafkaTopic with status and kafka-topics.sh output for
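To spot partially applied replica changes like the one in the describe output quoted above, a small script can count replicas per partition and flag the ones that are not at the target. This is a rough sketch, not part of Strimzi or Kafka tooling: the helper names replica_counts and partitions_not_at are made up for illustration, and the regular expression assumes the line layout shown in this thread.

```python
import re

# Matches per-partition lines; the topic summary line ("PartitionCount: ...") does not match.
PARTITION_LINE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(\S+)\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]+)"
)


def replica_counts(describe_output):
    """Map partition number -> number of assigned replicas."""
    counts = {}
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match:
            counts[int(match.group(1))] = len(match.group(3).split(","))
    return counts


def partitions_not_at(describe_output, target):
    """Return the sorted partitions whose replica count differs from target."""
    return sorted(p for p, n in replica_counts(describe_output).items() if n != target)


# Three lines taken from the describe output in this thread.
sample = """\
Topic: search.core_folders Partition: 0 Leader: 3 Replicas: 3,4 Isr: 4,3
Topic: search.core_folders Partition: 1 Leader: 4 Replicas: 4,3,5 Isr: 4,5,3
Topic: search.core_folders Partition: 2 Leader: 4 Replicas: 4,3 Isr: 4,3
"""
print(partitions_not_at(sample, target=3))  # -> [0, 2]
```

Running it against the full describe output with target=3 would list the partitions still waiting for their third replica, which can then be compared against the tasks Cruise Control reports as COMPLETED.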
-
I'm using KafkaTopic resources and cruise control to try and add a replica to a topic (in fact to several topics, most of which worked).
But they've got stuck halfway and the topic operator doesn't seem to be trying to make progress at all.
So I started with a topic set to partitions: 9 and replicas: 2. I enabled cruise control and then changed the replicas field to 3.
Out of ~30 topics, 3 have got stuck with a couple of partitions having 3 replicas and the rest having 2
e.g.
I've tried flipping back to 2 replicas, down to 1, up to 4, but from what I can see the topic operator is not submitting anything to cruise control to make changes, although it is making other topic config changes (e.g. retention or min isr).
Any idea how I can get this unstuck?