Race condition in TestPartitionReader_ConsumeAtStartup #10686

Open
@charleskorn

Description

Failed in this CI run: https://github.com/grafana/mimir/actions/runs/13404725165/job/37442517090

==================
WARNING: DATA RACE
Read at 0x00c004f1f3c3 by goroutine 199722:
  testing.(*common).logDepth()
      /usr/local/go/src/testing/testing.go:1024 +0x504
  testing.(*common).log()
      /usr/local/go/src/testing/testing.go:1011 +0x7d
  testing.(*common).Log()
      /usr/local/go/src/testing/testing.go:1052 +0x55
  testing.(*T).Log()
      <autogenerated>:1 +0x4f
  github.com/grafana/mimir/pkg/util/test.(*TestingLogger).Log()
      /__w/mimir/mimir/pkg/util/test/logger.go:38 +0x177
  github.com/go-kit/log.(*context).Log()
      /__w/mimir/mimir/vendor/github.com/go-kit/log/log.go:168 +0x4ba
  github.com/grafana/mimir/pkg/storage/ingest.(*KafkaLogger).Log()
      /__w/mimir/mimir/pkg/storage/ingest/logger.go:34 +0x6a9
  github.com/twmb/franz-go/pkg/kgo.(*wrappedLogger).Log()
      /__w/mimir/mimir/vendor/github.com/twmb/franz-go/pkg/kgo/logger.go:123 +0xc6
  github.com/twmb/franz-go/pkg/kgo.(*consumer).assignPartitions()
      /__w/mimir/mimir/vendor/github.com/twmb/franz-go/pkg/kgo/consumer.go:976 +0x2ec
  github.com/twmb/franz-go/pkg/kgo.(*consumer).doOnMetadataUpdate.func1()
      /__w/mimir/mimir/vendor/github.com/twmb/franz-go/pkg/kgo/consumer.go:1208 +0x173
  github.com/twmb/franz-go/pkg/kgo.(*consumer).doOnMetadataUpdate.func2()
      /__w/mimir/mimir/vendor/github.com/twmb/franz-go/pkg/kgo/consumer.go:1220 +0x3a

Previous write at 0x00c004f1f3c3 by goroutine 184514:
  testing.tRunner.func1()
      /usr/local/go/src/testing/testing.go:1677 +0x8fa
  runtime.deferreturn()
      /usr/local/go/src/runtime/panic.go:605 +0x5d
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44

Goroutine 199722 (running) created at:
  github.com/twmb/franz-go/pkg/kgo.(*consumer).doOnMetadataUpdate()
      /__w/mimir/mimir/vendor/github.com/twmb/franz-go/pkg/kgo/consumer.go:1217 +0x187
  github.com/twmb/franz-go/pkg/kgo.(*Client).updateMetadataLoop()
      /__w/mimir/mimir/vendor/github.com/twmb/franz-go/pkg/kgo/metadata.go:269 +0x10b4
  github.com/twmb/franz-go/pkg/kgo.NewClient.gowrap1()
      /__w/mimir/mimir/vendor/github.com/twmb/franz-go/pkg/kgo/client.go:516 +0x33

Goroutine 184514 (finished) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1743 +0x825
  github.com/grafana/mimir/pkg/storage/ingest.TestPartitionReader_ConsumeAtStartup.func13()
      /__w/mimir/mimir/pkg/storage/ingest/reader_test.go:1584 +0x7c
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44
==================
--- FAIL: TestPartitionReader_ConsumeAtStartup (0.00s)
    --- FAIL: TestPartitionReader_ConsumeAtStartup/should_consume_partition_from_start_if_position=start,_and_wait_until_target_lag_is_honored (0.00s)
        --- FAIL: TestPartitionReader_ConsumeAtStartup/should_consume_partition_from_start_if_position=start,_and_wait_until_target_lag_is_honored/without_concurrency (3.73s)
            logger.go:38: 2025-02-19 03:22:24.780026811 +0000 UTC m=+39.643234918 level info component kafka_client msg immediate metadata update triggered why forced load because we are producing to a topic for the first time
            logger.go:38: 2025-02-19 03:22:24.780104696 +0000 UTC m=+39.643312773 level info component kafka_client msg producing to a new topic for the first time, fetching metadata to learn its partitions topic test
            logger.go:38: 2025-02-19 03:22:24.78108563 +0000 UTC m=+39.644293707 level info component kafka_client msg done waiting for metadata for new topic topic test
            logger.go:38: 2025-02-19 03:22:24.781151743 +0000 UTC m=+39.644359820 level info component kafka_client msg initializing producer id
            logger.go:38: 2025-02-19 03:22:24.7817063 +0000 UTC m=+39.644914377 level info component kafka_client msg producer id initialization success id 8850898308766554034 epoch 0
            reader_test.go:929: produced 2 records
            --- FAIL: TestPartitionReader_ConsumeAtStartup/should_consume_partition_from_start_if_position=start,_and_wait_until_target_lag_is_honored/without_concurrency/Run_2 (2.09s)
                testing.go:1399: race detected during execution of test
    --- FAIL: TestPartitionReader_ConsumeAtStartup/should_not_wait_indefinitely_if_there_are_no_records_to_consume_from_Kafka_but_partition_start_offset_is_>_0_(e.g._all_previous_records_have_been_deleted_by_Kafka_retention) (0.00s)
        logger.go:38: 2025-02-19 03:22:28.208577546 +0000 UTC m=+43.071785622 level info partition 1 component kafka_client msg assigning partitions why new assignments from direct consumer how assigning everything new, keeping current assignment input test[1{2 e-1 ce0}]
        --- FAIL: TestPartitionReader_ConsumeAtStartup/should_not_wait_indefinitely_if_there_are_no_records_to_consume_from_Kafka_but_partition_start_offset_is_>_0_(e.g._all_previous_records_have_been_deleted_by_Kafka_retention)/consume_from_position:_timestamp (0.00s)
            --- FAIL: TestPartitionReader_ConsumeAtStartup/should_not_wait_indefinitely_if_there_are_no_records_to_consume_from_Kafka_but_partition_start_offset_is_>_0_(e.g._all_previous_records_have_been_deleted_by_Kafka_retention)/consume_from_position:_timestamp/with_startup_concurrency (1.03s)
                logger.go:38: 2025-02-19 03:22:27.420871761 +0000 UTC m=+42.284079828 level info component kafka_client msg producing to a new topic for the first time, fetching metadata to learn its partitions topic test
                logger.go:38: 2025-02-19 03:22:27.420953454 +0000 UTC m=+42.284161531 level info component kafka_client msg immediate metadata update triggered why forced load because we are producing to a topic for the first time
                logger.go:38: 2025-02-19 03:22:27.421818651 +0000 UTC m=+42.285026728 level info component kafka_client msg done waiting for metadata for new topic topic test
                logger.go:38: 2025-02-19 03:22:27.421878332 +0000 UTC m=+42.285086409 level info component kafka_client msg initializing producer id
                logger.go:38: 2025-02-19 03:22:27.422574032 +0000 UTC m=+42.285782110 level info component kafka_client msg producer id initialization success id 1464923233257831146 epoch 0
                reader_test.go:1617: produced 2 records
                reader_test.go:1626: fetched partition end offset: 2
                reader_test.go:1634: advanced partition start offset to: 2
                logger.go:38: 2025-02-19 03:22:27.426967154 +0000 UTC m=+42.290175231 level info partition 1 task autocreate_topic msg topic already exists topic test num_partitions -1 replication_factor -1
                logger.go:38: 2025-02-19 03:22:27.431750625 +0000 UTC m=+42.294958702 level info partition 1 msg starting consumption from timestamp timestamp 1739935347425 last_consumed_offset 1 start_offset 2 consumer_group test-group
                logger.go:38: 2025-02-19 03:22:27.436095115 +0000 UTC m=+42.299303223 level info partition 1 component kafka_client msg immediate metadata update triggered why from AddConsumePartitions
                logger.go:38: 2025-02-19 03:22:27.436641817 +0000 UTC m=+42.299849894 level info partition 1 component kafka_client msg assigning partitions why new assignments from direct consumer how assigning everything new, keeping current assignment input test[1{2 e-1 ce0}]
                logger.go:38: 2025-02-19 03:22:27.436815081 +0000 UTC m=+42.300023158 level info partition 1 target_lag 1s max_lag 1s msg partition reader is starting to consume partition until target and max consumer lag is honored
                logger.go:38: 2025-02-19 03:22:27.436986451 +0000 UTC m=+42.300194528 level info partition 1 msg starting concurrent fetchers start_offset 2 concurrency 2 bytes_per_fetch_request 50000000
                logger.go:38: 2025-02-19 03:22:28.438383733 +0000 UTC m=+43.301591810 level info partition 1 target_lag 1s max_lag 1s msg partition reader found no records to consume because partition is empty partition_start_offset 2 last_produced_offset 1
                logger.go:38: 2025-02-19 03:22:28.438469042 +0000 UTC m=+43.301677109 level info partition 1 target_lag 1s max_lag 1s msg partition reader consumed partition and current lag is lower than configured target consumer lag last_consumed_offset 1 current_lag 0s
                logger.go:38: 2025-02-19 03:22:28.440247727 +0000 UTC m=+43.303455804 partition 1 fetcher 1 level debug msg fetched records duration 1.003171678s start_offset 2 end_offset 5002 asked_records 5000 got_records 0 diff_records 5000 asked_bytes 52500000 got_bytes 0 diff_bytes 52500000 first_timestamp  last_timestamp  hwm 0 lso 0 err <nil>
                logger.go:38: 2025-02-19 03:22:28.440347483 +0000 UTC m=+43.303555560 level info partition 1 msg stopping partition reader
                logger.go:38: 2025-02-19 03:22:28.440546285 +0000 UTC m=+43.303754352 level info partition 1 msg stopped concurrent fetchers last_returned_offset 1
                testing.go:1399: race detected during execution of test
        --- FAIL: TestPartitionReader_ConsumeAtStartup/should_not_wait_indefinitely_if_there_are_no_records_to_consume_from_Kafka_but_partition_start_offset_is_>_0_(e.g._all_previous_records_have_been_deleted_by_Kafka_retention)/consume_from_position:_end (0.00s)
            --- FAIL: TestPartitionReader_ConsumeAtStartup/should_not_wait_indefinitely_if_there_are_no_records_to_consume_from_Kafka_but_partition_start_offset_is_>_0_(e.g._all_previous_records_have_been_deleted_by_Kafka_retention)/consume_from_position:_end/with_startup_concurrency (2.03s)
                logger.go:38: 2025-02-19 03:22:27.487145605 +0000 UTC m=+42.350353682 level info component kafka_client msg producing to a new topic for the first time, fetching metadata to learn its partitions topic test
                logger.go:38: 2025-02-19 03:22:27.487248488 +0000 UTC m=+42.350456564 level info component kafka_client msg immediate metadata update triggered why forced load because we are producing to a topic for the first time
                logger.go:38: 2025-02-19 03:22:27.488230353 +0000 UTC m=+42.351438440 level info component kafka_client msg done waiting for metadata for new topic topic test
                logger.go:38: 2025-02-19 03:22:27.488299081 +0000 UTC m=+42.351507158 level info component kafka_client msg initializing producer id
                logger.go:38: 2025-02-19 03:22:27.488952653 +0000 UTC m=+42.352160730 level info component kafka_client msg producer id initialization success id 8326254509596344351 epoch 0
                reader_test.go:1617: produced 2 records
                reader_test.go:1626: fetched partition end offset: 2
                reader_test.go:1634: advanced partition start offset to: 2
                logger.go:38: 2025-02-19 03:22:27.495512685 +0000 UTC m=+42.358720752 level info partition 1 task autocreate_topic msg topic already exists topic test num_partitions -1 replication_factor -1
                logger.go:38: 2025-02-19 03:22:27.498719258 +0000 UTC m=+42.361927335 level info partition 1 msg starting consumption from timestamp timestamp 1739935347491 last_consumed_offset 1 start_offset 2 consumer_group test-group
                logger.go:38: 2025-02-19 03:22:27.508442422 +0000 UTC m=+42.371650499 level info partition 1 component kafka_client msg immediate metadata update triggered why from AddConsumePartitions
                logger.go:38: 2025-02-19 03:22:27.508942006 +0000 UTC m=+42.372150083 level info partition 1 component kafka_client msg assigning partitions why new assignments from direct consumer how assigning everything new, keeping current assignment input test[1{2 e-1 ce0}]
                logger.go:38: 2025-02-19 03:22:27.511696013 +0000 UTC m=+42.374904090 level info partition 1 target_lag 1s max_lag 1s msg partition reader is starting to consume partition until target and max consumer lag is honored
                logger.go:38: 2025-02-19 03:22:27.511891448 +0000 UTC m=+42.375099526 level info partition 1 msg starting concurrent fetchers start_offset 2 concurrency 2 bytes_per_fetch_request 50000000
                logger.go:38: 2025-02-19 03:22:28.512751706 +0000 UTC m=+43.375959773 partition 1 fetcher 1 level debug msg received an error while fetching records; will retry after processing received records (if any) duration 1.000777823s start_offset 2 end_offset 5002 asked_records 5000 got_records 0 diff_records 5000 asked_bytes 52500000 got_bytes 0 diff_bytes 52500000 first_timestamp  last_timestamp  hwm 0 lso 0 err fetch request failed with error: OFFSET_OUT_OF_RANGE: The requested offset is not within the range of offsets maintained by the server.
                logger.go:38: 2025-02-19 03:22:28.512832477 +0000 UTC m=+43.376040554 partition 1 fetcher 1 method concurrentFetcher.fetch.attempt level debug log_start_offset 2 start_offset 2 end_offset 5002 msg offset out of range; waiting for new records to be produced
                logger.go:38: 2025-02-19 03:22:29.514005638 +0000 UTC m=+44.377213715 partition 1 fetcher 1 level debug msg received an error while fetching records; will retry after processing received records (if any) duration 1.001120803s start_offset 2 end_offset 5002 asked_records 5000 got_records 0 diff_records 5000 asked_bytes 52500000 got_bytes 0 diff_bytes 52500000 first_timestamp  last_timestamp  hwm 0 lso 0 err fetch request failed with error: OFFSET_OUT_OF_RANGE: The requested offset is not within the range of offsets maintained by the server.
                logger.go:38: 2025-02-19 03:22:29.514054167 +0000 UTC m=+44.377262244 level info partition 1 target_lag 1s max_lag 1s msg partition reader found no records to consume because partition is empty partition_start_offset 2 last_produced_offset 1
                logger.go:38: 2025-02-19 03:22:29.514123057 +0000 UTC m=+44.377331134 partition 1 fetcher 1 method concurrentFetcher.fetch.attempt level debug log_start_offset 2 start_offset 2 end_offset 5002 msg offset out of range; waiting for new records to be produced
                logger.go:38: 2025-02-19 03:22:29.514190221 +0000 UTC m=+44.377398298 level info partition 1 target_lag 1s max_lag 1s msg partition reader consumed partition and current lag is lower than configured target consumer lag last_consumed_offset 1 current_lag 0s
                logger.go:38: 2025-02-19 03:22:29.515581612 +0000 UTC m=+44.378789679 partition 1 fetcher 1 level debug msg fetched records duration 1.295961ms start_offset 2 end_offset 5002 asked_records 5000 got_records 0 diff_records 5000 asked_bytes 52500000 got_bytes 0 diff_bytes 52500000 first_timestamp  last_timestamp  hwm 0 lso 0 err <nil>
                logger.go:38: 2025-02-19 03:22:29.515594658 +0000 UTC m=+44.378802735 level info partition 1 msg stopping partition reader
                logger.go:38: 2025-02-19 03:22:29.515786775 +0000 UTC m=+44.378994902 level info partition 1 msg stopped concurrent fetchers last_returned_offset 1
                testing.go:1399: race detected during execution of test
    --- FAIL: TestPartitionReader_ConsumeAtStartup/should_consume_partition_from_start_if_last_committed_offset_is_missing_and_wait_until_target_lag_is_honored (0.00s)
        --- FAIL: TestPartitionReader_ConsumeAtStartup/should_consume_partition_from_start_if_last_committed_offset_is_missing_and_wait_until_target_lag_is_honored/with_startup_concurrency (0.11s)
            logger.go:38: 2025-02-19 03:22:28.210688311 +0000 UTC m=+43.073896388 level info component kafka_client msg producing to a new topic for the first time, fetching metadata to learn its partitions topic test
            logger.go:38: 2025-02-19 03:22:28.210758432 +0000 UTC m=+43.073966509 level info component kafka_client msg immediate metadata update triggered why forced load because we are producing to a topic for the first time
            logger.go:38: 2025-02-19 03:22:28.21152789 +0000 UTC m=+43.074735967 level info component kafka_client msg initializing producer id
            logger.go:38: 2025-02-19 03:22:28.211631394 +0000 UTC m=+43.074839470 level info component kafka_client msg done waiting for metadata for new topic topic test
            logger.go:38: 2025-02-19 03:22:28.212179077 +0000 UTC m=+43.075387154 level info component kafka_client msg producer id initialization success id 2470798140527403964 epoch 0
            reader_test.go:639: produced 2 records
            testing.go:1399: race detected during execution of test
    --- FAIL: TestPartitionReader_ConsumeAtStartup/should_consume_partition_from_last_committed_offset_if_position=last-offset,_and_wait_until_target_lag_is_honored (0.00s)
        testing.go:1399: race detected during execution of test
FAIL
FAIL	github.com/grafana/mimir/pkg/storage/ingest	93.843s
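
From the trace, the racing read at testing.(*common).logDepth comes from a franz-go goroutine (started by (*Client).updateMetadataLoop and running (*consumer).assignPartitions) that logs through pkg/util/test.(*TestingLogger).Log, which forwards to testing.(*T).Log. The conflicting write is testing.tRunner.func1(), i.e. the subtest finishing its run. In other words, a background Kafka client goroutine appears to outlive the subtest and still logs to its *testing.T after the test has completed, which the race detector flags.

Below is a minimal sketch of one common mitigation pattern for this class of race: a test logger that detaches from testing.T once the test is over, so late log calls become no-ops. The raceSafeTestLogger name and all of its details are hypothetical illustrations, not the actual TestingLogger implementation in pkg/util/test nor the fix applied for this issue.

```go
package testutil

import (
	"fmt"
	"sync"
	"testing"
)

// raceSafeTestLogger forwards log calls to testing.T, but stops doing so once
// the test has finished. Background goroutines (for example a Kafka client's
// metadata loop) that log late then hit a no-op instead of racing with the
// test framework's teardown.
type raceSafeTestLogger struct {
	mtx sync.Mutex
	t   *testing.T // set to nil when the test completes
}

func newRaceSafeTestLogger(t *testing.T) *raceSafeTestLogger {
	l := &raceSafeTestLogger{t: t}

	// Detach from testing.T when the test finishes. The cleanup runs under the
	// same mutex used by Log(), so any later Log() call observes t == nil.
	t.Cleanup(func() {
		l.mtx.Lock()
		defer l.mtx.Unlock()
		l.t = nil
	})

	return l
}

// Log implements the go-kit log.Logger interface: Log(keyvals ...interface{}) error.
func (l *raceSafeTestLogger) Log(keyvals ...interface{}) error {
	l.mtx.Lock()
	defer l.mtx.Unlock()

	if l.t == nil {
		// The test already finished; drop the message rather than calling t.Log.
		return nil
	}
	l.t.Log(fmt.Sprint(keyvals...))
	return nil
}
```

An alternative (or complementary) approach is to make sure the kgo client is fully closed, and its goroutines drained, before each subtest returns, so nothing can log after tRunner has marked the test as done.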
