compact: Block's checksum mismatched, but block doesn't seem broken #8611

@zayomeng

Description

Environment

Thanos version: v0.38.0
Prometheus version: v3.0.1 (using Thanos sidecar mode with kube-prometheus)
Object storage: MinIO (Build RELEASE.2024-04-18)
Deployment: Kubernetes

Compactor start command args:
- compact
- '--wait'
- '--log.level=info'
- '--log.format=logfmt'
- '--http-address=0.0.0.0:10912'
- '--data-dir=/var/thanos/compactor'
- '--debug.accept-malformed-index'
- '--retention.resolution-raw=180d'
- '--retention.resolution-5m=180d'
- '--retention.resolution-1h=180d'
- '--delete-delay=6h'
- '--objstore.config-file=/config/thanos.yaml'
- '--compact.enable-vertical-compaction'
- '--deduplication.replica-label="prometheus_replica"'
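
For reference, the file referenced by --objstore.config-file is a standard Thanos object storage config. A minimal sketch of what it looks like for MinIO (the bucket name, endpoint, and credentials below are placeholders, not our real values):

cat > /config/thanos.yaml <<'EOF'
type: S3
config:
  bucket: <BUCKET>
  endpoint: <MINIO_HOST>:<MINIO_PORT>
  access_key: <ACCESS_KEY>
  secret_key: <SECRET_KEY>
  insecure: true   # placeholder; assumes plain HTTP to MinIO, drop if TLS is used
EOF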

Problem description

We are experiencing repeated compactor halts due to block corruption errors:

ts=2025-12-29T07:34:53.981145537Z caller=compact.go:1162 level=info group="0@{origin_prometheus=\"<ORIGIN_PROMETHEUS>\", prometheus=\"<PROM_NAMESPACE>\", prometheus_replica=\"<PROM_REPLICA>\"}" groupKey=0@<GROUP_KEY> msg="compaction available and planned" plan="[<BLOCK_A> (min time: <TS_A_START>, max time: <TS_A_END>) <BLOCK_B> (min time: <TS_B_START>, max time: <TS_B_END>) <BLOCK_C> (min time: <TS_C_START>, max time: <TS_C_END>) <BLOCK_D> (min time: <TS_D_START>, max time: <TS_D_END>)]"
ts=2025-12-29T07:34:53.981221777Z caller=compact.go:1171 level=info group="0@{origin_prometheus=\"<ORIGIN_PROMETHEUS>\", prometheus=\"<PROM_NAMESPACE>\", prometheus_replica=\"<PROM_REPLICA>\"}" groupKey=0@<GROUP_KEY> msg="finished running pre compaction callback; downloading blocks" duration=2.26µs duration_ms=0 plan="[<BLOCK_A> <BLOCK_B> <BLOCK_C> <BLOCK_D>]"
ts=2025-12-29T07:35:21.408310162Z caller=fetcher.go:627 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.962019566s duration_ms=1962 cached=<NUM> returned=<NUM> partial=<NUM>
ts=2025-12-29T07:35:54.052260608Z caller=compact.go:1229 level=info group="0@{origin_prometheus=\"<ORIGIN_PROMETHEUS>\", prometheus=\"<PROM_NAMESPACE>\", prometheus_replica=\"<PROM_REPLICA>\"}" groupKey=0@<GROUP_KEY> msg="downloaded and verified blocks; compacting blocks" duration=1m0.071023471s duration_ms=60071 plan="[/var/thanos/compactor/compact/0@<GROUP_KEY>/<BLOCK_A> /var/thanos/compactor/compact/0@<GROUP_KEY>/<BLOCK_B> /var/thanos/compactor/compact/0@<GROUP_KEY>/<BLOCK_C> /var/thanos/compactor/compact/0@<GROUP_KEY>/<BLOCK_D>]"
ts=2025-12-29T07:36:21.009665584Z caller=fetcher.go:627 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.562688996s duration_ms=1562 cached=<NUM> returned=<NUM> partial=<NUM>
ts=2025-12-29T07:37:20.529264978Z caller=compact.go:559 level=error msg="critical error detected; halting" err="compaction: group 0@<GROUP_KEY>: compact blocks [/var/thanos/compactor/compact/0@<GROUP_KEY>/<BLOCK_A> /var/thanos/compactor/compact/0@<GROUP_KEY>/<BLOCK_B> /var/thanos/compactor/compact/0@<GROUP_KEY>/<BLOCK_C> /var/thanos/compactor/compact/0@<GROUP_KEY>/<BLOCK_D>]: cannot populate chunk <CHUNK_OFFSET> from block <BLOCK_D>: checksum mismatch expected:<EXPECTED>, actual:<ACTUAL>"

Because we don't have backup object storage, I assumed the blocks named in the compactor's log were completely broken and could not be repaired. So I stopped the Thanos compactor, used thanos tools bucket mark to mark those blocks for deletion, and ran thanos tools bucket cleanup to delete them immediately. Once these commands finished, I started the compactor again; it kept running for about 10 minutes and then halted again. I repeated the same procedure. After 10+ rounds of block removal, I began to suspect the problem was not with these blocks. I then ran thanos tools bucket verify -i index_known_issues --id <broken-block-id> (I had to restrict it like this because otherwise the tool downloads all blocks to a temp directory and does not clean them up, which ran my filesystem out of space), and it reported the block as OK without any error or warning logs.
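
The commands for that step were roughly the following (a sketch with placeholder block IDs; exact flag names may differ slightly between Thanos versions):

# mark a block reported in the compactor log for deletion
thanos tools bucket mark --objstore.config-file=/config/thanos.yaml --marker=deletion-mark.json --id=<BLOCK_D> --details="checksum mismatch reported by compactor"
# remove marked blocks right away instead of waiting for the delete delay
thanos tools bucket cleanup --objstore.config-file=/config/thanos.yaml --delete-delay=0s
# verify a single suspicious block against known index issues
thanos tools bucket verify --objstore.config-file=/config/thanos.yaml --issues=index_known_issues --id=<BLOCK_D>
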
After many more attempts, I finally used a client to download the most recent block with a checksum error to my local disk and ran promtool tsdb analyze to check whether it is broken. It can be read successfully. This left me confused, and I don't know how to get the compactor running correctly. I have about 20,000 blocks in MinIO.
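
A sketch of that download step, assuming the MinIO client (mc) with an alias named minio and a bucket named thanos (both placeholders for our real names):

# pull the suspicious block down from MinIO into a local data directory
mc cp --recursive minio/thanos/<BLOCK_D>/ ./data/<BLOCK_D>/
# the promtool analysis of that directory is shown below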

[root@<HOSTNAME> <USER>]# ./promtool tsdb analyze ./data
Block ID: <BLOCK_D>
Duration: <~2h>
Total Series: <NUM>
Label names: <NUM>
Postings (unique label pairs): <NUM>
Postings entries (total label pairs): <NUM>

Label pairs most involved in churning:
<COUNT> namespace=<NAMESPACE_A>
<COUNT> env=<ENV_A>
<COUNT> vendor=<VENDOR_A>
<COUNT> region=<REGION_A>
<COUNT> job=<JOB_A>
<COUNT> instance=<IP_A>:<PORT>
<COUNT> os=<OS_A>
<COUNT> department=<DEPT_A>
<COUNT> service=<SERVICE_A>

Label names most involved in churning:
__name__
instance
job
namespace
name
ip
env
os
region
vendor
department
device
pod
endpoint
service
container
node
metrics_path
id

Most common label pairs:
namespace=<NAMESPACE_A>
env=<ENV_A>
vendor=<VENDOR_A>
region=<REGION_A>
job=<JOB_A>
os=<OS_A>
service=<SERVICE_B>
endpoint=<ENDPOINT_A>

Label names with highest cumulative label value length:
__name__
mountpoint
url
id
path
name
walPath
device
address
container_id
uid
display_name
type
image
exported_name
UUID
image_id

Highest cardinality labels:
__name__
device
address
id
mountpoint
path
name
walPath
url
process_id
exported_name
type
display_name
uid
serial
container_id
tid
wwn
ip
instance

Highest cardinality metric names:
node_cpu_seconds_total
apiserver_request_duration_seconds_bucket
windows_service_status
container_cpu_usage_seconds_total
rest_client_request_duration_seconds_bucket
node_scrape_collector_duration_seconds

Steps I have tried that did not work

  1. Delete the blocks that show a checksum error in the compactor's logs. (The error then reappears on a different block.)
  2. Use thanos tools bucket verify to check all blocks in MinIO (see the sketch after this list). (No problems were reported.)
  3. Simply restart the compactor. (The same block still shows the error.)
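
The full-bucket check from step 2 looks roughly like this (a sketch; the config path is a placeholder, and without --id it inspects every block, which can use a lot of temp disk space):

# verify every block in the bucket against known index issues
thanos tools bucket verify --objstore.config-file=/config/thanos.yaml --issues=index_known_issues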
