Guide for troubleshooting slow insert performance #652

@mshustov

Description

We need comprehensive documentation to help users diagnose and resolve slow insert performance issues with the ClickHouse Kafka Connect Sink connector. Slow inserts directly lead to increased consumer lag, which can cascade into data freshness issues and potential data loss in extreme cases.

Proposed structure:

  1. Overview: How insert time directly impacts consumer lag and what metrics to track
  2. Root cause categories
    2.1 ClickHouse Server-Side Issues

Symptoms:

  • All connector tasks experiencing slowdown simultaneously
  • ClickHouse system metrics showing degradation

Diagnostic Steps:

Check ClickHouse server metrics (example queries after this list):

  • CPU usage
  • Memory pressure and cache hit rates
  • Disk I/O utilization and queue depth
  • Number of running queries (system.processes)
  • Merge operations in progress (system.merges)
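
The last two checks map onto standard ClickHouse system tables. A minimal sketch of the system.processes and system.merges queries (run via clickhouse-client; adjust ordering and limits to taste):

```sql
-- Currently running queries: look for long-running INSERTs or heavy SELECTs
SELECT query_id, user, elapsed, read_rows, memory_usage, query
FROM system.processes
ORDER BY elapsed DESC;

-- Background merges in flight, which compete with inserts for disk I/O
SELECT database, table, elapsed, progress, num_parts
FROM system.merges
ORDER BY elapsed DESC;
```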

Analyze query log for slow inserts (system.query_log)
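
For example, a sketch that surfaces the slowest recent inserts (query logging is enabled by default; widen the time window as needed):

```sql
-- Slowest INSERTs completed in the last hour
SELECT event_time, query_duration_ms, written_rows, written_bytes, query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 20;
```
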
Check for table-level issues (example queries after this list):

  • Too many parts (system.parts - parts count per partition)
  • Mutations in progress
  • Replication lag (if using replicated tables)
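
Sketches of the corresponding system-table queries (substitute your own database and table names):

```sql
-- Active parts per partition; sustained high counts slow inserts and merges
SELECT database, table, partition, count() AS parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY parts DESC
LIMIT 20;

-- Unfinished mutations
SELECT database, table, mutation_id, create_time, parts_to_do
FROM system.mutations
WHERE NOT is_done;

-- Replication queue backlog (replicated tables only)
SELECT database, table, count() AS queued
FROM system.replication_queue
GROUP BY database, table;
```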

Measure network latency between the Connect worker and ClickHouse

Common Causes:

  • Resource exhaustion (CPU, memory, disk)
  • Heavy concurrent query load
  • Large background merge operations

Resolution Strategies:

  • Scale ClickHouse resources (vertical/horizontal)
  • Optimize table schema and partition strategy
  • Tune ClickHouse settings (max_insert_threads, max_threads, etc.; see the sketch after this list)
  • Schedule heavy queries during off-peak hours
  • Review and optimize merge settings
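
A small sketch for auditing the settings mentioned above before changing them (these are real ClickHouse setting names; appropriate values depend on your hardware and workload):

```sql
-- Current values of insert-related settings (changed = 1 means non-default)
SELECT name, value, changed
FROM system.settings
WHERE name IN ('max_insert_threads', 'max_threads', 'max_insert_block_size');
```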

2.2 Batch Size and Payload Characteristics
Symptoms:

  • Gradual degradation over time
  • Correlation with message size increase in Kafka topics
  • Per-record processing time remains constant, but batch processing time increases

Diagnostic Steps:

Monitor connector metrics:

  • Batch size
  • Records per batch
  • Average message size from Kafka topics

Check connector configuration: TBD
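
While the exact property list is still TBD, batch sizing for a sink connector is largely governed by the standard Kafka Connect consumer overrides. A hedged sketch, assuming the worker permits overrides (connector.client.config.override.policy=All); values are illustrative only:

```properties
# Upper bound on records per poll, and therefore per insert batch
consumer.override.max.poll.records=5000
# Raise fetch limits so the consumer can actually fill large batches
consumer.override.fetch.max.bytes=52428800
consumer.override.max.partition.fetch.bytes=10485760
```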

Analyze payload characteristics (example query after this list):

  • Row width (number of columns)
  • Data types and their sizes (e.g., large String fields)
  • Compression ratio

Common Causes:

  • Batch size too large for available memory
  • Message size growth (e.g., new fields added to schema)
  • Inefficient data format or serialization/deserialization
  • Network bandwidth saturation
  • Memory pressure causing GC pauses

Resolution Strategies:

  • Tune batch size parameters to balance throughput and latency
  • Increase tasks.max to parallelize inserts (see the sketch after this list)
  • Enable compression in Kafka and ClickHouse
  • Optimize schema (e.g., use appropriate data types, normalize large fields)
  • Scale Connect worker resources
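
As a concrete starting point, a hedged excerpt of the parallelism and batch knobs (standard Kafka Connect properties; values are illustrative):

```properties
# Parallelism: one task per topic partition is the effective maximum
tasks.max=8
# Batch tuning via standard consumer overrides
consumer.override.max.poll.records=2000
consumer.override.fetch.min.bytes=1048576
```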

2.3 Client Timeouts and Error Handling
Symptoms:

  • Intermittent spikes in insert time
  • Error logs showing timeouts or retries
  • Consumer lag spikes followed by recovery
  • Increased retry attempts

Diagnostic Steps:

Review Connect worker logs for:

  • Timeout exceptions
  • Connection reset errors
  • HTTP client errors (if using HTTP protocol)
  • SSL/TLS handshake failures

Check connector error metrics (MBean object names after this list):

  • sink-record-send-total vs sink-record-active-count
  • Task failure count
  • Retry attempts
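
These counters live on the standard Kafka Connect MBeans; a sketch of the object names to poll (fill in your connector name and task id):

```
kafka.connect:type=sink-task-metrics,connector=<name>,task=<id>
    -> sink-record-send-total, sink-record-active-count
kafka.connect:type=task-error-metrics,connector=<name>,task=<id>
    -> total-retries, total-record-failures, deadletterqueue-produce-requests
```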

Network diagnostics:

  • Packet loss between Connect worker and ClickHouse
  • DNS resolution issues
  • Firewall/security group changes
  • Load balancer health

Common Causes:

  • Network instability or misconfiguration
  • ClickHouse connection pool exhaustion
  • Timeout settings too aggressive
  • ClickHouse overloaded and rejecting connections
  • Transient errors not handled gracefully
  • Dead letter queue issues causing task restarts

Resolution Strategies:

  • Tune timeout settings: TODO
  • Implement exponential backoff on retries
  • Configure an appropriate connection pool size
  • Enable error tolerance and a DLQ (see the sketch after this list):
    • errors.tolerance=all
    • errors.deadletterqueue.topic.name
  • Monitor and alert on network issues
  • Use connection keep-alive settings
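
A hedged sketch of the error-tolerance half (the errors.* properties are standard Kafka Connect; the DLQ topic name is a placeholder):

```properties
# Tolerate bad records instead of failing the task
errors.tolerance=all
errors.deadletterqueue.topic.name=clickhouse-sink-dlq
errors.deadletterqueue.context.headers.enable=true
# Built-in retry window and backoff cap for retriable errors
errors.retry.timeout=300000
errors.retry.delay.max.ms=60000
```
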
  3. Additional Factors to Consider
3.1 Kafka Consumer-Side Issues

  • Consumer group rebalancing
  • Fetch size and throughput from Kafka
  • Topic partition count vs tasks.max alignment
  • Offset commit frequency impacting throughput

3.2 Connect Worker Resource Constraints

  • JVM heap pressure and GC pauses
  • CPU saturation on Connect worker
  • Thread pool exhaustion
  • Memory leaks in connector or converters

3.3 Data Transformation Overhead

  • Complex Single Message Transforms (SMTs)
  • Inefficient converters (Avro, JSON, Protobuf)
  • Schema registry latency

3.4 ClickHouse Table Design Issues

  • Inappropriate PRIMARY KEY / ORDER BY
  • Missing or inefficient indexes
  • Codec selection for compression (see the DDL sketch after this list)
  • TTL rules causing excessive mutations
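
To make the codec and ORDER BY points concrete, a hypothetical DDL sketch (table, columns, and codec choices invented for illustration):

```sql
CREATE TABLE events
(
    tenant  LowCardinality(String),       -- dictionary-encodes a low-cardinality dimension
    ts      DateTime CODEC(Delta, ZSTD),  -- delta-encodes monotonic timestamps
    payload String CODEC(ZSTD(3))         -- heavier compression for wide fields
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (tenant, ts);  -- sort key should match the most common filter columns
```
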
  4. Monitoring and Observability
    Key Metrics to Track:
    Connector-Level Metrics (JMX): TODO

  5. Step-by-Step Troubleshooting Workflow

  6. Configuration Best Practices
    Include a section with recommended configuration templates for (a starting-point sketch follows the list):

  • High-throughput scenarios
  • Low-latency scenarios
  • High-availability setups
  • Resource-constrained environments
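
As a seed for the high-throughput template, a hypothetical sketch: the errors.* and consumer.override.* keys are standard Kafka Connect, while the connector-specific keys should be verified against your connector version:

```properties
connector.class=com.clickhouse.kafka.connect.ClickHouseSinkConnector
topics=events
hostname=clickhouse.internal
port=8443
ssl=true
database=default
tasks.max=8
consumer.override.max.poll.records=5000
consumer.override.fetch.max.bytes=52428800
errors.tolerance=all
errors.deadletterqueue.topic.name=clickhouse-sink-dlq
```
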
  7. Common Anti-Patterns to Avoid
  • Setting batch size too small (increases insert frequency)
  • Not enabling compression
  • Using too few tasks for high-volume topics
  • Using ClickHouse Async INSERT feature for latency-sensitive scenarios
  • Not monitoring parts count
  • Not using DLQ
