Description
We need comprehensive documentation to help users diagnose and resolve slow insert performance issues with the ClickHouse Kafka Connect Sink connector. Slow inserts directly lead to increased consumer lag, which can cascade into data freshness issues and potential data loss in extreme cases.
Proposed structure:
1. Overview: How insert time directly impacts consumer lag and what metrics to track
2. Root cause categories
2.1 ClickHouse Server-Side Issues
Symptoms:
- All connector tasks experiencing slowdown simultaneously
- ClickHouse system metrics showing degradation
Diagnostic Steps (sample queries after this list):
1. Check ClickHouse server metrics:
   - CPU usage
   - Memory pressure and cache hit rates
   - Disk I/O utilization and queue depth
   - Number of running queries (system.processes)
   - Merge operations in progress (system.merges)
2. Analyze the query log for slow inserts (system.query_log)
3. Check for table-level issues:
   - Too many parts (system.parts - parts count per partition)
   - Mutations in progress
   - Replication lag (if using replicated tables)
4. Check network latency between the Connect worker and ClickHouse
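For the ClickHouse-side checks above, a few example queries against the standard system tables can serve as a starting point; `target_table` is a placeholder for the sink's destination table, so adjust names and time windows to your setup.

```sql
-- Currently running queries and their elapsed time
SELECT query_id, user, elapsed, read_rows, memory_usage
FROM system.processes;

-- Merges and mutations in progress
SELECT database, table, elapsed, progress FROM system.merges;
SELECT database, table, command, parts_to_do FROM system.mutations WHERE NOT is_done;

-- Active parts per partition (high counts degrade insert and merge performance)
SELECT database, table, partition, count() AS part_count
FROM system.parts
WHERE active AND table = 'target_table'
GROUP BY database, table, partition
ORDER BY part_count DESC;

-- Slowest INSERT queries over the last hour
SELECT event_time, query_duration_ms, written_rows, written_bytes, query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 20;

-- Replication delay for replicated tables
SELECT database, table, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 0;
```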
Common Causes:
- Resource exhaustion (CPU, memory, disk)
- Heavy concurrent query load
- Large background merge operations
Resolution Strategies:
- Scale ClickHouse resources (vertical/horizontal)
- Optimize table schema and partition strategy
- Tune ClickHouse settings (max_insert_threads, max_threads, etc.); see the queries after this list
- Schedule heavy queries during off-peak hours
- Review and optimize merge settings
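As a small illustration of the settings review, the queries below read the current values of the insert-threading settings and of the MergeTree thresholds that delay or reject inserts when parts pile up; treat the listed setting names as a starting point rather than an exhaustive set.

```sql
-- Current values of the query/insert threading settings mentioned above
SELECT name, value, changed
FROM system.settings
WHERE name IN ('max_insert_threads', 'max_threads');

-- MergeTree thresholds that delay or reject inserts when too many parts accumulate
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert', 'max_parts_in_total');
```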
2.2 Batch Size and Payload Characteristics
Symptoms:
- Gradual degradation over time
- Correlation with message size increase in Kafka topics
- Per-record processing time remains constant, but batch processing time increases
Diagnostic Steps:
1. Monitor connector metrics:
   - Batch size
   - Records per batch
   - Average message size from Kafka topics
2. Check connector configuration: TBD
3. Analyze payload characteristics (sample queries after this list):
   - Row width (number of columns)
   - Data types and their sizes (e.g., large String fields)
   - Compression ratio
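For the payload analysis, the compression ratio and the widest columns can be read straight from ClickHouse; `target_table` is again a placeholder for the sink's destination table.

```sql
-- Overall on-disk size and compression ratio of the target table
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active AND table = 'target_table'
GROUP BY table;

-- Widest columns, to spot large String fields driving payload growth
SELECT name, type, formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE table = 'target_table'
ORDER BY data_uncompressed_bytes DESC
LIMIT 10;
```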
Common Causes:
- Batch size too large for available memory
- Message size growth (e.g., new fields added to schema)
- Inefficient message format or serialization/deserialization overhead
- Network bandwidth saturation
- Memory pressure causing GC pauses
Resolution Strategies:
- Tune batch size parameters to balance throughput and latency (example settings after this list)
- Increase tasks.max to parallelize inserts
- Enable compression in Kafka and ClickHouse
- Optimize schema (e.g., use appropriate data types, normalize large fields)
- Scale Connect worker resources
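A sketch of the consumer-side knobs that shape batch size, using only standard Kafka Connect properties; the values are illustrative, the consumer.override.* prefix works only if the worker's client override policy allows it, and the ClickHouse sink's own batching options are deliberately left as TBD above.

```properties
# Parallelism: useful up to the number of partitions in the source topic(s)
tasks.max=4

# Per-connector consumer overrides (standard Kafka Connect mechanism, illustrative values)
# Larger fetches/polls hand bigger batches to each sink task per put() call
consumer.override.max.poll.records=10000
consumer.override.fetch.min.bytes=1048576
consumer.override.max.partition.fetch.bytes=10485760
```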
2.3 Client Timeouts and Error Handling
Symptoms:
- Intermittent spikes in insert time
- Error logs showing timeouts or retries
- Consumer lag spikes followed by recovery
- Increased retry attempts
Diagnostic Steps:
1. Review Connect worker logs for:
   - Timeout exceptions
   - Connection reset errors
   - HTTP client errors (if using HTTP protocol)
   - SSL/TLS handshake failures
2. Check connector error metrics:
   - sink-record-send-total vs sink-record-active-count
   - Task failure count
   - Retry attempts
3. Network diagnostics:
   - Packet loss between Connect worker and ClickHouse
   - DNS resolution issues
   - Firewall/security group changes
   - Load balancer health
Common Causes:
- Network instability or misconfiguration
- ClickHouse connection pool exhaustion
- Timeout settings too aggressive
- ClickHouse overloaded and rejecting connections
- Transient errors not handled gracefully
- Dead letter queue issues causing task restarts
Resolution Strategies:
- Tune timeout settings: TODO
- Implement exponential backoff
- Configure appropriate connection pool size
- Enable error tolerance and DLQ (errors.tolerance=all, errors.deadletterqueue.topic.name); see the example after this list
- Monitor and alert on network issues
- Use connection keep-alive settings
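A minimal sketch of the framework-level error handling and DLQ settings referenced above; the topic name and retry values are illustrative.

```properties
# Tolerate individual record failures instead of failing the whole task
errors.tolerance=all

# Route failed records to a dead letter queue topic (name is an example)
errors.deadletterqueue.topic.name=clickhouse-sink-dlq
errors.deadletterqueue.topic.replication.factor=3
errors.deadletterqueue.context.headers.enable=true

# Retry transient failures with capped backoff before giving up on a record
errors.retry.timeout=60000
errors.retry.delay.max.ms=10000

# Log failing records to help diagnose timeout/retry patterns
errors.log.enable=true
errors.log.include.messages=true
```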
3. Additional Factors to Consider
3.1 Kafka Consumer-Side Issues
- Consumer group rebalancing
- Fetch size and throughput from Kafka
- Topic partition count vs tasks.max alignment
- Offset commit frequency impacting throughput
3.2 Connect Worker Resource Constraints
- JVM heap pressure and GC pauses
- CPU saturation on Connect worker
- Thread pool exhaustion
- Memory leaks in connector or converters
3.3 Data Transformation Overhead
- Complex Single Message Transforms (SMTs)
- Inefficient converters (Avro, JSON, Protobuf)
- Schema registry latency
3.4 ClickHouse Table Design Issues
- Inappropriate PRIMARY KEY / ORDER BY
- Missing or inefficient indexes
- Codec selection for compression
- TTL rules causing excessive mutations (see the example table definition after this list)
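To make these concrete, here is a sketch of a table definition touching each point; the table name, columns, codecs, partitioning, and TTL are illustrative assumptions rather than a recommended schema.

```sql
CREATE TABLE events
(
    -- Delta + ZSTD compresses monotonically increasing timestamps well
    event_time DateTime CODEC(Delta, ZSTD(1)),
    user_id    UInt64,
    -- LowCardinality suits enum-like String columns
    event_type LowCardinality(String),
    -- Heavier compression for a wide String field
    payload    String CODEC(ZSTD(3))
)
ENGINE = MergeTree
-- Coarse partitions keep the parts count per insert manageable
PARTITION BY toYYYYMM(event_time)
-- ORDER BY (the primary key) should match the dominant query filters
ORDER BY (event_type, user_id, event_time)
-- Row TTL; dropping whole expired parts avoids TTL-driven mutations
TTL event_time + INTERVAL 90 DAY
SETTINGS ttl_only_drop_parts = 1;
```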
4. Monitoring and Observability
Key Metrics to Track:
- Connector-Level Metrics (JMX): TODO
5. Step-by-Step Troubleshooting Workflow
6. Configuration Best Practices
Include a section with recommended configuration templates for:
- High-throughput scenarios
- Low-latency scenarios
- High-availability setups
- Resource-constrained environments
Common anti-patterns to avoid:
- Setting batch size too small (increases insert frequency)
- Not enabling compression
- Using too few tasks for high-volume topics
- Using the ClickHouse Async INSERT feature for latency-sensitive scenarios
- Not monitoring parts count
- Not using DLQ