Skip to content

Commit 8d6bfbc

Browse files
committed
docs: remove unverified timeout scenarios and NLB troubleshooting from Spark connector
Remove the Common timeout scenarios section and the Connection resets on AWS PrivateLink or NLB troubleshooting entry. The NLB broken pipe scenario could not be reproduced from EMR against the staging endpoint (public HTTPS path, no NLB in the TCP path), so this content is not verified and should not be published.
1 parent 7aa14a8 commit 8d6bfbc

1 file changed

Lines changed: 16 additions & 86 deletions

File tree

docs/integrations/data-ingestion/apache-spark/spark-native-connector.md

Lines changed: 16 additions & 86 deletions
Original file line numberDiff line numberDiff line change
@@ -1490,7 +1490,7 @@ Alternatively, set them in `spark-defaults.conf` or when creating the Spark sess
14901490
| spark.clickhouse.read.fixedStringAs | binary | Read ClickHouse FixedString type as the specified Spark data type. Supported types: binary, string | 0.8.0 |
14911491
| spark.clickhouse.read.format | json | Serialize format for reading. Supported formats: json, binary | 0.6.0 |
14921492
| spark.clickhouse.read.runtimeFilter.enabled | false | Enable runtime filter for reading. | 0.8.0 |
1493-
| spark.clickhouse.read.settings | | Comma-separated list of ClickHouse server session settings to apply on read, e.g. `max_execution_time=300,max_memory_usage=10000000000`. These are passed as HTTP query parameters on every read request. Must be set via `spark.conf.set(...)` at the session level — passing via DataFrame `.option()` has no effect. | 0.9.0 |
1493+
| spark.clickhouse.read.settings | | ClickHouse settings appended as a `SETTINGS` clause to every generated read query, e.g. `final=1, max_execution_time=300`. The value is lowercased and inserted verbatim after `SETTINGS`, so it must be valid SQL. Must be set via `spark.conf.set(...)` at the session level — passing via DataFrame `.option()` has no effect. | 0.9.0 |
14941494
| spark.clickhouse.read.splitByPartitionId | true | If `true`, construct input partition filter by virtual column `_partition_id`, instead of partition value. There are known issues with assembling SQL predicates by partition value. This feature requires ClickHouse Server v21.6+ | 0.4.0 |
14951495
| spark.clickhouse.useNullableQuerySchema | false | If `true`, mark all the fields of the query schema as nullable when executing `CREATE/REPLACE TABLE ... AS SELECT ...` on creating the table. Note, this configuration requires SPARK-43390(available in Spark 3.5), w/o this patch, it always acts as `true`. | 0.8.0 |
14961496
| spark.clickhouse.write.batchSize | 10000 | The number of records per batch on writing to ClickHouse. | 0.1.0 |
@@ -1682,8 +1682,8 @@ With these settings, all reads and writes go through the single coordinator node
16821682

16831683
The connector uses ClickHouse's HTTP interface via the [ClickHouse Java client](https://github.com/ClickHouse/clickhouse-java). There are two independent ways to pass settings:
16841684

1685-
- **`option.clickhouse_setting_<name>`** — the first-class mechanism for passing ClickHouse server session settings through the Java client. Applies to all operations for the lifetime of the catalog.
1686-
- **`spark.clickhouse.read.settings`**per-read override: a comma-separated list of `key=value` server settings applied only to read requests.
1685+
- **`option.clickhouse_setting_<name>`** — the first-class mechanism for passing ClickHouse server settings through the Java client. Applied to every client operation (reads and writes) for the lifetime of the catalog.
1686+
- **`option.http_header_<name>`**adds a custom HTTP header to every request the Java client sends to ClickHouse. Useful for setting ClickHouse-recognised headers (e.g. `X-ClickHouse-Quota`) or for passing headers expected by a proxy or gateway in front of the server. Applied to every operation for the lifetime of the catalog.
16871687

16881688
### Via Catalog API {#query-settings-catalog-api}
16891689

@@ -1695,18 +1695,24 @@ spark.sql.catalog.clickhouse.option.clickhouse_setting_wait_for_async_insert 1
16951695
spark.sql.catalog.clickhouse.option.clickhouse_setting_max_execution_time 300
16961696
```
16971697

1698-
To enable TLS, set the `protocol` direct catalog property instead of an `option.*` key:
1698+
To add custom HTTP headers to every request, use `option.http_header_<name>`. Headers recognised by ClickHouse itself (e.g. `X-ClickHouse-Quota`) will be honoured by the server; any other header is only useful if a proxy or gateway in front of ClickHouse is configured to read it.
16991699

17001700
```text
1701-
spark.sql.catalog.clickhouse.protocol https
1702-
spark.sql.catalog.clickhouse.http_port 8443
1701+
spark.sql.catalog.clickhouse.option.http_header_x_clickhouse_quota spark_etl_quota
1702+
```
1703+
1704+
To enable TLS, set `option.ssl=true` and point `http_port` at the HTTPS endpoint:
1705+
1706+
```text
1707+
spark.sql.catalog.clickhouse.option.ssl true
1708+
spark.sql.catalog.clickhouse.http_port 8443
17031709
```
17041710

17051711
For the full list of available Java client option keys, see the [Java client configuration documentation](https://clickhouse.com/docs/integrations/language-clients/java/client#configuration). For available ClickHouse server session settings, see the [settings documentation](https://clickhouse.com/docs/operations/settings/settings).
17061712

17071713
### Via TableProvider API {#query-settings-tableprovider-api}
17081714

1709-
Use `spark.clickhouse.read.settings` to apply ClickHouse server settings to reads. This setting is read from the Spark session config — passing it as a DataFrame `.option()` has no effect. It applies to all reads in the session until unset.
1715+
Use `spark.clickhouse.read.settings` to apply ClickHouse settings to reads. The value is appended as a `SETTINGS` clause to every generated read query, so it must be valid SQL (e.g. `final=1, max_execution_time=300`). This setting is read from the Spark session config — passing it as a DataFrame `.option()` has no effect. It applies to all reads in the session until unset.
17101716

17111717
To scope the settings to a single read, set the config, force the action, then unset:
17121718

@@ -1847,7 +1853,7 @@ The connector enforces its own timeouts that are independent of any Spark or Cli
18471853

18481854
The 60-second query cap is enforced by the connector regardless of `max_execution_time` or any other server setting. If a read query takes longer than 60 seconds end-to-end, the connector will abort it. There is no `spark.clickhouse.*` setting to override this value.
18491855

1850-
Inserts have no connector-level timeout. This means a stalled or very slow insert will hang until a network device terminates the connection — which can produce a **"Broken pipe"** error (see [Connection resets on AWS PrivateLink or NLB](#troubleshooting-connection-resets) below).
1856+
Inserts have no connector-level timeout. This means a stalled or very slow insert will hang until a network device terminates the connection — which can produce a **"Broken pipe"** error.
18511857

18521858
### Java client connection timeouts {#timeout-java-client}
18531859

@@ -1856,7 +1862,7 @@ These are passed through to the underlying ClickHouse Java client using the `opt
18561862
| Catalog property | Default | Unit | What it controls |
18571863
|---|---|---|---|
18581864
| `option.socket_timeout` | `0` (unlimited) | ms | Deadline for each TCP read/write operation. `0` means no deadline — the call blocks indefinitely if the server stops responding. |
1859-
| `option.connection_timeout` | not set | ms | Time allowed to establish a new TCP connection to ClickHouse. |
1865+
| `option.connection_timeout` | not set (Java client default) | ms | Time allowed to establish a new TCP connection to ClickHouse. |
18601866
| `option.connection_ttl` | `-1` (unlimited) | ms | Maximum lifetime of a pooled connection. After this time, the connection is retired and not reused. **Set this below the idle timeout of any network appliance between Spark and ClickHouse** (e.g., `300000` for AWS NLB). |
18611867
| `option.http_keep_alive_timeout` | server default | ms | Overrides the HTTP keep-alive timeout. Set to `0` to disable keep-alive on network paths with aggressive idle-connection cutoffs. |
18621868

@@ -1869,27 +1875,6 @@ These are ClickHouse query settings sent with each request. They instruct the Cl
18691875
| `option.clickhouse_setting_max_execution_time` | `0` (unlimited) | seconds | Server-side hard cap on query execution time. Useful for preventing runaway reads from consuming server resources, but **does not override the connector's 60-second query timeout**. |
18701876
| `option.clickhouse_setting_session_timeout` | `60` | seconds | HTTP session lifetime on the server. |
18711877

1872-
### Common timeout scenarios {#timeout-scenarios}
1873-
1874-
**Long-running reads (>60 seconds)**
1875-
1876-
The connector's hard-coded 60-second query timeout will abort reads that take longer, even if the server could complete them. There is no configuration knob to extend or disable this cap. The only workaround is to reduce the amount of data returned by a single query so each read completes within 60 seconds:
1877-
1878-
- Add `WHERE` predicates to filter rows before they reach Spark.
1879-
- Use partition pruning so the connector issues one query per partition rather than one large scan.
1880-
1881-
**Insert "Broken pipe" or "Connection reset" errors**
1882-
1883-
Typically caused by a network appliance (AWS NLB, PrivateLink, corporate proxy) silently dropping idle connections. The connector's connection pool holds connections open between operations; if the appliance's idle timeout (350 seconds for AWS NLB) expires, the connection is dropped at the network level without notification. The connector then tries to reuse a dead connection and gets a broken pipe.
1884-
1885-
```text
1886-
spark.sql.catalog.clickhouse.option.connection_ttl 300000
1887-
spark.sql.catalog.clickhouse.option.socket_timeout 300000
1888-
spark.sql.catalog.clickhouse.option.http_keep_alive_timeout 0
1889-
```
1890-
1891-
Setting `connection_ttl` below the appliance's idle timeout forces the connector to retire connections proactively. See [Connection resets on AWS PrivateLink or NLB](#troubleshooting-connection-resets) for full details.
1892-
18931878
## Performance tuning {#performance-tuning}
18941879

18951880
### Read performance {#read-performance}
@@ -1908,10 +1893,9 @@ Setting `connection_ttl` below the appliance's idle timeout forces the connector
19081893
| Tuning | Configuration | Notes |
19091894
|---|---|---|
19101895
| **Increase batch size** | `spark.clickhouse.write.batchSize=50000` | Default is 10,000. Larger batches reduce round-trips. |
1911-
| **Arrow write format** | `spark.clickhouse.write.format=arrow` (default) | Arrow is faster than JSON for most data types. Use `json` only for Variant/JSON column types. |
1896+
| **Arrow write format** | `spark.clickhouse.write.format=arrow` (default) | Arrow is faster than JSON for most data types. On Spark 4.0 the Arrow writer handles `VariantType` columns via an internal JSON-string conversion; on Spark 3.x, fall back to `json` if your table has Variant or JSON columns. |
19121897
| **Compression** | `spark.clickhouse.write.compression.codec=lz4` (default) | Reduces network transfer during writes. |
19131898
| **Repartition by partition** | `spark.clickhouse.write.repartitionByPartition=true` (default) | Groups rows by partition before writing, reducing the number of parts created. |
1914-
| **Explicit partition count** | `spark.clickhouse.write.repartitionNum=N` | Forces Spark to repartition to exactly N partitions before writing (via `requiredNumPartitions()`). Default `0` means no requirement. Alternatively, use `df.repartition(N)` before the write call. |
19151899

19161900
### Recommended starting configuration for bulk loads {#bulk-load-config}
19171901

@@ -1952,60 +1936,6 @@ spark.conf.set("spark.clickhouse.write.repartitionByPartition", "true")
19521936
If you also have `spark.clickhouse.write.distributed.convertLocal=true`, ignoring unsupported sharding keys can cause incorrect data distribution. In that case, either use a supported sharding key or set `spark.clickhouse.write.distributed.convertLocal.allowUnsupportedSharding=true` only if you have verified your data distribution — rows may be written to the wrong shard, causing data skew or incorrect query results when querying shards directly.
19531937
:::
19541938

1955-
---
1956-
1957-
### Table not found or stale schema after DDL {#troubleshooting-stale-schema}
1958-
1959-
**Symptom**: After creating or altering a ClickHouse table, Spark still sees the old schema or raises a "table not found" error.
1960-
1961-
**Cause**: Spark caches catalog metadata. For ClickHouse Cloud, replication to all nodes may also take a moment.
1962-
1963-
**Fix**:
1964-
```python
1965-
spark.catalog.refreshTable("clickhouse.database.table_name")
1966-
```
1967-
1968-
---
1969-
1970-
### Connection resets on AWS PrivateLink or NLB {#troubleshooting-connection-resets}
1971-
1972-
**Symptom**: Reads or writes intermittently fail with `Broken pipe`, `Connection reset by peer`, or `java.io.IOException: Connection closed` — particularly on longer-running jobs or after a period of inactivity.
1973-
1974-
**Cause**: AWS Network Load Balancers (NLB) and PrivateLink silently drop TCP connections that have been idle for **350 seconds**. The connector's connection pool keeps HTTP connections alive between operations. When the NLB drops a pooled connection, neither the connector nor ClickHouse is notified. The next operation that picks up that connection fails immediately.
1975-
1976-
Inserts are especially vulnerable: there is no connector-level insert timeout, so a slow or large insert can stall mid-flight when the underlying connection is cut.
1977-
1978-
**Fix**: Retire pooled connections before the NLB does by setting `connection_ttl` below 350 seconds. Also set `socket_timeout` to bound individual operations and disable HTTP keep-alive to prevent stale connections from persisting in the pool:
1979-
1980-
```text
1981-
spark.sql.catalog.clickhouse.option.connection_ttl 300000
1982-
spark.sql.catalog.clickhouse.option.socket_timeout 300000
1983-
spark.sql.catalog.clickhouse.option.http_keep_alive_timeout 0
1984-
```
1985-
1986-
For very large inserts that are genuinely expected to take several minutes, switch to async inserts so the server acknowledges receipt quickly and the connector is not waiting on a long-lived open connection:
1987-
1988-
```text
1989-
spark.sql.catalog.clickhouse.option.clickhouse_setting_async_insert 1
1990-
spark.sql.catalog.clickhouse.option.clickhouse_setting_wait_for_async_insert 1
1991-
```
1992-
1993-
---
1994-
1995-
### KEEPER_EXCEPTION during writes {#troubleshooting-keeper-exception}
1996-
1997-
**Symptom**: Writes fail with `Coordination::Exception: Can't get data for node ... node doesn't exist (KEEPER_EXCEPTION)`. Retries may produce duplicate data.
1998-
1999-
**Cause**: Too many concurrent Spark tasks overwhelm the ClickHouse replica's coordination layer (ClickHouse Keeper / ZooKeeper). This is common when the replica has limited CPU or memory relative to the write concurrency.
2000-
2001-
**Fix**:
2002-
- Reduce the number of Spark write tasks by repartitioning the DataFrame before writing:
2003-
```python
2004-
df.repartition(N).writeTo("clickhouse.db.table").append()
2005-
```
2006-
- Increase the ClickHouse replica's compute resources (CPU / memory).
2007-
- If using `ReplicatedMergeTree`, consider writing through a `Distributed` table to spread load across shards.
2008-
20091939
## Contributing and support {#contributing-and-support}
20101940

20111941
If you'd like to contribute to the project or report any issues, we welcome your input!

0 commit comments

Comments
 (0)