[SPARK-52505][K8S] Allow to create executor kubernetes service #96

Open

EnricoMi wants to merge 469 commits into master from k8s-executor-service

Conversation


@EnricoMi EnricoMi commented Jun 13, 2025

What changes were proposed in this pull request?

This allows an executor to register its block manager with the driver via a Kubernetes service name rather than the pod IP, so that the driver and other executors connect to the executor's block manager through that service.

Why are the changes needed?

In Kubernetes, connecting to an evicted (decommissioned) executor times out after 2 minutes (default). Executors connect to other executors synchronously (one at a time), so this timeout accumulates across executor peers. An executor that reads from many decommissioned executors blocks for a multiple of the timeout until it fails with a fetch failure.

This can be fixed by binding the block manager to a fixed port, defining a Kubernetes service for that block manager port, and having the executor register that service with the driver. The driver and other executors then connect to the service name and fail instantly with a connection refused error if the executor has been decommissioned and the service removed.

Setting `spark.kubernetes.executor.enableService=true` and defining `spark.blockManager.port` will perform this setup for each executor.
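As a minimal sketch of the configuration described above (the port number is a hypothetical choice; any free fixed port works):

```python
# Hypothetical configuration sketch: enable the per-executor Kubernetes
# service and pin the block manager to a fixed port so the service can
# target it. The port value 7777 is an arbitrary example.
conf = {
    "spark.kubernetes.executor.enableService": "true",
    "spark.blockManager.port": "7777",
}

# These entries would typically be passed via SparkConf or as `--conf`
# flags on spark-submit.
for key, value in conf.items():
    print(f"--conf {key}={value}")
```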

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@EnricoMi EnricoMi force-pushed the k8s-executor-service branch 3 times, most recently from 0673d0d to c4818e3 on June 17, 2025 08:02
@EnricoMi EnricoMi changed the title from "Add executor feature step to create executor kubernetes service" to "[SPARK-52505][K8S] Allow to create executor kubernetes service" on Jun 17, 2025
@EnricoMi EnricoMi force-pushed the k8s-executor-service branch from c4818e3 to 6bfacf0 on September 2, 2025 05:12
@EnricoMi EnricoMi force-pushed the k8s-executor-service branch from 6bfacf0 to 52b69d8 on November 24, 2025 08:24
tugce-applied and others added 14 commits February 17, 2026 22:14
… and MutableColumnarRow

### What changes were proposed in this pull request?

`ColumnarBatchRow.copy()` and `MutableColumnarRow.copy()`/`get()` do not handle `VariantType`, causing a `RuntimeException: Not implemented. VariantType` when using `VariantType` columns with streaming custom data sources that rely on columnar batch row copying.

PR apache#53137 (SPARK-54427) added `VariantType` support to `ColumnarRow` but missed `ColumnarBatchRow` and `MutableColumnarRow`. PR apache#54006 attempted this fix but was closed.

This patch adds:
- `PhysicalVariantType` branch in `ColumnarBatchRow.copy()`
- `VariantType` branch in `MutableColumnarRow.copy()` and `get()`
- Test in `ColumnarBatchSuite` validating `VariantVal` round-trip through `copy()`

### Why are the changes needed?

Without this fix, any streaming pipeline that returns `VariantType` columns from a custom columnar data source throws a runtime exception when Spark attempts to copy the batch row.

### Does this PR introduce _any_ user-facing change?

No. This is a bug fix for an existing feature.

### How was this patch tested?

Added a new test `[SPARK-55552] Variant` in `ColumnarBatchSuite` that creates a `VariantType` column vector, populates it with `VariantVal` data (including a null), wraps it in a `ColumnarBatchRow`, calls `copy()`, and verifies the values round-trip correctly.

### Was this patch authored or co-authored using generative AI tooling?

Yes. GitHub Copilot was used to assist in drafting portions of this contribution.

This contribution is my original work and I license it under the Apache 2.0 license.

Closes apache#54337 from tugceozberk/fix-columnar-batch-row-variant-copy.

Authored-by: tugce-applied <tugce@appliedcomputing.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…y and is_geography

### What changes were proposed in this pull request?

Add `field.metadata is not None` check in `is_geometry()` and `is_geography()` before accessing `field.metadata` with the `in` operator.

### Why are the changes needed?

PyArrow struct fields have `metadata=None` by default. When `from_arrow_type()` encounters a struct with a field named `wkb` that has no metadata, `is_geometry()` / `is_geography()` crash with `TypeError: argument of type 'NoneType' is not iterable` because the code does `b"geometry" in field.metadata` without checking for `None` first.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Verified with a dedicated unit test (see commit a3fbd9b) that constructs a PyArrow struct with `wkb` and `srid` fields without metadata, confirming the `TypeError` before the fix and a clean pass after. The test was then removed (commit da08fdc) to reduce test burden since the fix is a trivial one-line defensive check.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#54295 from Yicong-Huang/SPARK-55507/fix/is-geometry-none-metadata.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…g for SQL functions

### What changes were proposed in this pull request?

Propagate the new grouping "sketch_funcs" to the rest of the files where datasketches are used. Reference PR: apache#54061

### Why are the changes needed?

apache#54061 created a new group called `sketch_funcs` and moved the datasketches functions out of the `misc_funcs` group. The new grouping had not been propagated through the whole codebase, so this PR does that.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

Existing tests

### Was this patch authored or co-authored using generative AI tooling?

N/A

Closes apache#54231 from cboumalh/SPARK-55279-sketch-funcs-group-followup.

Lead-authored-by: Chris Boumalhab <cboumalh@amazon.com>
Co-authored-by: Chris Boumalhab <84485659+cboumalh@users.noreply.github.com>
Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
…r applyInPandas

### What changes were proposed in this pull request?

Optimize the non-iterator `applyInPandas` path by merging Arrow batches at the Arrow level before converting to pandas, instead of converting each batch individually and reassembling via per-column `pd.concat`.

Changes:
- **`GroupPandasUDFSerializer.load_stream`**: Yield raw `Iterator[pa.RecordBatch]` instead of converting to pandas per-batch via `ArrowBatchTransformer.to_pandas`.
- **Non-iterator mapper**: Collect all Arrow batches → `pa.Table.from_batches().combine_chunks()` → convert to pandas once for the entire group.
- **`wrap_grouped_map_pandas_udf`**: Simplified to accept a list of pandas Series directly.
- **Iterator mapper**: Split into its own `elif` branch; still converts batches lazily via `ArrowBatchTransformer.to_pandas` per-batch.

### Why are the changes needed?

Follow-up to SPARK-55459. After SPARK-54316 consolidated the grouped-map serializer, the non-iterator `applyInPandas` lost its efficient Arrow-level batch merge and instead converts each batch to pandas individually, then reassembles via per-column `pd.concat`. This PR restores the Arrow-level merge so that all batches within a group are merged into a single `pa.Table` and converted to pandas once, rather than N times (once per batch).

A pure-Python microbenchmark (335 groups × 100K rows × 5 columns, 7 runs each):

| Approach | avg | min | vs Master |
|---|---|---|---|
| Master (per-batch convert + per-column concat) | 0.544s | 0.543s | baseline |
| This PR (Arrow-level merge + single convert) | 0.489s | 0.487s | **10% faster** |

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing `applyInPandas` tests (`test_pandas_grouped_map.py`, `test_pandas_cogrouped_map.py`).

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54327 from Yicong-Huang/SPARK-55529/optimize-apply-in-pandas-arrow-merge.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
### What changes were proposed in this pull request?

This PR aims to upgrade `MySQL` image to `9.6.0` for Apache Spark 4.2.0.

### Why are the changes needed?

To keep our JDBC test coverage up to date by testing against the latest version of `MySQL`.
- https://dev.mysql.com/doc/relnotes/mysql/9.6/en/ (2026-01-20)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Gemini 3 Pro (High)` on `Antigravity`

Closes apache#54346 from dongjoon-hyun/SPARK-55572.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…o-int unsafe cast golden file mismatches

### What changes were proposed in this pull request?

Add hardcoded ARM (aarch64/arm64) overrides in `_overrides_unsafe()` for 34 float-to-integer unsafe cast golden file mismatches. Also rename `_version_overrides_safe/unsafe` to `_overrides_safe/unsafe` since overrides now cover both version and platform differences.

### Why are the changes needed?

The PyArrow array cast golden files are generated on x86. On ARM, unsafe float-to-integer casts produce different results due to IEEE 754 implementation-defined behavior:

- **ARM FCVT instructions** saturate on overflow: `inf` → type max, `-inf` → type min, `nan` → 0
- **x86 SSE/AVX** returns "integer indefinite" values (typically `0x80...0`)
- **Negative float → unsigned int**: ARM saturates to 0, x86 may wrap

This caused `test_scalar_cast_matrix_unsafe` to fail on Ubuntu 24.04 ARM CI runners with 34 mismatches across three categories:
- `float*:standard` → `uint*` (negative float -1.5 → unsigned int)
- `float*:special` → `int*` (inf/nan → signed int)
- `float*:special` → `uint*` (inf/nan → unsigned int)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Verified with [ARM CI run](https://github.com/Yicong-Huang/spark/actions/runs/21966674224) on my fork.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#54288 from Yicong-Huang/SPARK-55405/fix/arm-float-cast-overrides.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…ogroupPandasUDFSerializer`

### What changes were proposed in this pull request?

Pass explicit Spark schema (derived from each Arrow table's schema via `from_arrow_schema`) to `ArrowBatchTransformer.to_pandas()` in `CogroupPandasUDFSerializer.load_stream()`, instead of passing `None` (the inherited `_input_type`).

### Why are the changes needed?

`CogroupPandasUDFSerializer` is constructed without `input_type`, so `_input_type` defaults to `None`. When `to_pandas()` receives `schema=None`, it infers the Spark schema from the Arrow batch internally via `from_arrow_type()`. This works, but:

1. The same `None` is used for both left and right DataFrames, which is conceptually wrong since they can have different schemas.
2. The schema inference is implicit rather than explicit.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests in `test_pandas_cogrouped_map.py`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54293 from Yicong-Huang/SPARK-55506/fix/cogroup-input-type.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

Removed test files from ignore list of ruff for `F403` and `F401`.

### Why are the changes needed?

We used to need this because our tests used `from ... import *`. We no longer do (with a few rare cases that can be ignored inline).

While these warnings were ignored, some real unused imports accumulated. This PR clears them.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54251 from gaogaotiantian/fix-test-import.

Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

* Replaced `parameterized` with standard logic
* Add `test_dl_util` to `modules.py`

### Why are the changes needed?

The third-party library `parameterized` is unnecessary: it is not used by any other test, and this test can run without it. With the dependency removed, the test can be added to `modules.py`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Locally it passed.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54286 from gaogaotiantian/fix-test-dl-util.

Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

* Add all dangling test files to `modules.py`
* There are two helper Python files whose names start with `test_`, the naming convention reserved for Python test files. Rename those files and update the caller code accordingly.

### Why are the changes needed?

Tests that exist in the repository should actually be run.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#54274 from gaogaotiantian/add-dangling-tests.

Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

This PR aims to enable `GitHub Issues` feature.

### Why are the changes needed?

This will enable `GitHub Issues` feature after the vote passes.

[[VOTE] Open github issues as an experimental option](https://lists.apache.org/thread/kv11qlr8j05cwqjoyddybclwcn0nv2n7).

### Does this PR introduce _any_ user-facing change?

No Spark behavior change.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54334 from dongjoon-hyun/SPARK-55547.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…d revise test suites

### What changes were proposed in this pull request?

This PR aims to upgrade `MariaDB` JDBC driver to `3.5.7` from `2.7.12` for Apache Spark 4.2.0.

### Why are the changes needed?

To keep the JDBC test coverage up to date with the latest `MariaDB` JDBC driver.
- https://mariadb.com/docs/release-notes/connectors/java/3.5/3.5.7 (2025-12-07)
- https://mariadb.com/docs/release-notes/connectors/java/3.4/3.4.2 (2025-03-27)
- https://mariadb.com/docs/release-notes/connectors/java/3.3/3.3.4 (2025-03-27)
- https://mariadb.com/docs/release-notes/connectors/java/3.2/3.2.0 (2023-08-20)
- https://mariadb.com/docs/release-notes/connectors/java/3.1/3.1.4 (2023-05-01)
- https://mariadb.com/docs/release-notes/connectors/java/3.0/3.0.11 (2023-08-25)

Note that
- `MariaDB JDBC Client v3` was released 5 years ago (May 2021) and fixed the old issue that caused a conflict with the `MySQL` JDBC driver.
- In other words, the `v3` client only accepts `jdbc:mariadb`. We need test coverage for `v3`.
- Also, there is a legacy configuration, `permitMysqlScheme`, that preserves the old behavior. We use it here to minimize the size of this PR; we can remove `permitMysqlScheme` later.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Gemini 3 Pro (High)` on `Antigravity`

Closes apache#54348 from dongjoon-hyun/SPARK-55574.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?

This PR aims to upgrade `MySQL` JDBC driver to `9.6.0`.

### Why are the changes needed?

To keep JDBC test coverage up to date with the latest MySQL JDBC driver.
- https://dev.mysql.com/doc/relnotes/connector-j/en/news-9-6-0.html (2026-01-21)
- https://dev.mysql.com/doc/relnotes/connector-j/en/news-9-5-0.html (2025-10-22)
- https://dev.mysql.com/doc/relnotes/connector-j/en/news-9-4-0.html (2025-07-22)
- https://dev.mysql.com/doc/relnotes/connector-j/en/news-9-3-0.html (2025-04-15)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Gemini 3 Pro (High)` on `Antigravity`

Closes apache#54349 from dongjoon-hyun/SPARK-55575.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…port `sleepBeforeTesting`

### What changes were proposed in this pull request?

This PR aims to improve `DockerJDBCIntegrationSuite` to support `sleepBeforeTesting`.

### Why are the changes needed?

Some DBMS images enter the `Running` status even when they are not yet ready to accept connections. This new API helps us handle those images.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54355 from dongjoon-hyun/SPARK-55578.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@EnricoMi EnricoMi force-pushed the k8s-executor-service branch from ed2d724 to f43d5b9 on February 18, 2026 06:59
@EnricoMi EnricoMi force-pushed the k8s-executor-service branch 2 times, most recently from 97d3004 to 5783d88 on February 19, 2026 15:46
@EnricoMi EnricoMi force-pushed the k8s-executor-service branch from 5783d88 to a02f739 on February 19, 2026 20:18