[SPARK-52505][K8S] Allow to create executor kubernetes service #96

Open

EnricoMi wants to merge 469 commits into master from k8s-executor-service

Conversation


@EnricoMi EnricoMi commented Jun 13, 2025

What changes were proposed in this pull request?

This allows an executor to register its block manager with the driver via a Kubernetes service name rather than the pod IP, so that the driver and other executors connect to the executor's block manager through that service.

Why are the changes needed?

In Kubernetes, connecting to an evicted (decommissioned) executor times out after 2 minutes (default). Executors connect to other executors synchronously (one at a time), so this timeout accumulates across executor peers. An executor that reads from many decommissioned executors blocks for a multiple of the timeout until it fails with a fetch failure.

This can be fixed by binding the block manager to a fixed port, defining a Kubernetes service for that block manager port, and having the executor register that service with the driver. The driver and other executors then connect to the service name and fail instantly with a connection refused error if the executor has been decommissioned and the service removed.

Setting `spark.kubernetes.executor.enableService=true` and defining `spark.blockManager.port` will perform this setup for each executor.
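As a minimal sketch of the configuration described above (the port number is a hypothetical choice; any free fixed port works):

```python
# Hypothetical configuration sketch: enable the per-executor Kubernetes
# service and pin the block manager to a fixed port so the service can
# target it. The port value 7777 is an arbitrary example.
conf = {
    "spark.kubernetes.executor.enableService": "true",
    "spark.blockManager.port": "7777",
}

# These entries would typically be passed via SparkConf or as `--conf`
# flags on spark-submit.
for key, value in conf.items():
    print(f"--conf {key}={value}")
```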

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@EnricoMi EnricoMi force-pushed the k8s-executor-service branch 3 times, most recently from 0673d0d to c4818e3 on June 17, 2025 08:02
@EnricoMi EnricoMi changed the title from "Add executor feature step to create executor kubernetes service" to "[SPARK-52505][K8S] Allow to create executor kubernetes service" on Jun 17, 2025
@EnricoMi EnricoMi force-pushed the k8s-executor-service branch from c4818e3 to 6bfacf0 on September 2, 2025 05:12
@EnricoMi EnricoMi force-pushed the k8s-executor-service branch from 6bfacf0 to 52b69d8 on November 24, 2025 08:24
tugce-applied and others added 14 commits February 17, 2026 22:14
… and MutableColumnarRow

### What changes were proposed in this pull request?

`ColumnarBatchRow.copy()` and `MutableColumnarRow.copy()`/`get()` do not handle `VariantType`, causing a `RuntimeException: Not implemented. VariantType` when using `VariantType` columns with streaming custom data sources that rely on columnar batch row copying.

PR apache#53137 (SPARK-54427) added `VariantType` support to `ColumnarRow` but missed `ColumnarBatchRow` and `MutableColumnarRow`. PR apache#54006 attempted this fix but was closed.

This patch adds:
- `PhysicalVariantType` branch in `ColumnarBatchRow.copy()`
- `VariantType` branch in `MutableColumnarRow.copy()` and `get()`
- Test in `ColumnarBatchSuite` validating `VariantVal` round-trip through `copy()`

### Why are the changes needed?

Without this fix, any streaming pipeline that returns `VariantType` columns from a custom columnar data source throws a runtime exception when Spark attempts to copy the batch row.

### Does this PR introduce _any_ user-facing change?

No. This is a bug fix for an existing feature.

### How was this patch tested?

Added a new test `[SPARK-55552] Variant` in `ColumnarBatchSuite` that creates a `VariantType` column vector, populates it with `VariantVal` data (including a null), wraps it in a `ColumnarBatchRow`, calls `copy()`, and verifies the values round-trip correctly.

### Was this patch authored or co-authored using generative AI tooling?

Yes. GitHub Copilot was used to assist in drafting portions of this contribution.

This contribution is my original work and I license it under the Apache 2.0 license.

Closes apache#54337 from tugceozberk/fix-columnar-batch-row-variant-copy.

Authored-by: tugce-applied <tugce@appliedcomputing.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…y and is_geography

### What changes were proposed in this pull request?

Add `field.metadata is not None` check in `is_geometry()` and `is_geography()` before accessing `field.metadata` with the `in` operator.

### Why are the changes needed?

PyArrow struct fields have `metadata=None` by default. When `from_arrow_type()` encounters a struct with a field named `wkb` that has no metadata, `is_geometry()` / `is_geography()` crash with `TypeError: argument of type 'NoneType' is not iterable` because the code does `b"geometry" in field.metadata` without checking for `None` first.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Verified with a dedicated unit test (see commit a3fbd9b) that constructs a PyArrow struct with `wkb` and `srid` fields without metadata, confirming the `TypeError` before the fix and a clean pass after. The test was then removed (commit da08fdc) to reduce test burden since the fix is a trivial one-line defensive check.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#54295 from Yicong-Huang/SPARK-55507/fix/is-geometry-none-metadata.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…g for SQL functions

### What changes were proposed in this pull request?

Propagate the new grouping "sketch_funcs" to the rest of the files where datasketches are used. Reference PR: apache#54061

### Why are the changes needed?

apache#54061 created a new group called `sketch_funcs` and moved the datasketches functions out of the `misc_funcs` group. The new grouping had not been propagated through the whole codebase, so this PR does that.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

Existing tests

### Was this patch authored or co-authored using generative AI tooling?

N/A

Closes apache#54231 from cboumalh/SPARK-55279-sketch-funcs-group-followup.

Lead-authored-by: Chris Boumalhab <cboumalh@amazon.com>
Co-authored-by: Chris Boumalhab <84485659+cboumalh@users.noreply.github.com>
Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
…r applyInPandas

### What changes were proposed in this pull request?

Optimize the non-iterator `applyInPandas` path by merging Arrow batches at the Arrow level before converting to pandas, instead of converting each batch individually and reassembling via per-column `pd.concat`.

Changes:
- **`GroupPandasUDFSerializer.load_stream`**: Yield raw `Iterator[pa.RecordBatch]` instead of converting to pandas per-batch via `ArrowBatchTransformer.to_pandas`.
- **Non-iterator mapper**: Collect all Arrow batches → `pa.Table.from_batches().combine_chunks()` → convert to pandas once for the entire group.
- **`wrap_grouped_map_pandas_udf`**: Simplified to accept a list of pandas Series directly.
- **Iterator mapper**: Split into its own `elif` branch; still converts batches lazily via `ArrowBatchTransformer.to_pandas` per-batch.

### Why are the changes needed?

Follow-up to SPARK-55459. After SPARK-54316 consolidated the grouped-map serializer, the non-iterator `applyInPandas` lost its efficient Arrow-level batch merge and instead converts each batch to pandas individually, then reassembles via per-column `pd.concat`. This PR restores the Arrow-level merge so that all batches within a group are merged into a single `pa.Table` and converted to pandas once, rather than N times (once per batch).

A pure-Python microbenchmark (335 groups × 100K rows × 5 columns, 7 runs each):

| Approach | avg | min | vs Master |
|---|---|---|---|
| Master (per-batch convert + per-column concat) | 0.544s | 0.543s | baseline |
| This PR (Arrow-level merge + single convert) | 0.489s | 0.487s | **10% faster** |

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing `applyInPandas` tests (`test_pandas_grouped_map.py`, `test_pandas_cogrouped_map.py`).

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54327 from Yicong-Huang/SPARK-55529/optimize-apply-in-pandas-arrow-merge.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
### What changes were proposed in this pull request?

This PR aims to upgrade `MySQL` image to `9.6.0` for Apache Spark 4.2.0.

### Why are the changes needed?

To keep our JDBC test coverage up to date by testing against the latest version of `MySQL`.
- https://dev.mysql.com/doc/relnotes/mysql/9.6/en/ (2026-01-20)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Gemini 3 Pro (High)` on `Antigravity`

Closes apache#54346 from dongjoon-hyun/SPARK-55572.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…o-int unsafe cast golden file mismatches

### What changes were proposed in this pull request?

Add hardcoded ARM (aarch64/arm64) overrides in `_overrides_unsafe()` for 34 float-to-integer unsafe cast golden file mismatches. Also rename `_version_overrides_safe/unsafe` to `_overrides_safe/unsafe` since overrides now cover both version and platform differences.

### Why are the changes needed?

The PyArrow array cast golden files are generated on x86. On ARM, unsafe float-to-integer casts produce different results due to IEEE 754 implementation-defined behavior:

- **ARM FCVT instructions** saturate on overflow: `inf` → type max, `-inf` → type min, `nan` → 0
- **x86 SSE/AVX** returns "integer indefinite" values (typically `0x80...0`)
- **Negative float → unsigned int**: ARM saturates to 0, x86 may wrap

This caused `test_scalar_cast_matrix_unsafe` to fail on Ubuntu 24.04 ARM CI runners with 34 mismatches across three categories:
- `float*:standard` → `uint*` (negative float -1.5 → unsigned int)
- `float*:special` → `int*` (inf/nan → signed int)
- `float*:special` → `uint*` (inf/nan → unsigned int)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Verified with [ARM CI run](https://github.com/Yicong-Huang/spark/actions/runs/21966674224) on my fork.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#54288 from Yicong-Huang/SPARK-55405/fix/arm-float-cast-overrides.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…ogroupPandasUDFSerializer`

### What changes were proposed in this pull request?

Pass explicit Spark schema (derived from each Arrow table's schema via `from_arrow_schema`) to `ArrowBatchTransformer.to_pandas()` in `CogroupPandasUDFSerializer.load_stream()`, instead of passing `None` (the inherited `_input_type`).

### Why are the changes needed?

`CogroupPandasUDFSerializer` is constructed without `input_type`, so `_input_type` defaults to `None`. When `to_pandas()` receives `schema=None`, it infers the Spark schema from the Arrow batch internally via `from_arrow_type()`. This works, but:

1. The same `None` is used for both left and right DataFrames, which is conceptually wrong since they can have different schemas.
2. The schema inference is implicit rather than explicit.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests in `test_pandas_cogrouped_map.py`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54293 from Yicong-Huang/SPARK-55506/fix/cogroup-input-type.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

Removed test files from ignore list of ruff for `F403` and `F401`.

### Why are the changes needed?

We used to need this because our tests used `from ... import *`. We no longer do (with a few rare cases that can be ignored inline).

While these warnings were ignored, some real unused imports accumulated. This PR clears them.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54251 from gaogaotiantian/fix-test-import.

Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

* Replaced `parameterized` with standard logic
* Add `test_dl_util` to `modules.py`

### Why are the changes needed?

The third-party library `parameterized` is unnecessary: it is not used by any other test, and this test can run without it. With the dependency removed, the test can be added to `modules.py`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Locally it passed.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54286 from gaogaotiantian/fix-test-dl-util.

Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

* Add all dangling test files to `modules.py`
* There are two helper Python files whose names start with `test_`, the naming convention reserved for Python test files. Rename those files and update the caller code accordingly.

### Why are the changes needed?

Tests that exist in the repository should actually be run.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#54274 from gaogaotiantian/add-dangling-tests.

Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

This PR aims to enable `GitHub Issues` feature.

### Why are the changes needed?

This will enable `GitHub Issues` feature after the vote passes.

[[VOTE] Open github issues as an experimental option](https://lists.apache.org/thread/kv11qlr8j05cwqjoyddybclwcn0nv2n7).

### Does this PR introduce _any_ user-facing change?

No Spark behavior change.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54334 from dongjoon-hyun/SPARK-55547.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…d revise test suites

### What changes were proposed in this pull request?

This PR aims to upgrade `MariaDB` JDBC driver to `3.5.7` from `2.7.12` for Apache Spark 4.2.0.

### Why are the changes needed?

To keep the JDBC test coverage up to date with the latest `MariaDB` JDBC driver.
- https://mariadb.com/docs/release-notes/connectors/java/3.5/3.5.7 (2025-12-07)
- https://mariadb.com/docs/release-notes/connectors/java/3.4/3.4.2 (2025-03-27)
- https://mariadb.com/docs/release-notes/connectors/java/3.3/3.3.4 (2025-03-27)
- https://mariadb.com/docs/release-notes/connectors/java/3.2/3.2.0 (2023-08-20)
- https://mariadb.com/docs/release-notes/connectors/java/3.1/3.1.4 (2023-05-01)
- https://mariadb.com/docs/release-notes/connectors/java/3.0/3.0.11 (2023-08-25)

Note that
- `MariaDB JDBC Client v3` was released 5 years ago (May 2021) and fixed the old issue that caused a conflict with the `MySQL` JDBC driver.
- In other words, the `v3` client only accepts `jdbc:mariadb`. We need test coverage for `v3`.
- Also, there is a legacy configuration, `permitMysqlScheme`, that preserves the old behavior. We use it here to minimize the size of this PR; we can remove `permitMysqlScheme` later.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Gemini 3 Pro (High)` on `Antigravity`

Closes apache#54348 from dongjoon-hyun/SPARK-55574.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?

This PR aims to upgrade `MySQL` JDBC driver to `9.6.0`.

### Why are the changes needed?

To keep JDBC test coverage up to date with the latest MySQL JDBC driver.
- https://dev.mysql.com/doc/relnotes/connector-j/en/news-9-6-0.html (2026-01-21)
- https://dev.mysql.com/doc/relnotes/connector-j/en/news-9-5-0.html (2025-10-22)
- https://dev.mysql.com/doc/relnotes/connector-j/en/news-9-4-0.html (2025-07-22)
- https://dev.mysql.com/doc/relnotes/connector-j/en/news-9-3-0.html (2025-04-15)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Gemini 3 Pro (High)` on `Antigravity`

Closes apache#54349 from dongjoon-hyun/SPARK-55575.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…port `sleepBeforeTesting`

### What changes were proposed in this pull request?

This PR aims to improve `DockerJDBCIntegrationSuite` to support `sleepBeforeTesting`.

### Why are the changes needed?

Some DBMS images enter the `Running` status even when they are not yet ready to accept connections. This new API helps us handle those images.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#54355 from dongjoon-hyun/SPARK-55578.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@EnricoMi EnricoMi force-pushed the k8s-executor-service branch from ed2d724 to f43d5b9 on February 18, 2026 06:59
@EnricoMi EnricoMi force-pushed the k8s-executor-service branch 2 times, most recently from 97d3004 to 5783d88 on February 19, 2026 15:46
@EnricoMi EnricoMi force-pushed the k8s-executor-service branch from 5783d88 to a02f739 on February 19, 2026 20:18