Commit 1d6313e

Merge remote-tracking branch 'apache/main' into fix/string-audit-followups
2 parents 0a0c869 + 4c88f5d commit 1d6313e

259 files changed

Lines changed: 13036 additions & 3495 deletions

.claude/skills/audit-comet-expression/SKILL.md

Lines changed: 135 additions & 13 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,6 @@
11
---
22
name: audit-comet-expression
3-
description: Audit an existing Comet expression for correctness and test coverage. Studies the Spark implementation across versions 3.4.3, 3.5.8, and 4.0.1, reviews the Comet and DataFusion implementations, identifies missing test coverage, and offers to implement additional tests.
3+
description: Audit an existing Comet expression for correctness and test coverage. Studies the Spark implementation across versions 3.4.3, 3.5.8, 4.0.1, and 4.1.1, reviews the Comet and DataFusion implementations, identifies missing test coverage, and offers to implement additional tests.
44
argument-hint: <expression-name>
55
---
66

@@ -10,7 +10,7 @@ Audit the Comet implementation of the `$ARGUMENTS` expression for correctness an
1010

1111
This audit covers:
1212

13-
1. Spark implementation across versions 3.4.3, 3.5.8, and 4.0.1
13+
1. Spark implementation across versions 3.4.3, 3.5.8, 4.0.1, and 4.1.1
1414
2. Comet Scala serde implementation
1515
3. Comet Rust / DataFusion implementation
1616
4. Existing test coverage (Comet SQL Tests and Comet Scala Tests)
@@ -24,7 +24,7 @@ Clone specific Spark version tags (use shallow clones to avoid polluting the wor
2424

2525
```bash
2626
set -eu -o pipefail
27-
for tag in v3.4.3 v3.5.8 v4.0.1; do
27+
for tag in v3.4.3 v3.5.8 v4.0.1 v4.1.1; do
2828
dir="/tmp/spark-${tag}"
2929
if [ ! -d "$dir" ]; then
3030
git clone --depth 1 --branch "$tag" https://github.com/apache/spark.git "$dir"
@@ -37,7 +37,7 @@ done
3737
Search the Catalyst SQL expressions source:
3838

3939
```bash
40-
for tag in v3.4.3 v3.5.8 v4.0.1; do
40+
for tag in v3.4.3 v3.5.8 v4.0.1 v4.1.1; do
4141
dir="/tmp/spark-${tag}"
4242
echo "=== $tag ==="
4343
find "$dir/sql/catalyst/src/main/scala" -name "*.scala" | \
@@ -48,7 +48,7 @@ done
4848
If the expression is not found in catalyst, also check core:
4949

5050
```bash
51-
for tag in v3.4.3 v3.5.8 v4.0.1; do
51+
for tag in v3.4.3 v3.5.8 v4.0.1 v4.1.1; do
5252
dir="/tmp/spark-${tag}"
5353
echo "=== $tag ==="
5454
find "$dir/sql" -name "*.scala" | \
@@ -73,6 +73,7 @@ Produce a concise diff summary of what changed between:
7373

7474
- 3.4.3 → 3.5.8
7575
- 3.5.8 → 4.0.1
76+
- 4.0.1 → 4.1.1
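
The pairwise comparison listed above could be scripted roughly as follows. This is only a sketch: the helper names `version_pairs` and `diff_between_versions` are hypothetical, and it assumes the shallow clones live under `/tmp/spark-<tag>` as in the clone loop shown earlier in the skill.

```bash
#!/usr/bin/env bash
# Tags in release order; each adjacent pair is one comparison.
tags=(v3.4.3 v3.5.8 v4.0.1 v4.1.1)

# Print "old new" for every consecutive pair of tags (hypothetical helper).
version_pairs() {
  local i
  for ((i = 0; i + 1 < ${#tags[@]}; i++)); do
    echo "${tags[i]} ${tags[i + 1]}"
  done
}

# Diff one source file (path relative to the Spark checkout) across each pair.
# diff exits non-zero when the files differ, so its status is deliberately ignored.
diff_between_versions() {
  local rel_path="$1" old new
  while read -r old new; do
    echo "=== $old -> $new ==="
    diff -u "/tmp/spark-${old}/${rel_path}" "/tmp/spark-${new}/${rel_path}" || true
  done < <(version_pairs)
}
```

Deriving the pairs from the tag array means that adding a future tag extends the comparison list without editing the bullets by hand.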
7677

7778
Pay attention to:
7879

@@ -87,7 +88,7 @@ Pay attention to:
8788
## Step 2: Locate the Spark Tests
8889

8990
```bash
90-
for tag in v3.4.3 v3.5.8 v4.0.1; do
91+
for tag in v3.4.3 v3.5.8 v4.0.1 v4.1.1; do
9192
dir="/tmp/spark-${tag}"
9293
echo "=== $tag ==="
9394
find "$dir/sql" -name "*.scala" -path "*/test/*" | \
@@ -208,6 +209,11 @@ and in EXPLAIN output, so they are user-facing.
208209
- Use backticks around config keys, type names, and SQL identifiers.
209210
- Link to a tracking GitHub issue for known incompatibilities so users can
210211
follow progress: `(https://github.com/apache/datafusion-comet/issues/NNNN)`.
212+
**Verify the issue exists and is open** before citing it
213+
(`gh issue view <N> --repo apache/datafusion-comet`). Issue numbers
214+
invented from context or recalled from memory are a recurring failure
215+
mode: a stale link is worse than no link because the reader follows it
216+
and finds nothing.
211217
- Keep it concise. Single sentence is best.
212218
- Do not write "Incompatible reason: ..." or "Unsupported because ...".
213219
The doc generator adds the framing.
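
The "verify the issue exists and is open" check added in this hunk could be wrapped in a small guard like the one below. It is a sketch under assumptions: `issue_is_open` is a hypothetical helper name, and it presumes an authenticated `gh` CLI whose `--json state` output reports `OPEN` for open issues.

```bash
# Succeeds only when the issue exists and its state is OPEN.
# `issue_is_open` is a hypothetical helper, not part of the skill.
issue_is_open() {
  local number="$1" state
  # A missing issue makes `gh` fail; treat that the same as "not open".
  state="$(gh issue view "$number" --repo apache/datafusion-comet --json state --jq '.state')" || return 1
  [ "$state" = "OPEN" ]
}
```

A caller would run `issue_is_open <N>` before writing the link into a reason string, and drop or replace the link when the check fails.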
@@ -389,6 +395,49 @@ finding for Step 6.
389395
read like internal implementation notes ("DataFusion probes the longer
390396
side") or that mismatch their support level (an "Incompatible" reason
391397
that says "X is not supported").
398+
10. **Expression-shape restrictions live in `getSupportLevel`.** Any
399+
restriction that is knowable from the expression alone (literal-only
400+
arguments, unsupported child data type, foldable-only options, a
401+
specific operator shape) must be declared as an
402+
`Unsupported(Some(reason))` branch in `getSupportLevel`, not gated
403+
inside `convert` with a `withInfo` + `return None`. Putting the
404+
check in `convert` means EXPLAIN surfaces the reason only at
405+
conversion time, the doc generator never sees it, and the
406+
dispatcher cannot route around it. The literal-only `len`
407+
restrictions on `CometLeft`, `CometRight`, and `CometSubstring`
408+
are the canonical example of the in-`convert` pattern that this
409+
skill forbids: lift them into `getSupportLevel`.
410+
11. **Spark 4.0 collation divergences are flagged, not glossed over.**
411+
If the Spark 4.0/4.1 implementation routes through
412+
`CollationSupport.X.exec(..., collationId)` (or uses
413+
`StringTypeWithCollation` / `StringTypeNonCSAICollation` for input
414+
types) and the Comet path does not propagate collation, the
415+
expression is `Incompatible` for non-default collations. Mark the
416+
branch `Incompatible(Some(reason))` linking to the collation
417+
umbrella issue
418+
(https://github.com/apache/datafusion-comet/issues/4496) so the
419+
follow-up sweep can find every site. "Behaviour unchanged for
420+
`UTF8_BINARY`" alone is not a justification for leaving the
421+
support level at `Compatible`: users running with non-default
422+
collations get silently wrong answers.
423+
12. **Known divergences flip the support level.** If you find yourself
424+
writing the words "Known divergence" or "Known limitation" in the
425+
support-doc sub-bullet while leaving `getSupportLevel` returning
426+
`Compatible`, the audit has skipped its job. A documented
427+
divergence is by definition not `Compatible`. Promote the branch
428+
to `Incompatible(Some(reason))` (or `Unsupported` if the native
429+
path errors rather than producing a wrong answer) and link the
430+
tracking issue. The `replace` empty-search-string divergence with
431+
DataFusion is the canonical example of this anti-pattern.
432+
13. **Unreachable serde mappings are removed.** Expressions registered
433+
as `RuntimeReplaceable` (or otherwise rewritten by an analyzer
434+
rule before serde) never reach `QueryPlanSerde.exprToProtoInternal`
435+
with their original class. If the audit finds that a registered
436+
`CometScalarFunction("name")` or `CometExpressionSerde` entry can
437+
never be hit (e.g. the `btrim` mapping for `StringTrimBoth`, which
438+
is rewritten to `StringTrim` before serde runs), delete the
439+
registration in the same audit PR. Documenting the dead code in
440+
the support doc is not enough.
392441

393442
---
394443

@@ -411,6 +460,11 @@ For an untested case, run it manually first to determine the current
411460
behaviour, then commit either a regression test (passes) or a
412461
`query ignore(<issue-url>)` test (fails).
413462

463+
Every item in this bucket either becomes an inline fix + test, or a
464+
filed GitHub issue + ignored regression test, in the audit PR. Step 7
465+
spells out the workflow: never leave a high-priority finding as PR-body
466+
prose only.
467+
414468
### Medium priority: missing test coverage
415469

416470
Low-risk coverage gaps: additional input permutations on already-tested
@@ -429,9 +483,13 @@ etc. These come from the Step 5 consistency audit.
429483

430484
High-priority findings (correctness divergences and high-risk coverage
431485
gaps) and consistency issues from Step 5 / Step 6 must not be left as
432-
prose. Apply them in the same PR as the audit. Only low-risk missing
433-
coverage requires the user's go-ahead, because adding tests for cases
434-
that already work on well-exercised paths is incremental polish.
486+
prose. Apply them in the same PR as the audit. Anything you cannot fix
487+
inline (because it needs a semantics decision, native code change, or
488+
larger design work) must still be captured as a GitHub issue per the
489+
"Findings that need follow-up" section below: prose recommendations in
490+
the PR body alone are insufficient. Only low-risk missing coverage
491+
requires the user's go-ahead, because adding tests for cases that
492+
already work on well-exercised paths is incremental polish.
435493

436494
### High-priority findings: capture as tests
437495

@@ -499,16 +557,80 @@ need user approval. The classes of fix are:
499557
- Hoist a reason shared by multiple serdes (e.g. a recurring
500558
TimestampNTZ caveat) into a small `private object` companion in the
501559
same file, mirroring `UTCTimestampSerde`.
560+
- Lift expression-shape restrictions (literal-only argument, foldable
561+
child, unsupported child data type) out of `convert`'s `withInfo` +
562+
`return None` and into an `Unsupported(Some(reason))` branch in
563+
`getSupportLevel`. The `convert` body should then assume the
564+
precondition holds and stop calling `withInfo` for that case.
565+
- Promote a documented "Known divergence" or "Known limitation" sub-
566+
bullet from a `Compatible` branch to `Incompatible(Some(reason))`
567+
(or `Unsupported` if the native path errors) and link the tracking
568+
issue. The sub-bullet stays as user-facing documentation. The
569+
support level catches up to match.
570+
- Mark expressions whose Spark 4.0+ path routes through
571+
`CollationSupport.X.exec` (or accepts `StringTypeWithCollation` /
572+
`StringTypeNonCSAICollation`) as `Incompatible(Some(reason))` for
573+
non-default collations, linking
574+
https://github.com/apache/datafusion-comet/issues/4496.
575+
- Delete unreachable serde registrations (`RuntimeReplaceable` rewrites
576+
the expression before serde runs, an analyzer rule strips the case,
577+
etc.) rather than documenting them as a curiosity.
502578

503579
Each fix is one of these patterns. If a finding requires a semantics
504580
decision (e.g. is a specific branch really `Unsupported`, or is it
505-
`Incompatible`?), do not guess: leave it as a prose recommendation in
506-
the PR description and call it out for the reviewer.
581+
`Incompatible`?), do not guess: **file a GitHub issue per the
582+
"Findings that need follow-up" section below** and link it from the PR
583+
description. Do not leave the recommendation as prose only: prose in a
584+
PR description gets buried as soon as the PR merges.
507585

508586
After every fix, build the affected module to make sure the edit
509587
compiles. Do not run the full suite; targeted tests suffice if the
510588
fix could plausibly affect behaviour.
511589

590+
### Findings that need follow-up: always file a tracking issue
591+
592+
Any high-priority finding (correctness divergence, robustness gap,
593+
behavioural difference from Spark, missing-collation guard, etc.) that
594+
this PR does not fix inline **must** be filed as a GitHub issue before
595+
the PR is opened. This includes:
596+
597+
- Semantics decisions the audit surfaces but should not unilaterally
598+
resolve (e.g. promote `Compatible` to `Incompatible`, change a
599+
default).
600+
- Architectural concerns that span multiple expressions (e.g. Spark 4.0
601+
collation propagation across an entire family).
602+
- Bugs that are fixable in principle but need more design or native
603+
changes than fit in the audit PR.
604+
- Documentation gaps surfaced by the audit (e.g. an expression that
605+
doesn't appear in the auto-generated compatibility doc).
606+
607+
For each follow-up:
608+
609+
1. Search for an existing issue first
610+
(`gh issue list --search "<expression> <symptom> in:title,body" --state all --limit 5`).
611+
If a candidate match comes back, **open it
612+
(`gh issue view <N> --repo apache/datafusion-comet`) and confirm the
613+
title and body actually describe the divergence you found, and that
614+
the issue is still `OPEN`**. A closed-but-fixed issue cited as
615+
"known divergence" is worse than no citation, because the reader
616+
follows the link and finds a fix that was already shipped. If it
617+
matches, link it from the PR description and the support-doc
618+
sub-bullet, and stop.
619+
2. If no issue exists, file one with `gh issue create` using the
620+
`correctness` label (or `documentation` for doc-only gaps) plus any
621+
relevant area labels (e.g. `spark 4.0`). Title format:
622+
`[Bug] <expression> <one-line symptom>`. Body includes: Spark version
623+
range affected, a minimal repro, the divergent result, the relevant
624+
Comet file/line, and a one-line note that the issue was surfaced by
625+
this audit PR.
626+
3. Reference the issue number from both the support-doc sub-bullet and
627+
the PR description "What changes are included" section so reviewers
628+
can see what work the audit intentionally deferred.
629+
630+
Prose-only "future work" notes in the PR description are not enough.
631+
The whole point of the audit is to leave behind durable artefacts; an
632+
unfiled finding evaporates after the PR merges.
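
Step 2 of the follow-up list above can be sketched as a few shell helpers. The function names are hypothetical, the label names are taken from the text above (whether they exist in the repository is assumed, not verified), and the body text is supplied by the caller.

```bash
# Build the title in the "[Bug] <expression> <one-line symptom>" format.
issue_title() {
  printf '[Bug] %s %s\n' "$1" "$2"
}

# Pick the primary label: `documentation` for doc-only gaps, else `correctness`.
issue_label() {
  if [ "$1" = "doc" ]; then echo "documentation"; else echo "correctness"; fi
}

# File the issue; `gh issue create` prints the new issue URL on success.
# Assumes an authenticated `gh` CLI.
file_followup() {
  local kind="$1" expression="$2" symptom="$3" body="$4"
  gh issue create --repo apache/datafusion-comet \
    --title "$(issue_title "$expression" "$symptom")" \
    --label "$(issue_label "$kind")" \
    --body "$body"
}
```

Keeping the title and label logic in separate pure functions makes the format easy to check without touching the network.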
633+
512634
### Low-risk missing test coverage: ask the user
513635

514636
This is the only Step 7 category that pauses for user input. It only
@@ -592,7 +714,7 @@ used for the expression in `docs/source/user-guide/latest/expressions.md`.
592714
other pages.
593715
- Add (or update) a `## <function_name>` section, keeping sections alphabetically ordered.
594716
- Under that heading, add one bullet per Spark version checked, each including:
595-
- Spark version (e.g. 3.4.3, 3.5.8, 4.0.1)
717+
- Spark version (e.g. 3.4.3, 3.5.8, 4.0.1, 4.1.1)
596718
- Today's date
597719
- A brief note for any version-specific finding (behavioral difference, known
598720
incompatibility); omit the note if nothing notable.
@@ -604,7 +726,7 @@ used for the expression in `docs/source/user-guide/latest/expressions.md`.
604726
Present the audit as:
605727

606728
1. **Expression Summary** - Brief description of what `$ARGUMENTS` does, its input/output types, and null behavior
607-
2. **Spark Version Differences** - Summary of any behavioral or API differences across Spark 3.4.3, 3.5.8, and 4.0.1
729+
2. **Spark Version Differences** - Summary of any behavioral or API differences across Spark 3.4.3, 3.5.8, 4.0.1, and 4.1.1
608730
3. **Comet Implementation Notes** - Summary of how Comet implements this expression and any concerns
609731
4. **Coverage Gap Analysis** - The gap table from Step 5, plus implementation gaps
610732
5. **Recommendations** - Prioritized list from Step 6

.github/workflows/ci.yml

Lines changed: 14 additions & 1 deletion
@@ -21,8 +21,13 @@
2121

2222
name: CI
2323

24+
# A `labeled` event (e.g. the run-spark-*-tests gates, or dependabot's automatic
25+
# `dependencies` label added ~1s after open) fires at the same commit as the
26+
# opened/synchronize run. Keying the group on the label name keeps labeled runs
27+
# in their own subgroup so they never cancel the real commit run; opened and
28+
# synchronize both map to `commit` so a new push still supersedes its predecessor.
2429
concurrency:
25-
group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }}
30+
group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }}-${{ github.event.action == 'labeled' && github.event.label.name || 'commit' }}
2631
cancel-in-progress: true
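
To see how the new last segment of the group key resolves, the expression can be mirrored in shell. This is an illustrative sketch, not part of the workflow: `group_suffix` is a hypothetical name, and the empty-label fallback models how `&&` / `||` in a GitHub Actions expression falls through to `'commit'` when the left side is falsy.

```bash
# Mirror of the expression's last segment:
#   github.event.action == 'labeled' && github.event.label.name || 'commit'
group_suffix() {
  local action="$1" label="${2:-}"
  if [ "$action" = "labeled" ] && [ -n "$label" ]; then
    echo "$label"   # labeled runs get their own subgroup, keyed by label name
  else
    echo "commit"   # opened, synchronize, etc. share one subgroup
  fi
}

group_suffix opened                 # prints: commit
group_suffix synchronize            # prints: commit
group_suffix labeled dependencies   # prints: dependencies
```

Because `opened` and `synchronize` map to the same suffix, a new push still cancels the previous commit run, while a `labeled` run lands in a different group and cancels nothing.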
2732

2833
on:
@@ -44,6 +49,14 @@ jobs:
4449
preflight:
4550
name: Preflight
4651
runs-on: ubuntu-slim
52+
# On a `labeled` event, only proceed for the gating labels. Any other label
53+
# (e.g. dependabot's `dependencies`) skips the whole pipeline rather than
54+
# spawning a redundant run alongside the opened/synchronize one.
55+
if: >-
56+
github.event_name != 'pull_request' ||
57+
github.event.action != 'labeled' ||
58+
github.event.label.name == 'run-spark-3.4-tests' ||
59+
github.event.label.name == 'run-spark-4.1-tests'
4760
steps:
4861
- uses: actions/checkout@v6
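
The preflight `if:` condition added in this hunk reads as a chain of ORs. A shell rendering of the same logic, offered as a sketch with the hypothetical name `should_run_preflight`, makes the truth table easy to check:

```bash
# Returns success when the Preflight job would run, mirroring the `if:` above.
should_run_preflight() {
  local event_name="$1" action="${2:-}" label="${3:-}"
  [ "$event_name" != "pull_request" ] ||
    [ "$action" != "labeled" ] ||
    [ "$label" = "run-spark-3.4-tests" ] ||
    [ "$label" = "run-spark-4.1-tests" ]
}

should_run_preflight push && echo "push: run"
should_run_preflight pull_request opened && echo "opened: run"
should_run_preflight pull_request labeled run-spark-4.1-tests && echo "gating label: run"
should_run_preflight pull_request labeled dependencies || echo "other label: skip"
```

Only a `labeled` pull request event with a non-gating label fails every clause, which is the one case the comment says should skip the pipeline.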
4962

.github/workflows/codeql.yml

Lines changed: 2 additions & 2 deletions
@@ -49,11 +49,11 @@ jobs:
4949
persist-credentials: false
5050

5151
- name: Initialize CodeQL
52-
uses: github/codeql-action/init@7211b7c8077ea37d8641b6271f6a365a22a5fbfa # v4
52+
uses: github/codeql-action/init@8aad20d150bbac5944a9f9d289da16a4b0d87c1e # v4
5353
with:
5454
languages: actions
5555

5656
- name: Perform CodeQL Analysis
57-
uses: github/codeql-action/analyze@7211b7c8077ea37d8641b6271f6a365a22a5fbfa # v4
57+
uses: github/codeql-action/analyze@8aad20d150bbac5944a9f9d289da16a4b0d87c1e # v4
5858
with:
5959
category: "/language:actions"

.github/workflows/pr_build_linux.yml

Lines changed: 6 additions & 1 deletion
@@ -59,7 +59,7 @@ jobs:
5959
- uses: actions/checkout@v6
6060

6161
- name: Setup coursier
62-
uses: coursier/setup-action@v1
62+
uses: coursier/setup-action@v3
6363
with:
6464
jvm: temurin:21
6565

@@ -361,6 +361,7 @@ jobs:
361361
org.apache.spark.sql.CometToPrettyStringSuite
362362
org.apache.spark.sql.CometCollationSuite
363363
org.apache.comet.CometFuzzAggregateSuite
364+
org.apache.spark.sql.comet.execution.arrow.CometArrowStreamSuite
364365
- name: "expressions"
365366
value: |
366367
org.apache.comet.CometExpressionSuite
@@ -377,15 +378,19 @@ jobs:
377378
org.apache.comet.CometMapExpressionSuite
378379
org.apache.comet.CometCsvExpressionSuite
379380
org.apache.comet.CometJsonExpressionSuite
381+
org.apache.comet.CometJsonJvmSuite
380382
org.apache.comet.SparkErrorConverterSuite
381383
org.apache.comet.expressions.conditional.CometIfSuite
382384
org.apache.comet.expressions.conditional.CometCoalesceSuite
383385
org.apache.comet.expressions.conditional.CometCaseWhenSuite
386+
org.apache.comet.CometRegExpJvmSuite
384387
org.apache.comet.CometCodegenSuite
385388
org.apache.comet.CometCodegenSourceSuite
386389
org.apache.comet.CometCodegenHOFSuite
387390
org.apache.comet.CometFuzzMathSuite
388391
org.apache.comet.CometCodegenFuzzSuite
392+
org.apache.comet.CometStringDecodeSuite
393+
org.apache.comet.CometWidthBucketSuite
389394
fail-fast: false
390395
name: ${{ matrix.profile.name }} [${{ matrix.suite.name }}]
391396
runs-on: ubuntu-24.04

.github/workflows/pr_build_macos.yml

Lines changed: 5 additions & 0 deletions
@@ -177,6 +177,7 @@ jobs:
177177
org.apache.spark.sql.CometToPrettyStringSuite
178178
org.apache.spark.sql.CometCollationSuite
179179
org.apache.comet.CometFuzzAggregateSuite
180+
org.apache.spark.sql.comet.execution.arrow.CometArrowStreamSuite
180181
- name: "expressions"
181182
value: |
182183
org.apache.comet.CometExpressionSuite
@@ -193,15 +194,19 @@ jobs:
193194
org.apache.comet.CometMapExpressionSuite
194195
org.apache.comet.CometCsvExpressionSuite
195196
org.apache.comet.CometJsonExpressionSuite
197+
org.apache.comet.CometJsonJvmSuite
196198
org.apache.comet.SparkErrorConverterSuite
197199
org.apache.comet.expressions.conditional.CometIfSuite
198200
org.apache.comet.expressions.conditional.CometCoalesceSuite
199201
org.apache.comet.expressions.conditional.CometCaseWhenSuite
202+
org.apache.comet.CometRegExpJvmSuite
200203
org.apache.comet.CometCodegenSuite
201204
org.apache.comet.CometCodegenSourceSuite
202205
org.apache.comet.CometCodegenHOFSuite
203206
org.apache.comet.CometFuzzMathSuite
204207
org.apache.comet.CometCodegenFuzzSuite
208+
org.apache.comet.CometStringDecodeSuite
209+
org.apache.comet.CometWidthBucketSuite
205210
206211
fail-fast: false
207212
name: ${{ matrix.os }}/${{ matrix.profile.name }} [${{ matrix.suite.name }}]

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -27,3 +27,4 @@ output
2727
docs/comet-*/
2828
docs/build/
2929
docs/temp/
30+
docs/superpowers/
