Skip to content

Commit 70c8daf

Browse files
authored
Merge branch 'main' into fix/3429-spark4-ctas-union-native-writer
2 parents 0f535e3 + 271a4f4 commit 70c8daf

1,840 files changed

Lines changed: 90468 additions & 139788 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.asf.yaml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,7 @@ github:
4141
features:
4242
issues: true
4343
discussions: true
44+
projects: true
4445
protected_branches:
4546
main:
4647
required_pull_request_reviews:
Lines changed: 355 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,355 @@
1+
---
2+
name: audit-comet-expression
3+
description: Audit an existing Comet expression for correctness and test coverage. Studies the Spark implementation across versions 3.4.3, 3.5.8, and 4.0.1, reviews the Comet and DataFusion implementations, identifies missing test coverage, and offers to implement additional tests.
4+
argument-hint: <expression-name>
5+
---
6+
7+
Audit the Comet implementation of the `$ARGUMENTS` expression for correctness and test coverage.
8+
9+
## Overview
10+
11+
This audit covers:
12+
13+
1. Spark implementation across versions 3.4.3, 3.5.8, and 4.0.1
14+
2. Comet Scala serde implementation
15+
3. Comet Rust / DataFusion implementation
16+
4. Existing test coverage (Comet SQL Tests and Comet Scala Tests)
17+
5. Gap analysis and test recommendations
18+
19+
---
20+
21+
## Step 1: Locate the Spark Implementations
22+
23+
Clone specific Spark version tags (use shallow clones to avoid polluting the workspace). Only clone a version if it is not already present.
24+
25+
```bash
26+
set -eu -o pipefail
27+
for tag in v3.4.3 v3.5.8 v4.0.1; do
28+
dir="/tmp/spark-${tag}"
29+
if [ ! -d "$dir" ]; then
30+
git clone --depth 1 --branch "$tag" https://github.com/apache/spark.git "$dir"
31+
fi
32+
done
33+
```
34+
35+
### Find the expression class in each Spark version
36+
37+
Search the Catalyst SQL expressions source:
38+
39+
```bash
40+
for tag in v3.4.3 v3.5.8 v4.0.1; do
41+
dir="/tmp/spark-${tag}"
42+
echo "=== $tag ==="
43+
find "$dir/sql/catalyst/src/main/scala" -name "*.scala" | \
44+
xargs grep -l "case class $ARGUMENTS\b\|object $ARGUMENTS\b" 2>/dev/null
45+
done
46+
```
47+
48+
If the expression is not found in catalyst, also check core:
49+
50+
```bash
51+
for tag in v3.4.3 v3.5.8 v4.0.1; do
52+
dir="/tmp/spark-${tag}"
53+
echo "=== $tag ==="
54+
find "$dir/sql" -name "*.scala" | \
55+
xargs grep -l "case class $ARGUMENTS\b\|object $ARGUMENTS\b" 2>/dev/null
56+
done
57+
```
58+
59+
### Read the Spark source for each version
60+
61+
For each Spark version, read the expression file and note:
62+
63+
- The `eval`, `nullSafeEval`, and `doGenCode` / `doGenCodeSafe` methods
64+
- The `inputTypes` and `dataType` fields (accepted input types, return type)
65+
- Null handling strategy (`nullable`, `nullSafeEval`)
66+
- ANSI mode behavior (`ansiEnabled`, `failOnError`)
67+
- Special cases, guards, `require` assertions, and runtime exceptions
68+
- Any constants or configuration the expression reads
69+
70+
### Compare across Spark versions
71+
72+
Produce a concise diff summary of what changed between:
73+
74+
- 3.4.3 → 3.5.8
75+
- 3.5.8 → 4.0.1
76+
77+
Pay attention to:
78+
79+
- New input types added or removed
80+
- Behavior changes for edge cases (null, overflow, empty, boundary)
81+
- New ANSI mode branches
82+
- New parameters or configuration
83+
- Breaking API changes that Comet must shim
84+
85+
---
86+
87+
## Step 2: Locate the Spark Tests
88+
89+
```bash
90+
for tag in v3.4.3 v3.5.8 v4.0.1; do
91+
dir="/tmp/spark-${tag}"
92+
echo "=== $tag ==="
93+
find "$dir/sql" -name "*.scala" -path "*/test/*" | \
94+
xargs grep -l "$ARGUMENTS" 2>/dev/null
95+
done
96+
```
97+
98+
Read the relevant Spark test files and produce a list of:
99+
100+
- Input types covered
101+
- Edge cases exercised (null, empty, overflow, negative, boundary values, special characters, etc.)
102+
- ANSI mode tests
103+
- Error cases
104+
105+
This list will be the reference for the coverage gap analysis in Step 5.
106+
107+
---
108+
109+
## Step 3: Locate the Comet Implementation
110+
111+
### Scala serde
112+
113+
```bash
114+
# Find the serde object
115+
grep -r "$ARGUMENTS" spark/src/main/scala/org/apache/comet/serde/ --include="*.scala" -l
116+
grep -r "$ARGUMENTS" spark/src/main/scala/org/apache/comet/ --include="*.scala" -l
117+
```
118+
119+
Read the serde implementation and check:
120+
121+
- Which Spark versions the serde handles
122+
- Whether `getSupportLevel` is implemented and accurate
123+
- Whether all input types are handled
124+
- Whether any types are explicitly marked `Unsupported`
125+
- Whether `getIncompatibleReasons()` and `getUnsupportedReasons()` are overridden.
126+
`getSupportLevel` controls runtime fallback, but `GenerateDocs` reads these two
127+
methods to build the Compatibility Guide. If `getSupportLevel` returns
128+
`Incompatible(Some(...))` or `Unsupported(Some(...))` but the corresponding
129+
`get*Reasons()` method is not overridden, the reason will not appear in the
130+
published docs. Expect both to return a `Seq[String]` containing the same
131+
reason text used in `getSupportLevel`. Follow the pattern in
132+
`spark/src/main/scala/org/apache/comet/serde/structs.scala::CometStructsToJson`
133+
or `spark/src/main/scala/org/apache/comet/serde/datetime.scala::CometHour`:
134+
extract the reason as a `private val` and reference it from both places.
135+
136+
### Shims
137+
138+
```bash
139+
find spark/src/main -name "CometExprShim.scala" | xargs grep -l "$ARGUMENTS" 2>/dev/null
140+
```
141+
142+
If shims exist, read them and note any version-specific handling.
143+
144+
### Rust / DataFusion implementation
145+
146+
```bash
147+
# Search for the function in native/spark-expr
148+
grep -r "$ARGUMENTS" native/spark-expr/src/ --include="*.rs" -l
149+
grep -r "$ARGUMENTS" native/core/src/ --include="*.rs" -l
150+
```
151+
152+
If the expression delegates to DataFusion, find it there too. Set `$DATAFUSION_SRC` to a local DataFusion checkout, or fall back to searching the cargo registry:
153+
154+
```bash
155+
if [ -n "${DATAFUSION_SRC:-}" ]; then
156+
grep -r "$ARGUMENTS" "$DATAFUSION_SRC" --include="*.rs" -l 2>/dev/null | head -10
157+
else
158+
# Fall back to cargo registry (may include unrelated crates)
159+
grep -r "$ARGUMENTS" ~/.cargo/registry/src/*/datafusion* --include="*.rs" -l 2>/dev/null | head -10
160+
fi
161+
```
162+
163+
Read the Rust implementation and check:
164+
165+
- Null handling (does it propagate nulls correctly?)
166+
- Overflow and underflow handling (returns `Err` vs panics)
167+
- Type dispatch (does it handle all types that Spark supports?)
168+
- ANSI / fail-on-error mode
169+
170+
---
171+
172+
## Step 4: Locate Existing Comet Tests
173+
174+
### Comet SQL Tests
175+
176+
```bash
177+
# Find SQL test files for this expression
178+
find spark/src/test/resources/sql-tests/expressions/ -name "*.sql" | \
179+
xargs grep -l "$ARGUMENTS" 2>/dev/null
180+
181+
# Also check if there's a dedicated file
182+
find spark/src/test/resources/sql-tests/expressions/ -name "*$(echo $ARGUMENTS | tr '[:upper:]' '[:lower:]')*"
183+
```
184+
185+
Read every SQL test file found and list:
186+
187+
- Table schemas and data values used
188+
- Queries exercised
189+
- Query modes used (`query`, `spark_answer_only`, `tolerance`, `ignore`, `expect_error`)
190+
- Any ConfigMatrix directives
191+
192+
### Comet Scala Tests
193+
194+
```bash
195+
grep -r "$ARGUMENTS" spark/src/test/scala/ --include="*.scala" -l
196+
```
197+
198+
Read the relevant Comet Scala Tests and list:
199+
200+
- Input types covered
201+
- Edge cases exercised
202+
- Whether constant folding is disabled for literal tests
203+
204+
---
205+
206+
## Step 5: Gap Analysis
207+
208+
Compare the Spark test coverage (Step 2) against the Comet test coverage (Step 4). Produce a structured gap report:
209+
210+
### Coverage matrix
211+
212+
For each of the following dimensions, note whether it is covered in Comet tests or missing:
213+
214+
| Dimension | Spark tests it | Comet SQL Test | Comet Scala Test | Gap? |
215+
| ------------------------------------------------------------------------------------------------------ | -------------- | -------------- | ---------------- | ---- |
216+
| Column reference argument(s) | | | | |
217+
| Literal argument(s) | | | | |
218+
| NULL input | | | | |
219+
| Empty string / empty array / empty map | | | | |
220+
| Array/map with NULL elements | | | | |
221+
| Zero, negative zero, negative values (numeric) | | | | |
222+
| Underflow, overflow | | | | |
223+
| Boundary values (INT_MIN, INT_MAX, Long.MinValue, minimum positive, etc.) | | | | |
224+
| NaN, Infinity, -Infinity, subnormal (float/double) | | | | |
225+
| Multibyte / special UTF-8 (composed vs decomposed, e.g. `é` U+00E9 vs `e` + U+0301, non-Latin scripts) | | | | |
226+
| ANSI mode (failOnError=true) | | | | |
227+
| Non-ANSI mode (failOnError=false) | | | | |
228+
| All supported input types | | | | |
229+
| Parquet dictionary encoding (ConfigMatrix) | | | | |
230+
| Cross-version behavior differences | | | | |
231+
232+
### Implementation gaps
233+
234+
Also review the Comet implementation (Step 3) against the Spark behavior (Step 1):
235+
236+
- Are there input types that Spark supports but `getSupportLevel` returns `Unsupported` without comment?
237+
- Are there behavioral differences that are NOT marked `Incompatible` but should be?
238+
- Are there behavioral differences between Spark versions that the Comet implementation does not account for (missing shim)?
239+
- Does the Rust implementation match the Spark behavior for all edge cases?
240+
- If `getSupportLevel` returns `Incompatible` or `Unsupported` with a reason, are `getIncompatibleReasons()` / `getUnsupportedReasons()` also overridden so the reason is picked up by `GenerateDocs` and appears in the Compatibility Guide?
241+
242+
---
243+
244+
## Step 6: Recommendations
245+
246+
Summarize findings as a prioritized list.
247+
248+
### High priority
249+
250+
Issues where Comet may silently produce wrong results compared to Spark.
251+
252+
### Medium priority
253+
254+
Missing test coverage for edge cases that could expose bugs.
255+
256+
### Low priority
257+
258+
Minor gaps, cosmetic improvements, or nice-to-have tests.
259+
260+
---
261+
262+
## Step 7: Offer to Implement Missing Tests
263+
264+
After presenting the gap analysis, ask the user:
265+
266+
> I found the following missing test cases. Would you like me to implement them?
267+
>
268+
> - [list each missing test case]
269+
>
270+
> I can add them as Comet SQL Tests in `spark/src/test/resources/sql-tests/expressions/<category>/$ARGUMENTS.sql`
271+
> (or as Comet Scala Tests in `CometExpressionSuite` for cases that require programmatic setup).
272+
273+
If the user says yes, implement the missing tests following the Comet SQL Tests format described in
274+
`docs/source/contributor-guide/sql-file-tests.md`. Prefer Comet SQL Tests over Comet Scala Tests.
275+
276+
### Comet SQL Tests template
277+
278+
```sql
279+
-- Licensed to the Apache Software Foundation (ASF) under one
280+
-- or more contributor license agreements. See the NOTICE file
281+
-- distributed with this work for additional information
282+
-- regarding copyright ownership. The ASF licenses this file
283+
-- to you under the Apache License, Version 2.0 (the
284+
-- "License"); you may not use this file except in compliance
285+
-- with the License. You may obtain a copy of the License at
286+
--
287+
-- http://www.apache.org/licenses/LICENSE-2.0
288+
--
289+
-- Unless required by applicable law or agreed to in writing,
290+
-- software distributed under the License is distributed on an
291+
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
292+
-- KIND, either express or implied. See the License for the
293+
-- specific language governing permissions and limitations
294+
-- under the License.
295+
296+
-- ConfigMatrix: parquet.enable.dictionary=false,true
297+
298+
statement
299+
CREATE TABLE test_$ARGUMENTS(...) USING parquet
300+
301+
statement
302+
INSERT INTO test_$ARGUMENTS VALUES
303+
(...),
304+
(NULL)
305+
306+
-- column argument
307+
query
308+
SELECT $ARGUMENTS(col) FROM test_$ARGUMENTS
309+
310+
-- literal arguments
311+
query
312+
SELECT $ARGUMENTS('value'), $ARGUMENTS(''), $ARGUMENTS(NULL)
313+
```
314+
315+
### Verify the tests pass
316+
317+
After implementing tests, tell the user how to run them:
318+
319+
```bash
320+
./mvnw test -DwildcardSuites="CometSqlFileTestSuite" -Dsuites="org.apache.comet.CometSqlFileTestSuite $ARGUMENTS" -Dtest=none
321+
```
322+
323+
---
324+
325+
## Step 8: Update the Expression Support Doc
326+
327+
After completing the audit (whether or not tests were added), add sub-bullets under the expression's
328+
entry in `docs/source/contributor-guide/spark_expressions_support.md`.
329+
330+
Add one sub-bullet per Spark version checked, each including:
331+
332+
- Spark version (e.g. 3.4.3, 3.5.8, 4.0.1)
333+
- Today's date
334+
- A brief note for any version-specific finding (behavioral difference, known incompatibility); omit if nothing notable
335+
336+
---
337+
338+
## Output Format
339+
340+
Present the audit as:
341+
342+
1. **Expression Summary** - Brief description of what `$ARGUMENTS` does, its input/output types, and null behavior
343+
2. **Spark Version Differences** - Summary of any behavioral or API differences across Spark 3.4.3, 3.5.8, and 4.0.1
344+
3. **Comet Implementation Notes** - Summary of how Comet implements this expression and any concerns
345+
4. **Coverage Gap Analysis** - The gap table from Step 5, plus implementation gaps
346+
5. **Recommendations** - Prioritized list from Step 6
347+
6. **Offer to add tests** - The prompt from Step 7
348+
349+
## Tone and Style
350+
351+
- Write in clear, concise prose
352+
- Use backticks around code references (function names, file paths, class names, types, config keys)
353+
- Avoid robotic or formulaic language
354+
- Be constructive and acknowledge what is already well-covered before raising gaps
355+
- Avoid em dashes and semicolons; use separate sentences instead

0 commit comments

Comments
 (0)