
Commit d7b0de9

Merge branch 'main' into arrays_zip
2 parents f7211e1 + bfdc0f6

35 files changed

Lines changed: 694 additions & 132 deletions

Lines changed: 297 additions & 0 deletions

---
name: review-comet-pr
description: Review a DataFusion Comet pull request for Spark compatibility and implementation correctness. Provides guidance to a reviewer rather than posting comments directly.
argument-hint: <pr-number>
---

Review Comet PR #$ARGUMENTS

## Before You Start

### Gather PR Metadata

Fetch the PR details to understand the scope:

```bash
gh pr view $ARGUMENTS --repo apache/datafusion-comet --json title,body,author,isDraft,state,files
```

### Review Existing Comments First

Before forming your review:

1. **Read all existing review comments** on the PR
2. **Check the conversation tab** for any discussion
3. **Avoid duplicating feedback** that others have already provided
4. **Build on existing discussions** rather than starting new threads on the same topic
5. **If you have no additional concerns beyond what's already discussed, say so**
6. **Ignore Copilot reviews** - do not reference or build upon comments from GitHub Copilot

```bash
# View existing comments on a PR
gh pr view $ARGUMENTS --repo apache/datafusion-comet --comments
```

---

## Review Workflow

### 1. Gather Context

Read the changed files and understand the area of the codebase being modified:

```bash
# View the diff
gh pr diff $ARGUMENTS --repo apache/datafusion-comet
```

For expression PRs, check how similar expressions are implemented in the codebase. Look at the serde files in `spark/src/main/scala/org/apache/comet/serde/` and Rust implementations in `native/spark-expr/src/`.

### 2. Read Spark Source (Expression PRs)

**For any PR that adds or modifies an expression, you must read the Spark source code to understand the canonical behavior.** This is the authoritative reference for what Comet must match.

1. **Clone or update the Spark repo:**

   ```bash
   # Clone if not already present (use /tmp to avoid polluting the workspace)
   if [ ! -d /tmp/spark ]; then
     git clone --depth 1 https://github.com/apache/spark.git /tmp/spark
   fi
   ```

2. **Find the expression implementation in Spark:**

   ```bash
   # Search for the expression class (e.g., for "Conv", "Hex", "Substring")
   find /tmp/spark/sql/catalyst/src/main/scala -name "*.scala" | xargs grep -l "case class <ExpressionName>"
   ```

3. **Read the Spark implementation carefully.** Pay attention to:
   - The `eval` and `nullSafeEval`/`doGenCode` methods. These define the exact behavior.
   - The `inputTypes` and `dataType` fields. These define which types Spark accepts and what it returns.
   - Null handling. Does it use `nullable = true`? Does `nullSafeEval` handle nulls implicitly?
   - Special cases, guards, and `require` assertions.
   - ANSI mode branches (look for `SQLConf.get.ansiEnabled` or `failOnError`).

   The sketch below shows the typical shape to look for.

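   A minimal sketch of the shape many simple Spark expressions follow. The expression here is hypothetical, for illustration only, not a real Spark expression:

   ```scala
   import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes, UnaryExpression}
   import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
   import org.apache.spark.sql.types.{AbstractDataType, DataType, LongType, StringType}
   import org.apache.spark.unsafe.types.UTF8String

   // Hypothetical expression for illustration only.
   case class StringByteLength(child: Expression)
     extends UnaryExpression with ImplicitCastInputTypes with CodegenFallback {

     // Types Spark will accept; implicit casts are inserted for anything else
     override def inputTypes: Seq[AbstractDataType] = Seq(StringType)

     // The exact return type Comet must reproduce
     override def dataType: DataType = LongType

     // Called only for non-null input; Spark returns null for null input
     override protected def nullSafeEval(input: Any): Any =
       input.asInstanceOf[UTF8String].numBytes().toLong

     override protected def withNewChildInternal(newChild: Expression): Expression =
       copy(child = newChild)
   }
   ```
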
4. **Read the Spark tests for the expression:**

   ```bash
   # Find test files
   find /tmp/spark/sql -name "*.scala" -path "*/test/*" | xargs grep -l "<ExpressionName>"
   ```

5. **Compare the Spark behavior against the Comet implementation in the PR.** Identify:
   - Edge cases tested in Spark but not in the PR
   - Data types supported in Spark but not handled in the PR
   - Behavioral differences that should be marked `Incompatible`

6. **Suggest additional tests** for any edge cases or type combinations covered in Spark's tests that are missing from the PR's tests.

### 3. Spark Compatibility Check

**This is the most critical aspect of Comet reviews.** Comet must produce identical results to Spark.

For expression PRs, verify against the Spark source you read in step 2:

1. **Check edge cases**
   - Null handling
   - Overflow behavior
   - Empty input behavior
   - Type-specific behavior

2. **Verify all data types are handled**
   - Does Spark support this type? (Check `inputTypes` in Spark source)
   - Does the PR handle all Spark-supported types?

3. **Check for ANSI mode differences**
   - Spark behavior may differ between legacy and ANSI modes
   - PR should handle both or mark as `Incompatible`

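For orientation, here is a minimal sketch of the kind of ANSI branch to look for in the Spark source. It is illustrative only, and real expressions vary (some wrap, some return null, some truncate in legacy mode):

```scala
// Illustrative only: the shape of an overflow branch for integer addition.
// failOnError is typically derived from spark.sql.ansi.enabled.
def addInts(a: Int, b: Int, failOnError: Boolean): Int =
  if (failOnError) Math.addExact(a, b) // ANSI mode: throw on overflow
  else a + b                           // legacy mode: silently wrap around
```
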
### 4. Check Against Implementation Guidelines

**Always verify PRs follow the implementation guidelines.**

#### Scala Serde (`spark/src/main/scala/org/apache/comet/serde/`)

- [ ] Expression class correctly identified
- [ ] All child expressions converted via `exprToProtoInternal`
- [ ] Return type correctly serialized
- [ ] `getSupportLevel` reflects true compatibility (see the sketch after this list):
  - `Compatible()` - matches Spark exactly
  - `Incompatible(Some("reason"))` - differs in documented ways
  - `Unsupported(Some("reason"))` - cannot be implemented
- [ ] Serde in appropriate file (`datetime.scala`, `strings.scala`, `arithmetic.scala`, etc.)

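A minimal sketch of what a `getSupportLevel` override might look like, reusing the hypothetical `StringByteLength` expression from earlier. The exact signature in Comet may differ; this only illustrates choosing between the three support levels:

```scala
// Hypothetical sketch; check the Comet serde traits for the real signature.
override def getSupportLevel(expr: StringByteLength): SupportLevel = {
  expr.child.dataType match {
    case StringType => Compatible()
    case BinaryType =>
      Incompatible(Some("Binary inputs may differ from Spark for invalid UTF-8"))
    case other =>
      Unsupported(Some(s"$other inputs are not supported"))
  }
}
```
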
#### Registration (`QueryPlanSerde.scala`)

- [ ] Added to correct map (temporal, string, arithmetic, etc.) - see the sketch below
- [ ] No duplicate registrations
- [ ] Import statement added

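A hedged sketch of what a registration entry might look like, reusing the hypothetical `StringByteLength` expression and a hypothetical `CometStringByteLength` serde object. Match the map names and types that already exist in `QueryPlanSerde.scala` rather than this sketch:

```scala
// Hypothetical entry; follow the existing entries in the file.
private val stringExpressions: Map[Class[_ <: Expression], CometExpressionSerde] =
  Map(
    // ...existing entries...
    classOf[StringByteLength] -> CometStringByteLength
  )
```
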
#### Rust Implementation (if applicable)

Location: `native/spark-expr/src/`

- [ ] Matches DataFusion and Arrow conventions
- [ ] Null handling is correct
- [ ] No panics. Use `Result` types.
- [ ] Efficient array operations (avoid row-by-row)

#### Tests - Prefer SQL File-Based Framework

**Expression tests should use the SQL file-based framework (`CometSqlFileTestSuite`) where possible.** This framework automatically runs each query through both Spark and Comet and compares the results, so no Scala code is needed. Fall back to Scala tests in `CometExpressionSuite` only when the SQL framework cannot express the test, for example complex `DataFrame` setup, programmatic data generation, or non-expression tests.

**SQL file test location:** `spark/src/test/resources/sql-tests/expressions/<category>/`

Categories include: `aggregate/`, `array/`, `string/`, `math/`, `struct/`, `map/`, `datetime/`, `hash/`, etc.

**SQL file structure:**

```sql
-- ConfigMatrix: parquet.enable.dictionary=false,true

-- Create test data
statement
CREATE TABLE test_crc32(col string, a int, b float) USING parquet

statement
INSERT INTO test_crc32 VALUES ('Spark', 10, 1.5), (NULL, NULL, NULL), ('', 0, 0.0)

-- Default mode: verifies native Comet execution + result matches Spark
query
SELECT crc32(col) FROM test_crc32

-- spark_answer_only: compares results without requiring native execution
query spark_answer_only
SELECT crc32(cast(a as string)) FROM test_crc32

-- tolerance: allows numeric variance for floating-point results
query tolerance=0.0001
SELECT cos(v) FROM test_trig

-- expect_fallback: asserts fallback to Spark occurs
query expect_fallback(unsupported expression)
SELECT unsupported_func(v) FROM test_table

-- expect_error: verifies both engines throw matching exceptions
query expect_error(ARITHMETIC_OVERFLOW)
SELECT 2147483647 + 1

-- ignore: skip queries with known bugs (include GitHub issue link)
query ignore(https://github.com/apache/datafusion-comet/issues/NNNN)
SELECT known_buggy_expr(v) FROM test_table
```

**Running SQL file tests:**

```bash
# All SQL file tests
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none

# Specific test file (substring match)
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite crc32" -Dtest=none
```

**CRITICAL: Verify all test requirements (regardless of framework):**

- [ ] Basic functionality tested (column data, not just literals)
- [ ] Null handling tested (`SELECT expression(NULL)`)
- [ ] Edge cases tested (empty input, overflow, boundary values)
- [ ] Both literal values and column references tested (they use different code paths)
- [ ] For timestamp/datetime expressions, timezone handling is tested (e.g., UTC, non-UTC session timezone, timestamps with and without timezone)
- [ ] One expression per SQL file for easier debugging
- [ ] If using Scala tests instead, literal tests MUST disable constant folding:

  ```scala
  // Requires: import org.apache.spark.sql.internal.SQLConf
  withSQLConf(SQLConf.OPTIMIZER_EXCLUDED_RULES.key ->
      "org.apache.spark.sql.catalyst.optimizer.ConstantFolding") {
    checkSparkAnswerAndOperator("SELECT func(literal)")
  }
  ```

### 5. Performance Review (Expression PRs)

**For PRs that add new expressions, performance is not optional.** The whole point of Comet is to be faster than Spark. If a new expression is not faster, it may not be worth adding.

1. **Check that the PR includes microbenchmark results.** The PR description should contain benchmark numbers comparing Comet vs Spark for the new expression. If benchmark results are missing, flag this as a required addition.

2. **Look for a microbenchmark implementation.** Expression benchmarks live in `spark/src/test/scala/org/apache/spark/sql/benchmark/`. Check whether the PR adds a benchmark for the new expression. A minimal sketch follows.

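   A minimal sketch of such a benchmark, assuming a `SparkSession` named `spark` and a parquet table `test_table` that already exist. Spark's `Benchmark` utility is real; the table, column, and query are illustrative, and Comet's benchmark suites wrap this in more infrastructure:

   ```scala
   import org.apache.spark.benchmark.Benchmark

   val benchmark = new Benchmark("crc32 expression", valuesPerIteration = 10000000L)

   // Baseline: plain Spark with Comet disabled
   benchmark.addCase("Spark") { _ =>
     spark.conf.set("spark.comet.enabled", "false")
     spark.sql("SELECT sum(crc32(col)) FROM test_table").collect()
   }

   // Same query with Comet enabled
   benchmark.addCase("Comet") { _ =>
     spark.conf.set("spark.comet.enabled", "true")
     spark.sql("SELECT sum(crc32(col)) FROM test_table").collect()
   }

   benchmark.run() // prints timings and relative speed for each case
   ```
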
3. **Review the benchmark results if provided:**
   - Is Comet actually faster than Spark for this expression?
   - Are the benchmarks representative? They should test with realistic data sizes, not just trivial inputs.
   - Are different data types benchmarked if the expression supports multiple types?

4. **Review the Rust implementation for performance concerns:**
   - Unnecessary allocations or copies
   - Row-by-row processing where batch/array operations are possible
   - Redundant type conversions
   - Inefficient string handling (e.g., repeated UTF-8 validation)
   - Missing use of Arrow compute kernels where they exist

5. **If benchmark results show Comet is slower than Spark**, flag this clearly. The PR should explain why the regression is acceptable or include a plan to optimize.

### 6. Check CI Test Failures

**Always check the CI status and summarize any test failures in your review.**

```bash
# View CI check status
gh pr checks $ARGUMENTS --repo apache/datafusion-comet

# View failed check details
gh pr checks $ARGUMENTS --repo apache/datafusion-comet --failed
```

### 7. Documentation Check

Check whether the PR requires updates to user-facing documentation in `docs/`:

- **Compatibility guide** (`docs/source/user-guide/compatibility.md`): New expressions or operators should be listed. Incompatible behaviors should be documented.
- **Configuration guide** (`docs/source/user-guide/configs.md`): New config options should be documented.
- **Expressions list** (`docs/source/user-guide/expressions.md`): New expressions should be added.

If the PR adds a new expression or operator but does not update the relevant docs, flag this as something that needs to be addressed.

### 8. Common Comet Review Issues

1. **Incomplete type support**: The Spark expression supports types that the PR does not handle
2. **Missing edge cases**: Null, overflow, empty string, negative values
3. **Wrong return type**: The return type must match Spark exactly
4. **Tests in the wrong framework**: Expression tests should use the SQL file-based framework (`CometSqlFileTestSuite`) rather than Scala test suites like `CometExpressionSuite`. Suggest migration if the PR adds Scala tests for expressions that could use SQL files instead.
5. **Stale native code**: The PR might need `./mvnw install -pl common -DskipTests`
6. **Missing `getSupportLevel`**: Edge cases should be marked as `Incompatible`

---

## Output Format

Present your review as guidance for the reviewer. Structure your output as:

1. **PR Summary** - Brief description of what the PR does
2. **CI Status** - Summary of CI check results
3. **Findings** - Your analysis organized by area (Spark compatibility, implementation, tests, etc.)
4. **Suggested Review Comments** - Specific comments the reviewer could leave on the PR, with file and line references where applicable

## Review Tone and Style

Write reviews that sound human and conversational. Avoid:

- Robotic or formulaic language
- Em dashes. Use separate sentences instead.
- Semicolons. Use separate sentences instead.

Instead:

- Write in flowing paragraphs using simple grammar
- Keep sentences short and separate rather than joining them with punctuation
- Be kind and constructive, even when raising concerns
- Use backticks around any code references (function names, file paths, class names, types, config keys, etc.)
- **Suggest** adding tests rather than stating tests are missing (e.g., "It might be worth adding a test for X" not "Tests are missing for X")
- **Ask questions** about edge cases rather than asserting they aren't handled (e.g., "Does this handle the case where X is null?" not "This doesn't handle null")
- Frame concerns as questions or suggestions when possible
- Acknowledge what the PR does well before raising concerns

## Do Not Post Comments

**IMPORTANT: Never post comments or reviews on the PR directly.** This skill is for providing guidance to a human reviewer. Present all findings and suggested comments to the user. The user will decide what to post.

common/pom.xml

Lines changed: 1 addition & 1 deletion

```diff
@@ -26,7 +26,7 @@ under the License.
   <parent>
     <groupId>org.apache.datafusion</groupId>
     <artifactId>comet-parent-spark${spark.version.short}_${scala.binary.version}</artifactId>
-    <version>0.14.0-SNAPSHOT</version>
+    <version>0.15.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
```

dev/diffs/3.4.3.diff

Lines changed: 1 addition & 1 deletion

```diff
@@ -7,7 +7,7 @@ index d3544881af1..9c16099090c 100644
     <ivy.version>2.5.1</ivy.version>
     <oro.version>2.0.8</oro.version>
 +    <spark.version.short>3.4</spark.version.short>
-+    <comet.version>0.14.0-SNAPSHOT</comet.version>
++    <comet.version>0.15.0-SNAPSHOT</comet.version>
     <!--
       If you changes codahale.metrics.version, you also need to change
       the link to metrics.dropwizard.io in docs/monitoring.md.
```
