---
name: review-comet-pr
description: Review a DataFusion Comet pull request for Spark compatibility and implementation correctness. Provides guidance to a reviewer rather than posting comments directly.
argument-hint: <pr-number>
---

Review Comet PR #$ARGUMENTS

## Before You Start

### Gather PR Metadata

Fetch the PR details to understand the scope:

```bash
gh pr view $ARGUMENTS --repo apache/datafusion-comet --json title,body,author,isDraft,state,files
```

### Review Existing Comments First

Before forming your review:

1. **Read all existing review comments** on the PR
2. **Check the conversation tab** for any discussion
3. **Avoid duplicating feedback** that others have already provided
4. **Build on existing discussions** rather than starting new threads on the same topic
5. **If you have no additional concerns beyond what's already discussed, say so**
6. **Ignore Copilot reviews** - do not reference or build upon comments from GitHub Copilot

```bash
# View existing comments on a PR
gh pr view $ARGUMENTS --repo apache/datafusion-comet --comments
```

---

## Review Workflow

### 1. Gather Context

Read the changed files and understand the area of the codebase being modified:

```bash
# View the diff
gh pr diff $ARGUMENTS --repo apache/datafusion-comet
```

For expression PRs, check how similar expressions are implemented in the codebase. Look at the serde files in `spark/src/main/scala/org/apache/comet/serde/` and Rust implementations in `native/spark-expr/src/`.

### 2. Read Spark Source (Expression PRs)

**For any PR that adds or modifies an expression, you must read the Spark source code to understand the canonical behavior.** This is the authoritative reference for what Comet must match.

1. **Clone or update the Spark repo:**

   ```bash
   # Clone if not already present (use /tmp to avoid polluting the workspace)
   if [ ! -d /tmp/spark ]; then
     git clone --depth 1 https://github.com/apache/spark.git /tmp/spark
   fi
   ```

2. **Find the expression implementation in Spark:**

   ```bash
   # Search for the expression class (e.g., for "Conv", "Hex", "Substring")
   find /tmp/spark/sql/catalyst/src/main/scala -name "*.scala" | xargs grep -l "case class <ExpressionName>"
   ```

3. **Read the Spark implementation carefully.** Pay attention to:
   - The `eval`, `nullSafeEval`, and `doGenCode` methods. These define the exact behavior.
   - The `inputTypes` and `dataType` fields. These define which types Spark accepts and what it returns.
   - Null handling. Does it use `nullable = true`? Does `nullSafeEval` handle nulls implicitly?
   - Special cases, guards, and `require` assertions.
   - ANSI mode branches (look for `SQLConf.get.ansiEnabled` or `failOnError`).

4. **Read the Spark tests for the expression:**

   ```bash
   # Find test files
   find /tmp/spark/sql -name "*.scala" -path "*/test/*" | xargs grep -l "<ExpressionName>"
   ```

5. **Compare the Spark behavior against the Comet implementation in the PR.** Identify:
   - Edge cases tested in Spark but not in the PR
   - Data types supported in Spark but not handled in the PR
   - Behavioral differences that should be marked `Incompatible`

6. **Suggest additional tests** for any edge cases or type combinations covered in Spark's tests that are missing from the PR's tests.

### 3. Spark Compatibility Check

**This is the most critical aspect of Comet reviews.** Comet must produce identical results to Spark.

For expression PRs, verify against the Spark source you read in step 2:

1. **Check edge cases**
   - Null handling
   - Overflow behavior
   - Empty input behavior
   - Type-specific behavior

2. **Verify all data types are handled**
   - Does Spark support this type? (Check `inputTypes` in Spark source)
   - Does the PR handle all Spark-supported types?

3. **Check for ANSI mode differences**
   - Spark behavior may differ between legacy and ANSI modes
   - The PR should handle both or mark the expression as `Incompatible`
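
The ANSI check is where subtle bugs tend to hide. As a hedged illustration (plain scalars rather than Arrow arrays, and a simplified error type), Spark's legacy mode wraps on integer overflow while ANSI mode raises an `ARITHMETIC_OVERFLOW` error:

```rust
// Sketch of the two overflow behaviors a native kernel must distinguish.
// Legacy mode wraps like Java arithmetic; ANSI mode surfaces an error.
fn add_legacy(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

fn add_ansi(a: i32, b: i32) -> Result<i32, String> {
    a.checked_add(b)
        .ok_or_else(|| "ARITHMETIC_OVERFLOW".to_string())
}

fn main() {
    // Legacy silently wraps to i32::MIN; ANSI reports the overflow.
    assert_eq!(add_legacy(i32::MAX, 1), i32::MIN);
    assert!(add_ansi(i32::MAX, 1).is_err());
    assert_eq!(add_ansi(1, 2), Ok(3));
}
```

A reviewer can use this mental model when checking whether the PR wires the ANSI/`failOnError` flag through to the native side at all.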

### 4. Check Against Implementation Guidelines

**Always verify PRs follow the implementation guidelines.**

#### Scala Serde (`spark/src/main/scala/org/apache/comet/serde/`)

- [ ] Expression class correctly identified
- [ ] All child expressions converted via `exprToProtoInternal`
- [ ] Return type correctly serialized
- [ ] `getSupportLevel` reflects true compatibility:
  - `Compatible()` - matches Spark exactly
  - `Incompatible(Some("reason"))` - differs in documented ways
  - `Unsupported(Some("reason"))` - cannot be implemented
- [ ] Serde in appropriate file (`datetime.scala`, `strings.scala`, `arithmetic.scala`, etc.)

#### Registration (`QueryPlanSerde.scala`)

- [ ] Added to correct map (temporal, string, arithmetic, etc.)
- [ ] No duplicate registrations
- [ ] Import statement added

#### Rust Implementation (if applicable)

Location: `native/spark-expr/src/`

- [ ] Matches DataFusion and Arrow conventions
- [ ] Null handling is correct
- [ ] No panics. Use `Result` types.
- [ ] Efficient array operations (avoid row-by-row)
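
The last three checkboxes can be sketched in a few lines. This is illustrative only: `Vec<Option<f64>>` stands in for an Arrow array, and a `String` error stands in for the error type real Comet code would use:

```rust
/// Unary kernel over a nullable column: one pass over the whole batch,
/// nulls (`None`) propagate, and failures return `Err` instead of panicking.
fn sqrt_kernel(input: &[Option<f64>]) -> Result<Vec<Option<f64>>, String> {
    input
        .iter()
        .map(|v| match v {
            None => Ok(None), // null in, null out
            // Like Java's Math.sqrt (which Spark uses), negative input yields NaN.
            Some(x) => Ok(Some(x.sqrt())),
        })
        .collect()
}

fn main() {
    let out = sqrt_kernel(&[Some(4.0), None, Some(-1.0)]).unwrap();
    assert_eq!(out[0], Some(2.0));
    assert_eq!(out[1], None);
    assert!(out[2].unwrap().is_nan());
}
```

When reviewing, look for this shape: batch-at-a-time iteration, explicit null propagation, and no `unwrap` or `panic!` on data-dependent paths.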

#### Tests - Prefer SQL File-Based Framework

**Expression tests should use the SQL file-based framework (`CometSqlFileTestSuite`) where possible.** This framework automatically runs each query through both Spark and Comet and compares results. No Scala code is needed. Only fall back to Scala tests in `CometExpressionSuite` when the SQL framework cannot express the test. Examples include complex `DataFrame` setup, programmatic data generation, or non-expression tests.

**SQL file test location:** `spark/src/test/resources/sql-tests/expressions/<category>/`

Categories include: `aggregate/`, `array/`, `string/`, `math/`, `struct/`, `map/`, `datetime/`, `hash/`, etc.

**SQL file structure:**

```sql
-- ConfigMatrix: parquet.enable.dictionary=false,true

-- Create test data
statement
CREATE TABLE test_crc32(col string, a int, b float) USING parquet

statement
INSERT INTO test_crc32 VALUES ('Spark', 10, 1.5), (NULL, NULL, NULL), ('', 0, 0.0)

-- Default mode: verifies native Comet execution + result matches Spark
query
SELECT crc32(col) FROM test_crc32

-- spark_answer_only: compares results without requiring native execution
query spark_answer_only
SELECT crc32(cast(a as string)) FROM test_crc32

-- tolerance: allows numeric variance for floating-point results
query tolerance=0.0001
SELECT cos(v) FROM test_trig

-- expect_fallback: asserts fallback to Spark occurs
query expect_fallback(unsupported expression)
SELECT unsupported_func(v) FROM test_table

-- expect_error: verifies both engines throw matching exceptions
query expect_error(ARITHMETIC_OVERFLOW)
SELECT 2147483647 + 1

-- ignore: skip queries with known bugs (include GitHub issue link)
query ignore(https://github.com/apache/datafusion-comet/issues/NNNN)
SELECT known_buggy_expr(v) FROM test_table
```

**Running SQL file tests:**

```bash
# All SQL file tests
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none

# Specific test file (substring match)
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite crc32" -Dtest=none
```

**CRITICAL: Verify all test requirements (regardless of framework):**

- [ ] Basic functionality tested (column data, not just literals)
- [ ] Null handling tested (`SELECT expression(NULL)`)
- [ ] Edge cases tested (empty input, overflow, boundary values)
- [ ] Both literal values and column references tested (they use different code paths)
- [ ] For timestamp/datetime expressions, timezone handling is tested (e.g., UTC, non-UTC session timezone, timestamps with and without timezone)
- [ ] One expression per SQL file for easier debugging
- [ ] If using Scala tests instead, literal tests MUST disable constant folding:
  ```scala
  withSQLConf(SQLConf.OPTIMIZER_EXCLUDED_RULES.key ->
    "org.apache.spark.sql.catalyst.optimizer.ConstantFolding") {
    checkSparkAnswerAndOperator("SELECT func(literal)")
  }
  ```

### 5. Performance Review (Expression PRs)

**For PRs that add new expressions, performance is not optional.** The whole point of Comet is to be faster than Spark. If a new expression is not faster, it may not be worth adding.

1. **Check that the PR includes microbenchmark results.** The PR description should contain benchmark numbers comparing Comet vs Spark for the new expression. If benchmark results are missing, flag this as a required addition.

2. **Look for a microbenchmark implementation.** Expression benchmarks live in `spark/src/test/scala/org/apache/spark/sql/benchmark/`. Check whether the PR adds a benchmark for the new expression.

3. **Review the benchmark results if provided:**
   - Is Comet actually faster than Spark for this expression?
   - Are the benchmarks representative? They should test with realistic data sizes, not just trivial inputs.
   - Are different data types benchmarked if the expression supports multiple types?

4. **Review the Rust implementation for performance concerns:**
   - Unnecessary allocations or copies
   - Row-by-row processing where batch/array operations are possible
   - Redundant type conversions
   - Inefficient string handling (e.g., repeated UTF-8 validation)
   - Missing use of Arrow compute kernels where they exist

5. **If benchmark results show Comet is slower than Spark**, flag this clearly. The PR should explain why the regression is acceptable or include a plan to optimize.
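
To make the allocation concern concrete, here is a hedged sketch with plain slices standing in for Arrow string arrays. The first version allocates a fresh `String` per row. The second appends into one contiguous buffer plus an offsets vector, which is roughly how Arrow string builders amortize allocation:

```rust
// Naive: one heap allocation per row.
fn upper_per_row(input: &[&str]) -> Vec<String> {
    input.iter().map(|s| s.to_uppercase()).collect()
}

// Builder-style: a single values buffer plus offsets, mirroring the
// layout of an Arrow string array (values buffer + offsets buffer).
fn upper_builder(input: &[&str]) -> (String, Vec<usize>) {
    let mut data = String::with_capacity(input.iter().map(|s| s.len()).sum());
    let mut offsets = Vec::with_capacity(input.len() + 1);
    offsets.push(0);
    for s in input {
        // Extend in place; no intermediate String per row.
        data.extend(s.chars().flat_map(char::to_uppercase));
        offsets.push(data.len());
    }
    (data, offsets)
}

fn main() {
    assert_eq!(upper_per_row(&["ab", "cd"]), vec!["AB", "CD"]);
    let (data, offsets) = upper_builder(&["ab", "cd"]);
    assert_eq!(data, "ABCD");
    assert_eq!(offsets, vec![0, 2, 4]);
}
```

If the PR's hot loop looks like the first version when an Arrow compute kernel or builder would do, that is worth a review comment.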

### 6. Check CI Test Failures

**Always check the CI status and summarize any test failures in your review.**

```bash
# View CI check status
gh pr checks $ARGUMENTS --repo apache/datafusion-comet

# View failed check details
gh pr checks $ARGUMENTS --repo apache/datafusion-comet --failed
```

### 7. Documentation Check

Check whether the PR requires updates to user-facing documentation in `docs/`:

- **Compatibility guide** (`docs/source/user-guide/compatibility.md`): New expressions or operators should be listed. Incompatible behaviors should be documented.
- **Configuration guide** (`docs/source/user-guide/configs.md`): New config options should be documented.
- **Expressions list** (`docs/source/user-guide/expressions.md`): New expressions should be added.

If the PR adds a new expression or operator but does not update the relevant docs, flag this as something that needs to be addressed.

### 8. Common Comet Review Issues

1. **Incomplete type support**: The Spark expression supports types that the PR does not handle
2. **Missing edge cases**: Null, overflow, empty string, negative values
3. **Wrong return type**: The return type must match Spark exactly
4. **Tests in the wrong framework**: Expression tests should use the SQL file-based framework (`CometSqlFileTestSuite`) rather than adding to Scala test suites like `CometExpressionSuite`. Suggest migration if the PR adds Scala tests for expressions that could use SQL files instead.
5. **Stale native code**: The PR might need `./mvnw install -pl common -DskipTests`
6. **Missing `getSupportLevel`**: Behavioral differences and unsupported edge cases should be marked `Incompatible`

---

## Output Format

Present your review as guidance for the reviewer. Structure your output as:

1. **PR Summary** - Brief description of what the PR does
2. **CI Status** - Summary of CI check results
3. **Findings** - Your analysis organized by area (Spark compatibility, implementation, tests, etc.)
4. **Suggested Review Comments** - Specific comments the reviewer could leave on the PR, with file and line references where applicable

## Review Tone and Style

Write reviews that sound human and conversational. Avoid:

- Robotic or formulaic language
- Em dashes. Use separate sentences instead.
- Semicolons. Use separate sentences instead.

Instead:

- Write in flowing paragraphs using simple grammar
- Keep sentences short and separate rather than joining them with punctuation
- Be kind and constructive, even when raising concerns
- Use backticks around any code references (function names, file paths, class names, types, config keys, etc.)
- **Suggest** adding tests rather than stating tests are missing (e.g., "It might be worth adding a test for X" not "Tests are missing for X")
- **Ask questions** about edge cases rather than asserting they aren't handled (e.g., "Does this handle the case where X is null?" not "This doesn't handle null")
- Frame concerns as questions or suggestions when possible
- Acknowledge what the PR does well before raising concerns

## Do Not Post Comments

**IMPORTANT: Never post comments or reviews on the PR directly.** This skill is for providing guidance to a human reviewer. Present all findings and suggested comments to the user. The user will decide what to post.