feat: close gaps 1, 2, 8 — struct dot-access, dedup qualify, WHERE filter lineage#65

Merged: mingjerli merged 3 commits into main from feature/gaps-1-2-8 on Apr 15, 2026

@mingjerli
Owner

Summary

  • Gap 8: Emit where_filter edges from every column referenced in a WHERE clause to every non-star output column. Adds a WherePredicateInfo dataclass and an _extract_where_columns helper that skips subquery subtrees.
  • Gap 1: Struct dot-access fallback (after.id). When a table qualifier cannot be resolved, recursively walk the unit dependency chain to find base tables and emit struct edges with nested_path and access_type="struct". The fallback is also wired into _resolve_qualify_column.
  • Gap 2: Promote qualify_info from subquery-based dedup patterns (ROW_NUMBER() ... AS rn in a subquery, filtered in the outer query by WHERE rn = 1, or by the same kind of filter written with <= or <). Adds a ranking_window_columns field on QueryUnit and a _promote_dedup_qualify_if_applicable parser method.

Closes all 3 remaining open gaps from the CDC/SCD pipeline stress test.

Test plan

  • 15 new tests in test_where_filter_lineage.py (simple, compound, subquery exclusion, SELECT *, dataclass fields)
  • 13 new tests in test_struct_dot_access.py (single-table, CDC subquery, bracket regression, multi-table JOIN, empty fallback)
  • 7 new tests in test_subquery_dedup_qualify.py (EQ/LTE/LT promotion, non-ranking rejection, explicit QUALIFY preservation)
  • Full suite: 1512 passed, 40 skipped, 2 xfailed, 0 failures (baseline was 1477)
  • No regressions — all existing tests pass

🤖 Generated with Claude Code

Add WHERE filter lineage tracking: columns referenced in WHERE clauses
now produce where_filter edges to all non-star output columns. Subquery
columns within WHERE are excluded from the outer query's predicates.
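
The extraction rule above can be sketched on a toy expression tree. This is an illustration only, not the code from this PR: the `Node` class and the functions `extract_where_columns` and `where_filter_edges` are hypothetical stand-ins for the parser's sqlglot-based traversal, and edges are reduced to `(source, target)` pairs.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy expression node (hypothetical); kind is 'column', 'subquery', or an operator."""
    kind: str
    name: str = ""
    children: list[Node] = field(default_factory=list)


def extract_where_columns(node: Node) -> list[str]:
    """Collect column names under `node` without descending into subqueries."""
    if node.kind == "subquery":
        return []  # inner columns belong to the subquery, not the outer WHERE
    if node.kind == "column":
        return [node.name]
    found: list[str] = []
    for child in node.children:
        found.extend(extract_where_columns(child))
    return found


def where_filter_edges(where: Node, outputs: list[str]) -> list[tuple[str, str]]:
    """Pair every WHERE column with every non-star output column."""
    targets = [col for col in outputs if col != "*"]
    return [(src, dst) for src in extract_where_columns(where) for dst in targets]


# WHERE status = 'active' AND id IN (SELECT user_id FROM banned)
where = Node("and", children=[
    Node("eq", children=[Node("column", "status"), Node("literal", "'active'")]),
    Node("in", children=[
        Node("column", "id"),
        Node("subquery", children=[Node("column", "user_id")]),
    ]),
])
edges = where_filter_edges(where, ["id", "name"])
```

In this sketch `user_id` never appears as an edge source because it sits under the subquery node, and a `SELECT *` output list yields no edges at all.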

- Add WherePredicateInfo dataclass and where_predicates to QueryUnit
- Add is_where_filter and where_condition fields to ColumnEdge
- Extract WHERE column refs in query parser (skipping subquery subtrees)
- Create where_filter edges in lineage builder
- Preserve WHERE metadata in pipeline cross-query edge copies
- Fix trace_forward BFS to treat nodes as terminals when all outgoing
  targets are already visited (prevents cycles from breaking traversal)
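
A minimal sketch of the BFS terminal rule from the last bullet, assuming the lineage graph is a plain adjacency dict. The function below is a simplified stand-in written for illustration, not the project's `trace_forward`, and its return shape is an assumption.

```python
from collections import deque


def trace_forward_sketch(graph: dict[str, list[str]], start: str) -> tuple[list[str], list[str]]:
    """Breadth-first walk returning (visit order, terminal nodes)."""
    visited = {start}
    order = [start]
    terminals: list[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        unvisited = [t for t in graph.get(node, []) if t not in visited]
        if not unvisited:
            # Either no outgoing edges, or every target was already seen
            # (for example an edge that closes a cycle): treat as terminal.
            terminals.append(node)
            continue
        for target in unvisited:
            visited.add(target)
            order.append(target)
            queue.append(target)
    return order, terminals


# c points back at a, so the walk must stop at c instead of reporting no terminal.
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
order, terminals = trace_forward_sketch(cyclic, "a")
```

Without the "all targets already visited" check, a purely cyclic graph like this one would finish the walk with an empty terminal list.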
When sqlglot parses `after.id` as Column(table="after", name="id") and
"after" cannot be resolved as a table, alias, or unit in scope, the new
struct fallback emits a lineage edge with nested_path=".id" and
access_type="struct", using the first base table from the dependency
chain as the source table. Includes recursive base table resolution for
CDC-like subquery patterns.
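
The fallback described above might look roughly like the following. Everything here is a hypothetical simplification: the `Unit` class, both function names, and the dict-shaped edge are assumptions for illustration, and using the unresolved qualifier as the source column is a guess rather than something the commit message states.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical query unit: direct base tables plus upstream unit names."""
    name: str
    tables: list[str] = field(default_factory=list)
    upstream_units: list[str] = field(default_factory=list)


def resolve_base_tables(unit_name: str, units: dict[str, Unit], seen: set[str] | None = None) -> list[str]:
    """Recursively collect base tables along the unit dependency chain."""
    seen = set() if seen is None else seen
    if unit_name in seen or unit_name not in units:
        return []
    seen.add(unit_name)
    unit = units[unit_name]
    found = list(unit.tables)
    for upstream in unit.upstream_units:
        for table in resolve_base_tables(upstream, units, seen):
            if table not in found:
                found.append(table)
    return found


def struct_fallback_edge(qualifier: str, field_name: str, unit_name: str,
                         units: dict[str, Unit], resolvable: set[str]) -> dict | None:
    """Emit a struct edge only when the qualifier is not a known table, alias, or unit."""
    if qualifier in resolvable:
        return None
    tables = resolve_base_tables(unit_name, units)
    if not tables:
        return None  # empty fallback: nothing to attribute the struct access to
    return {
        "source_table": tables[0],   # first base table in the chain
        "source_column": qualifier,  # assumption: the struct column itself
        "nested_path": "." + field_name,
        "access_type": "struct",
    }


# SELECT after.id FROM (SELECT * FROM raw.cdc_events) AS cdc_sub
units = {
    "outer": Unit("outer", upstream_units=["cdc_sub"]),
    "cdc_sub": Unit("cdc_sub", tables=["raw.cdc_events"]),
}
edge = struct_fallback_edge("after", "id", "outer", units, resolvable={"cdc_sub"})
```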
…n (Gap 2)

Detect the common dedup pattern (ROW_NUMBER() OVER (...) AS rn in a subquery,
plus WHERE rn = 1 in the outer query) and promote it to qualify_info on the
outer unit. Supports EQ, LTE, and LT comparisons against ranking functions
(ROW_NUMBER, RANK, DENSE_RANK, NTILE). Adds ranking_window_columns to the
QueryUnit model for cross-unit metadata propagation.
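
As a rough illustration of the promotion rule, the sketch below assumes the subquery exposes a mapping from window-column alias to function name and that the outer predicate has already been reduced to a `(column, operator, value)` triple. The function name, the mapping shape, and the returned dict are all hypothetical and do not reflect the real `qualify_info` structure.

```python
from __future__ import annotations

RANKING_FUNCTIONS = {"ROW_NUMBER", "RANK", "DENSE_RANK", "NTILE"}
PROMOTABLE_OPS = {"=", "<=", "<"}  # EQ, LTE, LT


def promote_dedup_qualify(ranking_window_columns: dict[str, str],
                          predicate: tuple[str, str, int],
                          existing_qualify: dict | None = None) -> dict | None:
    """Return qualify-style metadata if the outer WHERE filters on a ranking column."""
    if existing_qualify is not None:
        return existing_qualify  # an explicit QUALIFY clause is preserved as-is
    column, op, value = predicate
    func = ranking_window_columns.get(column)
    if func not in RANKING_FUNCTIONS or op not in PROMOTABLE_OPS:
        return None  # not a ranking window column, or an unsupported comparison
    return {"column": column, "function": func, "condition": f"{column} {op} {value}"}


# Subquery: ROW_NUMBER() OVER (...) AS rn; outer query: WHERE rn = 1
promoted = promote_dedup_qualify({"rn": "ROW_NUMBER"}, ("rn", "=", 1))
# A non-ranking window column (for example SUM(...) OVER (...)) is rejected.
rejected = promote_dedup_qualify({"total": "SUM"}, ("total", "=", 1))
```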
@mingjerli merged commit 7b98e6e into main on Apr 15, 2026
8 checks passed