Structural Aggregate IR Invariants#1242
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Refactors the query IR's representation of aggregate functions so that fields previously documented as "this only applies when…" are now carried inline on the variants where they apply. Removes a
distinct: boolflag and aninput_var: Option<VarId>from the spec struct in favor of variant-specific data, introduces an explicitInputSemanticsenum for theDISTINCTmodifier, and tightens up the inherent API onNonEmpty<T>fromfluree-db-coreso it's pleasant enough to use that more of the IR can rely on it going forward.Motivation
The previous shape of
AggregateSpecencoded several invariants by convention rather than by type:input_var: Option<VarId>wasNoneforCOUNT(*)andSome(_)for every other variant. Nothing in the type prevented aMinaggregate withinput_var: None(a runtime panic waiting to happen) or aCountAllwithinput_var: Some(_)(meaningless data the executor had to ignore).distinct: boolrode on every aggregate spec, including variants whereDISTINCTis a semantic no-op (MIN,MAX,SAMPLE) and one variant —COUNT(DISTINCT)— that already has a dedicated streaming-state representation. The SPARQL lowering layer carried a comment apologizing for clearing the flag onCountDistinctso the streaming aggregator wouldn't double-dedup.This branch makes those correlations structural so that "Sum without an input variable" and "Min with DISTINCT" are no longer states the IR can express.
Changes
AggregateFnnow carries each variant's input variable and DISTINCT mode inlinefluree-db-query/src/ir/grouping.rs. The old flat enum (Count,CountDistinct,Sum,Avg,Min,Max,Sample,Median,Variance,Stddev,GroupConcat { separator }) is replaced with variants that carry their data:The input variable is part of every variant except
CountAll, which counts rows regardless of values. Variants whereDISTINCTactually changes the result carry anInputSemantics;Min,Max, andSampledon't, because their values are unchanged by deduplication.CountDistinctstays a separate variant because its streaming state is aHashSetrather than a counter — that distinction is real and meaningful even though it parallelsCount(VarId)+InputSemantics::Setat the surface level.InputSemanticsnames whatDISTINCTactually doesThe aggregate always receives a multiset of
Bindings from upstream; whatDISTINCTchanges is how the function interprets that input. Naming itInputSemanticsrather thanDistinct/Multiset/Quantifiercaptures that the modifier is about input interpretation, not about the input's type or any quantification semantics.AggregateSpecshrinks to "function plus output binding"Everything variant-correlated (input variable,
DISTINCTmode,GROUP_CONCATseparator) lives onfunction. The spec keeps only the output binding — a property of the operator wiring, not of the aggregate itself. The same shrinkage carries through toStreamingAggSpecin the executor.Methods on
AggregateFninput_var() -> Option<VarId>returns the read variable for every variant exceptCountAll.is_distinct() -> boolreturns whetherDISTINCTwas requested — true forCountDistinctand for any variant whoseInputSemanticsisSet; always false forMin/Max/Sample/Count/CountAll, which don't carry the modifier at all.apply(&self, &Binding) -> Bindingis the public entry point for computing the aggregate. Replaces the free functionapply_aggregate(func, binding, distinct)previously re-exported from the crate. Privateneeds_input_dedup()andcompute()helpers split the dedup pass from the reduction.SPARQL
COUNT(DISTINCT *)is now an explicit errorFolded into the variant-correlated lowering:
COUNT(*) DISTINCTwas previously short-circuited by checkingmatches!(agg_fn, AggregateFn::CountAll) && *distinctafter construction. The new lowering rejects it at the AST → IR boundary with a structurednot_implementederror rather than constructing a degenerate spec.Cascading simplifications in the planner and executor
execute/operator_tree.rsshrinks by ~100 lines. The aggregate fast-path detectors (detect_count_all_aggregate,detect_count_distinct_aggregate,detect_predicate_minmax_string,detect_predicate_avg_numeric,detect_fused_scan_sum_i64,detect_group_by_object_star_topk,detect_sum_strlen_group_concat_subquery, …) now pattern-match directly onAggregateFnvariants and bind the input variable in the match arm instead of readingagg.input_varand checkingagg.distinctseparately. Multiple "is this distinct?" guards collapse to a single match.group_aggregate.rs'sall_streamable()check becomes a single match on the function.DISTINCT SUM/AVGare still excluded (they need every value collected before reduction), but the routing is now structural.execute/dependency.rs's aggregate-dep tracing usesspec.function.input_var()instead of readingspec.input_vardirectly. The variable-dependency walk no longer has to know that some specs lack input variables.NonEmpty<T>gets a more complete APIfluree-db-core/src/nonempty.rs. Additive: existing callers continue to work. New surface:singleton(head)andfrom_head_tail(head, tail)as named constructors.nonempty![a, b, c]macro that statically rejects an empty literal (compile error, not runtime panic).iter_mut(),into_iter()(consuming),last(),push(),extend(),map()matching the correspondingVecoperations. The non-empty invariant is preserved trivially in each:push/extendonly grow the tail;mappreserves length;last()falls back toheadwhen the tail is empty.IntoIteratorimpls for&NonEmpty<T>,&mut NonEmpty<T>, andNonEmpty<T>. Letsfor x in &ne { … }work without going through.iter()explicitly.Index<usize>. Out-of-range panics likeVec/slice indexing; index 0 is guaranteed by the invariant.This is groundwork: the API used to be just
try_from_vec,iter,first,len,into_vec, which made it awkward enough that callers tended to stick withVec<T>even where the non-empty invariant was real.What's not in this branch
Optional,Minus,Exists/NotExists,Union) remainVec<Pattern>. An earlier attempt to migrate them toNonEmpty<Pattern>was reverted: the structural win was real but the iteration-contract change cascaded into perf-sensitive executor code in ways that dwarfed the payoff.Pattern::Union(Vec<Vec<Pattern>>),SubqueryPattern::select: Vec<VarId>, andConstructTemplate::patterns: Vec<TriplePattern>are all shape-of-cleanups that would similarly wantNonEmptyand are deferred for the same reason.Expression::Functionenum's 86 variants are still combined with a free-formVec<Expression>argument list, leaving runtime arity checks scattered across the eval layer. Splitting them by arity is the biggest remaining structural win in the IR but is similarly cascading; not addressed here.