Skip to content

Structural Aggregate IR Invariants#1242

Merged
zonotope merged 6 commits into
mainfrom
refactor/ir-contract
May 21, 2026
Merged

Structural Aggregate IR Invariants#1242
zonotope merged 6 commits into
mainfrom
refactor/ir-contract

Conversation

@zonotope
Copy link
Copy Markdown
Contributor

Summary

Refactors the query IR's representation of aggregate functions so that fields previously documented as "this only applies when…" are now carried inline on the variants where they apply. Removes a distinct: bool flag and an input_var: Option<VarId> from the spec struct in favor of variant-specific data, introduces an explicit InputSemantics enum for the DISTINCT modifier, and tightens up the inherent API on NonEmpty<T> from fluree-db-core so it's pleasant enough to use that more of the IR can rely on it going forward.

Motivation

The previous shape of AggregateSpec encoded several invariants by convention rather than by type:

  • input_var: Option<VarId> was None for COUNT(*) and Some(_) for every other variant. Nothing in the type prevented a Min aggregate with input_var: None (a runtime panic waiting to happen) or a CountAll with input_var: Some(_) (meaningless data the executor had to ignore).
  • distinct: bool rode on every aggregate spec, including variants where DISTINCT is a semantic no-op (MIN, MAX, SAMPLE) and one variant — COUNT(DISTINCT) — that already has a dedicated streaming-state representation. The SPARQL lowering layer carried a comment apologizing for clearing the flag on CountDistinct so the streaming aggregator wouldn't double-dedup.
  • Both the JSON-LD parser and the SPARQL lowering pass had to construct the spec and remember to keep these correlations consistent. The planner and executor pattern-matched on the function, then re-read the auxiliary fields to know what was actually being asked for.

This branch makes those correlations structural so that "Sum without an input variable" and "Min with DISTINCT" are no longer states the IR can express.

Changes

AggregateFn now carries each variant's input variable and DISTINCT mode inline

fluree-db-query/src/ir/grouping.rs. The old flat enum (Count, CountDistinct, Sum, Avg, Min, Max, Sample, Median, Variance, Stddev, GroupConcat { separator }) is replaced with variants that carry their data:

pub enum AggregateFn {
    Count(VarId),
    CountAll,
    CountDistinct(VarId),
    Sum(VarId, InputSemantics),
    Avg(VarId, InputSemantics),
    Min(VarId),
    Max(VarId),
    Median(VarId, InputSemantics),
    Variance(VarId, InputSemantics),
    Stddev(VarId, InputSemantics),
    GroupConcat { input: VarId, semantics: InputSemantics, separator: String },
    Sample(VarId),
}

The input variable is part of every variant except CountAll, which counts rows regardless of values. Variants where DISTINCT actually changes the result carry an InputSemantics; Min, Max, and Sample don't, because their values are unchanged by deduplication. CountDistinct stays a separate variant because its streaming state is a HashSet rather than a counter — that distinction is real and meaningful even though it parallels Count(VarId) + InputSemantics::Set at the surface level.

InputSemantics names what DISTINCT actually does

pub enum InputSemantics {
    /// Treat the input as a list — every occurrence counted (default).
    List,
    /// Treat the input as a set — duplicates collapsed (DISTINCT).
    Set,
}

The aggregate always receives a multiset of Bindings from upstream; what DISTINCT changes is how the function interprets that input. Naming it InputSemantics rather than Distinct/Multiset/Quantifier captures that the modifier is about input interpretation, not about the input's type or any quantification semantics.

AggregateSpec shrinks to "function plus output binding"

pub struct AggregateSpec {
    pub function: AggregateFn,
    pub output_var: VarId,
}

Everything variant-correlated (input variable, DISTINCT mode, GROUP_CONCAT separator) lives on function. The spec keeps only the output binding — a property of the operator wiring, not of the aggregate itself. The same shrinkage carries through to StreamingAggSpec in the executor.

Methods on AggregateFn

  • input_var() -> Option<VarId> returns the read variable for every variant except CountAll.
  • is_distinct() -> bool returns whether DISTINCT was requested — true for CountDistinct and for any variant whose InputSemantics is Set; always false for Min/Max/Sample/Count/CountAll, which don't carry the modifier at all.
  • apply(&self, &Binding) -> Binding is the public entry point for computing the aggregate. Replaces the free function apply_aggregate(func, binding, distinct) previously re-exported from the crate. Private needs_input_dedup() and compute() helpers split the dedup pass from the reduction.

SPARQL COUNT(DISTINCT *) is now an explicit error

Folded into the variant-correlated lowering: COUNT(*) DISTINCT was previously short-circuited by checking matches!(agg_fn, AggregateFn::CountAll) && *distinct after construction. The new lowering rejects it at the AST → IR boundary with a structured not_implemented error rather than constructing a degenerate spec.

Cascading simplifications in the planner and executor

execute/operator_tree.rs shrinks by ~100 lines. The aggregate fast-path detectors (detect_count_all_aggregate, detect_count_distinct_aggregate, detect_predicate_minmax_string, detect_predicate_avg_numeric, detect_fused_scan_sum_i64, detect_group_by_object_star_topk, detect_sum_strlen_group_concat_subquery, …) now pattern-match directly on AggregateFn variants and bind the input variable in the match arm instead of reading agg.input_var and checking agg.distinct separately. Multiple "is this distinct?" guards collapse to a single match.

group_aggregate.rs's all_streamable() check becomes a single match on the function. DISTINCT SUM/AVG are still excluded (they need every value collected before reduction), but the routing is now structural.

execute/dependency.rs's aggregate-dep tracing uses spec.function.input_var() instead of reading spec.input_var directly. The variable-dependency walk no longer has to know that some specs lack input variables.

NonEmpty<T> gets a more complete API

fluree-db-core/src/nonempty.rs. Additive: existing callers continue to work. New surface:

  • singleton(head) and from_head_tail(head, tail) as named constructors.
  • nonempty![a, b, c] macro that statically rejects an empty literal (compile error, not runtime panic).
  • iter_mut(), into_iter() (consuming), last(), push(), extend(), map() matching the corresponding Vec operations. The non-empty invariant is preserved trivially in each: push/extend only grow the tail; map preserves length; last() falls back to head when the tail is empty.
  • IntoIterator impls for &NonEmpty<T>, &mut NonEmpty<T>, and NonEmpty<T>. Lets for x in &ne { … } work without going through .iter() explicitly.
  • Index<usize>. Out-of-range panics like Vec/slice indexing; index 0 is guaranteed by the invariant.

This is groundwork: the API used to be just try_from_vec, iter, first, len, into_vec, which made it awkward enough that callers tended to stick with Vec<T> even where the non-empty invariant was real.

What's not in this branch

  • Higher-order pattern bodies (Optional, Minus, Exists/NotExists, Union) remain Vec<Pattern>. An earlier attempt to migrate them to NonEmpty<Pattern> was reverted: the structural win was real but the iteration-contract change cascaded into perf-sensitive executor code in ways that dwarfed the payoff.
  • Pattern::Union(Vec<Vec<Pattern>>), SubqueryPattern::select: Vec<VarId>, and ConstructTemplate::patterns: Vec<TriplePattern> are all shape-of-cleanups that would similarly want NonEmpty and are deferred for the same reason.
  • The Expression::Function enum's 86 variants are still combined with a free-form Vec<Expression> argument list, leaving runtime arity checks scattered across the eval layer. Splitting them by arity is the biggest remaining structural win in the IR but is similarly cascading; not addressed here.

@zonotope zonotope requested review from aaj3f and bplatz May 15, 2026 15:01
Copy link
Copy Markdown
Contributor

@bplatz bplatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇

@zonotope zonotope merged commit 90e68e0 into main May 21, 2026
14 checks passed
@zonotope zonotope deleted the refactor/ir-contract branch May 21, 2026 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants