Conversation
Pull request overview
This PR adds first-class A/B regression checking to diskann-benchmark-runner by introducing a regression-capable benchmark API (tolerance inputs + before/after comparison) and a new check CLI surface, with an example adoption in diskann-benchmark-simd.
Changes:
- Introduce `benchmark::Regression` and registry plumbing to register and discover regression-capable benchmarks and their tolerance input types.
- Add tolerance-file parsing, subset-based matching, and a `check verify`/`check run` execution and reporting pipeline.
- Extend the UX test harness and add many new golden tests covering success/failure/error paths; add SIMD example tolerances and a regression implementation.
Reviewed changes
Copilot reviewed 103 out of 143 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| diskann-benchmark-simd/src/lib.rs | Adds SIMD regression tolerance type and Regression implementation for kernels; updates run result schema. |
| diskann-benchmark-simd/src/bin.rs | Adds integration coverage for check verify on the SIMD example. |
| diskann-benchmark-simd/examples/tolerance.json | Example tolerance file for SIMD regression checks. |
| diskann-benchmark-runner/src/lib.rs | Exposes benchmark publicly and adds internal module for shared helpers. |
| diskann-benchmark-runner/src/benchmark.rs | Adds Regression trait, PassFail, and type-erased regression support in the registry layer. |
| diskann-benchmark-runner/src/registry.rs | Adds register_regression and tolerance discovery/mapping to regression-capable benchmarks. |
| diskann-benchmark-runner/src/internal/mod.rs | Adds shared load_from_disk helper and internal module organization. |
| diskann-benchmark-runner/src/internal/regression.rs | Implements tolerance parsing, subset matching, check job creation, execution, JSON output, and reporting. |
| diskann-benchmark-runner/src/app.rs | Adds check subcommand (skeleton, tolerances, verify, run) and upgrades UX test harness for multi-step scenarios. |
| diskann-benchmark-runner/src/jobs.rs | Refactors input loading/parsing and exposes raw job list/partial parsing for regression pipeline. |
| diskann-benchmark-runner/src/result.rs | Adds RawResult loader for reading previously saved benchmark outputs. |
| diskann-benchmark-runner/src/checker.rs | Tightens Checker::any tagging expectation (with clippy annotation). |
| diskann-benchmark-runner/src/input.rs | Adds Wrapper::<T>::INSTANCE const for tolerance type-erasure usage. |
| diskann-benchmark-runner/src/ux.rs | Adds scrub_path helper for deterministic UX test output. |
| diskann-benchmark-runner/src/utils/mod.rs | Exports new num utilities module. |
| diskann-benchmark-runner/src/utils/num.rs | Adds relative_change and NonNegativeFinite for regression/tolerance validation and comparisons. |
| diskann-benchmark-runner/src/utils/percentiles.rs | Adds minimum percentile value to output structure (and marks struct #[non_exhaustive]). |
| diskann-benchmark-runner/src/utils/fmt.rs | Adds clippy expectation annotation for panic-based bounds checks. |
| diskann-benchmark-runner/src/test/mod.rs | Reorganizes test benchmark registration and marks regression-capable test benchmarks. |
| diskann-benchmark-runner/src/test/dim.rs | Adds regression checks to dim benchmarks and introduces a non-regression “simple” benchmark. |
| diskann-benchmark-runner/src/test/typed.rs | Adds regression checks to typed benchmarks and introduces tolerance input used by typed regression tests. |
| diskann-benchmark-runner/Cargo.toml | Switches to explicit clippy lint configuration. |
| diskann-benchmark-runner/.clippy.toml | Allows unwrap/expect/panic in tests under clippy. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/stdin.txt | Adds UX test for check skeleton. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/stdout.txt | Golden output for check skeleton. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/README.md | Documents check skeleton UX test scenario. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/stdin.txt | Adds UX test for listing tolerance kinds. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/stdout.txt | Golden output for listing tolerance kinds. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/README.md | Documents tolerance listing UX test. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/stdin.txt | Adds UX test for describing a specific tolerance kind. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/stdout.txt | Golden output for tolerance kind description/skeleton. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/README.md | Documents tolerance kind description UX test. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/stdin.txt | Adds UX test for requesting an unknown tolerance kind. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/stdout.txt | Golden output for unknown tolerance kind. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/README.md | Documents unknown tolerance kind behavior. |
| diskann-benchmark-runner/tests/regression/check-verify-0/stdin.txt | Adds UX test for successful check verify. |
| diskann-benchmark-runner/tests/regression/check-verify-0/tolerances.json | Test tolerance file for successful verification. |
| diskann-benchmark-runner/tests/regression/check-verify-0/input.json | Test input file used for successful verification. |
| diskann-benchmark-runner/tests/regression/check-verify-0/README.md | Documents successful verification behavior (no stdout). |
| diskann-benchmark-runner/tests/regression/check-verify-1/stdin.txt | Adds UX test for unknown tolerance tag error. |
| diskann-benchmark-runner/tests/regression/check-verify-1/stdout.txt | Golden output for unknown tolerance tag error. |
| diskann-benchmark-runner/tests/regression/check-verify-1/tolerances.json | Test tolerance file with unknown tolerance tag. |
| diskann-benchmark-runner/tests/regression/check-verify-1/input.json | Input file used by unknown tolerance tag test. |
| diskann-benchmark-runner/tests/regression/check-verify-1/README.md | Documents unknown tolerance tag error scenario. |
| diskann-benchmark-runner/tests/regression/check-verify-2/stdin.txt | Adds UX test for tolerance/input match but no regression benchmark dispatch. |
| diskann-benchmark-runner/tests/regression/check-verify-2/stdout.txt | Golden output for “no matching regression benchmark” in verify. |
| diskann-benchmark-runner/tests/regression/check-verify-2/tolerances.json | Tolerance file used by the dispatch-failure verify test. |
| diskann-benchmark-runner/tests/regression/check-verify-2/input.json | Input file used by the dispatch-failure verify test. |
| diskann-benchmark-runner/tests/regression/check-verify-2/README.md | Documents dispatch-failure verify scenario. |
| diskann-benchmark-runner/tests/regression/check-verify-3/stdin.txt | Adds UX test covering ambiguous/orphaned/uncovered tolerance matching problems. |
| diskann-benchmark-runner/tests/regression/check-verify-3/stdout.txt | Golden output for matching failure diagnostics. |
| diskann-benchmark-runner/tests/regression/check-verify-3/tolerances.json | Tolerance file constructed to trigger matching errors. |
| diskann-benchmark-runner/tests/regression/check-verify-3/input.json | Input file constructed to trigger matching errors. |
| diskann-benchmark-runner/tests/regression/check-verify-4/stdin.txt | Adds UX test for incompatible tolerance/input tag pairing. |
| diskann-benchmark-runner/tests/regression/check-verify-4/stdout.txt | Golden output for incompatible tag pairing error. |
| diskann-benchmark-runner/tests/regression/check-verify-4/tolerances.json | Tolerance file with mismatched input tag type. |
| diskann-benchmark-runner/tests/regression/check-verify-4/input.json | Input file used by incompatible tag pairing test. |
| diskann-benchmark-runner/tests/regression/check-verify-4/README.md | Documents incompatible tag pairing test (currently needs alignment with actual scenario). |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/stdin.txt | Adds UX test for successful check run and checks.json generation. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/stdout.txt | Golden output for successful regression checks. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/input.json | Input file for passing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/tolerances.json | Tolerance file for passing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/output.json | Golden output.json produced during setup in passing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/checks.json | Golden checks.json produced by check run in pass scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/README.md | Documents pass scenario coverage. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/stdin.txt | Adds UX test where regression check fails (but checks.json still written). |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/stdout.txt | Golden output for failing regression checks. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/input.json | Input file for failing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/tolerances.json | Tolerance file that triggers a failure. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/output.json | Golden output.json for failing run scenario setup. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/checks.json | Golden checks.json for failing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/README.md | Documents failing regression run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/stdin.txt | Adds UX test where check execution errors are surfaced and recorded. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/stdout.txt | Golden output for erroring regression checks. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/input.json | Input file for error scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/tolerances.json | Tolerance file that triggers check errors. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/output.json | Golden output.json for error scenario setup. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/checks.json | Golden checks.json showing error entries. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/README.md | Documents error triage behavior and checks.json writing. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/stdin.txt | Adds UX test for before/after length mismatch detection. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/stdout.txt | Golden output for length mismatch error. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/input.json | Setup input file for generating output.json with 2 entries. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/regression_input.json | Regression input file with 1 job used to trigger length mismatch. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/tolerances.json | Tolerance file for length mismatch scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/output.json | Golden output.json with 2 entries for mismatch scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/README.md | Documents length mismatch scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/stdin.txt | Adds UX test for input drift causing “no matching regression benchmark” during run. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/stdout.txt | Golden output for drift/no-match error. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/input.json | Setup input file for generating output.json. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/regression_input.json | Drifted input file used for check run. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/tolerances.json | Tolerance file for drift scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/output.json | Golden output.json from setup run. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/README.md | Documents drift scenario expectations. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/stdin.txt | Adds UX test for before/after schema drift (deserialization error) handling. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/stdout.txt | Golden output for schema drift error. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/input.json | Setup input file used to generate integer results. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/regression_input.json | Regression input expecting string results used to trigger schema mismatch. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/tolerances.json | Tolerance file for schema drift scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/output.json | Golden output.json containing integer results. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/checks.json | Golden checks.json containing structured error output. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/README.md | Documents schema drift scenario and expectations. |
| diskann-benchmark-runner/tests/benchmark/test-0/stdin.txt | Adds baseline UX test for skeleton. |
| diskann-benchmark-runner/tests/benchmark/test-0/stdout.txt | Golden output for skeleton. |
| diskann-benchmark-runner/tests/benchmark/test-0/README.md | Documents skeleton UX test. |
| diskann-benchmark-runner/tests/benchmark/test-1/stdin.txt | Adds baseline UX test for inputs. |
| diskann-benchmark-runner/tests/benchmark/test-1/stdout.txt | Golden output for inputs list. |
| diskann-benchmark-runner/tests/benchmark/test-1/README.md | Documents inputs list UX test. |
| diskann-benchmark-runner/tests/benchmark/test-2/stdin.txt | Adds baseline UX test for inputs <NAME>. |
| diskann-benchmark-runner/tests/benchmark/test-2/stdout.txt | Golden output for inputs test-input-dim. |
| diskann-benchmark-runner/tests/benchmark/test-2/README.md | Documents inputs <NAME> behavior. |
| diskann-benchmark-runner/tests/benchmark/test-3/stdin.txt | Adds baseline UX test for inputs test-input-types. |
| diskann-benchmark-runner/tests/benchmark/test-3/stdout.txt | Golden output for inputs test-input-types. |
| diskann-benchmark-runner/tests/benchmark/test-3/README.md | Documents typed input example output. |
| diskann-benchmark-runner/tests/benchmark/test-4/stdin.txt | Adds baseline UX test for benchmarks. |
| diskann-benchmark-runner/tests/benchmark/test-4/stdout.txt | Golden output for benchmark listing (now includes simple-bench). |
| diskann-benchmark-runner/tests/benchmark/test-4/README.md | Documents benchmarks listing behavior. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/stdin.txt | Adds baseline “successful run generates output.json” UX test. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/stdout.txt | Golden stdout for successful run. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/input.json | Input file for successful run. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/output.json | Golden output.json for successful run. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/README.md | Documents successful run behavior. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/stdin.txt | Adds baseline --dry-run UX test. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/stdout.txt | Golden stdout for dry-run. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/input.json | Input file for dry-run. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/README.md | Documents dry-run behavior. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/stdin.txt | Adds UX test for MatchScore overload resolution. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/stdout.txt | Golden stdout for overload resolution. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/input.json | Input file for overload resolution test. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/output.json | Golden output.json for overload resolution test. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/README.md | Documents overload resolution expectations. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdin.txt | Adds UX test for mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdout.txt | Golden mismatch diagnostics output. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/input.json | Input file that triggers mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/README.md | Documents mismatch diagnostics output. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdin.txt | Adds UX test for ExactTypeBench mismatch reason paths. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdout.txt | Golden output for ExactTypeBench mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/input.json | Input file for ExactTypeBench mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/README.md | Documents ExactTypeBench mismatch reasons. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdin.txt | Adds UX test for input deserialization error reporting. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdout.txt | Golden output for deserialization errors. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/input.json | Input file containing an invalid datatype variant. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/README.md | Documents deserialization error reporting. |
```rust
#[test]
fn check_verify() {
    let input_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("examples")
        .join("simd-scalar.json");
    let tolerance_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("examples")
        .join("tolerance.json");

    let stdout = run_check_test(&input_path, &tolerance_path);
    println!("stdout = {}", stdout);
}
```
check_verify test doesn't assert anything about the verify output (it only prints it). This can let regressions slip through while still passing and also adds noisy stdout to cargo test output. Consider asserting the expected behavior (e.g., verify exits successfully and prints nothing / prints a stable message), and remove the println! unless the test fails.
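For instance, a hedged sketch of the assertion this comment suggests. `run_check_test` is the PR's helper and is not shown here, so a stub stands in for it; the point is asserting on the captured output instead of printing it:

```rust
// Stub standing in for the PR's `run_check_test` helper (hypothetical here):
// a successful `check verify` is documented elsewhere in this PR as printing
// nothing, so an empty string models the expected success case.
fn run_check_test_stub() -> String {
    String::new()
}

fn main() {
    let stdout = run_check_test_stub();
    // Assert the expected behavior instead of println!-ing it: the test now
    // fails loudly if verify starts emitting output.
    assert!(
        stdout.is_empty(),
        "check verify printed unexpected output: {stdout}"
    );
}
```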
```rust
impl RunResult {
    fn computations_per_latency(&self) -> usize {
        self.run.num_points.get() * self.run.loops_per_measurement.get()
    }
}
```
computations_per_latency multiplies two usize values from user-controlled input (num_points and loops_per_measurement). This can overflow silently (wrap) in release builds and skew the computed per-point timings and regression checks. Use checked_mul (and return an error / fail the check) to avoid incorrect results on large inputs.
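A hedged sketch of what the suggested fix could look like. The field and type names are simplified from the diff (the real fields are non-zero integers accessed via `.get()`); only the `checked_mul` pattern is the point:

```rust
// Simplified stand-in for the PR's `RunResult`; field names are illustrative.
struct Run {
    num_points: usize,
    loops_per_measurement: usize,
}

impl Run {
    /// Returns `None` instead of silently wrapping on overflow in release
    /// builds, so callers can surface an error rather than skew timings.
    fn computations_per_latency(&self) -> Option<usize> {
        self.num_points.checked_mul(self.loops_per_measurement)
    }
}

fn main() {
    let small = Run { num_points: 1_000, loops_per_measurement: 64 };
    assert_eq!(small.computations_per_latency(), Some(64_000));

    // A user-controlled pathological input no longer wraps silently.
    let huge = Run { num_points: usize::MAX, loops_per_measurement: 2 };
    assert_eq!(huge.computations_per_latency(), None);
}
```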
```rust
/// * $INPUT_FILE - Resolves to `input.json` in the same directory as the `stdin.txt` file.
/// * $OUTPUT_FILE - Resolves to `output.json` in a temporary directory.
/// * $TOLERANCES_FILE - Resolves to `tolerances.json` in the test directory.
/// * $REGRESSION_INPUT_FILE - Resolves to `regression_input.json` test directory.
/// * $CHECK_OUTPUT_FILE - Resolves to `checks.json` in a temporary directory.
```
The UX test docs list $INPUT_FILE, $OUTPUT_FILE, $TOLERANCES_FILE, etc., but the test harness actually recognizes $INPUT, $OUTPUT, $TOLERANCES, $REGRESSION_INPUT, and $CHECK_OUTPUT. This mismatch makes the documentation misleading; update the bullet list to match the implemented tokens (or rename the tokens for consistency).
Suggested change:

```diff
-/// * $INPUT_FILE - Resolves to `input.json` in the same directory as the `stdin.txt` file.
-/// * $OUTPUT_FILE - Resolves to `output.json` in a temporary directory.
-/// * $TOLERANCES_FILE - Resolves to `tolerances.json` in the test directory.
-/// * $REGRESSION_INPUT_FILE - Resolves to `regression_input.json` test directory.
-/// * $CHECK_OUTPUT_FILE - Resolves to `checks.json` in a temporary directory.
+/// * $INPUT - Resolves to `input.json` in the same directory as the `stdin.txt` file.
+/// * $OUTPUT - Resolves to `output.json` in a temporary directory.
+/// * $TOLERANCES - Resolves to `tolerances.json` in the test directory.
+/// * $REGRESSION_INPUT - Resolves to `regression_input.json` in the test directory.
+/// * $CHECK_OUTPUT - Resolves to `checks.json` in a temporary directory.
```
```rust
/// To reach this point, we at least the structure of the input JSON to be correct and
/// parseable. However, we have not yet mapped the raw JSON of any of the registered inputs.
```
Grammar in this doc comment is broken and makes the intent unclear. It currently reads "we at least the structure..."; consider changing it to something like "we require at least the structure of the input JSON to be correct and parseable".
Suggested change:

```diff
-/// To reach this point, we at least the structure of the input JSON to be correct and
-/// parseable. However, we have not yet mapped the raw JSON of any of the registered inputs.
+/// To reach this point, we require at least the structure of the input JSON to be correct
+/// and parseable. However, we have not yet mapped the raw JSON of any of the registered
+/// inputs.
```
> Here, we test that if a valid "tolerance" is linked with a valid "input" in "tolerance.json"
> but there is not registered benchmark where these two tags are linked, we get a reasonable error.
This README describes the scenario as "no registered benchmark where these two tags are linked" and refers to "tolerance.json", but the test data here is about an incompatible input/tolerance tag pair and the file is tolerances.json. Updating this description will make the intent of the regression test clearer and prevent confusion when debugging failures.
This adds support for native A/B testing in `diskann-benchmark-runner`, with an example implementation added to `diskann-benchmark-simd`. There are real benefits to having this infrastructure at the Rust level.

## Concept
The idea is to use one `input.json` file to generate two output JSON files, `before.json` and `after.json`, for different builds or configurations of the library. The `input.json` is then accompanied by a `tolerances.json`, which contains runtime thresholds for values of interest to help accommodate runtime variability. A regression check takes all such files and performs the following steps:

1. Parse `tolerances.json` and `input.json`.
2. Match entries in `tolerances.json` to entries in `input.json`. I went with matching semantics rather than requiring a one-to-one correspondence to make it easier to have a single tolerance entry work as a blanket entry for multiple benchmark runs.
3. Take the matched `tolerance`/`input` pairs and match them with a regression-checkable benchmark.
4. Deserialize the entries of `before/after.json` into the benchmark's output type, or error gracefully if this cannot be done due to an incorrect environment.

The matching semantics work like this. Each tolerance entry looks like the following:
```json
{
    "input": {
        "type": "type-tag",
        "content": {}
    },
    "tolerance": {
        "type": "tolerance-type-tag",
        "content": "defined-per-tolerance"
    }
}
```

The content field of "input" need not be deserializable to its corresponding value. Instead, we use the raw JSON of the input and match it as a "subset" against the raw JSON of the `input.json` using the following rules:

- An array `x` is a subset of an array `y` if `x.len() <= y.len()` and each entry `i` in `x` is a subset of its corresponding entry in `y` (i.e., we match prefixes).
- An object `x` is a subset of an object `y` if each key in `x` is a key in `y` and each value associated with a key in `x` is a subset of the value of the same entry in `y`.
- Mismatched types break the match instantly; for example, a `bool` cannot be a subset of an `integer`.

This means that an empty "content" field will match any struct and thus be a blanket implementation for the input with the same type tag. Or, the "content" field can be refined to be more specific as needed.
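The subset rules above can be sketched as follows. This uses a hand-rolled `Json` enum rather than the runner's actual JSON representation, so all names here are illustrative, and the exact-equality rule for scalars is an assumption on my part:

```rust
// Minimal JSON value type, standing in for whatever the runner actually uses.
#[derive(PartialEq)]
enum Json {
    Bool(bool),
    Int(i64),
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

fn is_subset(x: &Json, y: &Json) -> bool {
    match (x, y) {
        // Arrays match as prefixes: every entry of `x` must be a subset of
        // the corresponding entry of `y`.
        (Json::Array(xs), Json::Array(ys)) => {
            xs.len() <= ys.len() && xs.iter().zip(ys).all(|(a, b)| is_subset(a, b))
        }
        // Every key of `x` must exist in `y` with a subset-matching value.
        (Json::Object(xs), Json::Object(ys)) => xs
            .iter()
            .all(|(k, v)| ys.iter().any(|(k2, v2)| k == k2 && is_subset(v, v2))),
        // Scalars: assumed exact equality; a type mismatch (e.g. bool vs.
        // integer) breaks the match instantly.
        _ => x == y,
    }
}

fn main() {
    // An empty object is a blanket match for any object with the same tag.
    let blanket = Json::Object(vec![]);
    let concrete = Json::Object(vec![("dim".into(), Json::Int(128))]);
    assert!(is_subset(&blanket, &concrete));
    assert!(!is_subset(&concrete, &blanket));

    // Arrays match prefixes.
    let short = Json::Array(vec![Json::Int(1)]);
    let long = Json::Array(vec![Json::Int(1), Json::Int(2)]);
    assert!(is_subset(&short, &long));
    assert!(!is_subset(&long, &short));

    // Type mismatches fail immediately.
    assert!(!is_subset(&Json::Bool(true), &Json::Int(1)));
    assert!(!is_subset(&Json::Str("a".into()), &Json::Int(1)));
}
```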
Since a single tolerance entry may match multiple inputs, we ensure that the matching is unambiguous. If there is any ambiguity, the app stops with an error.
## CLI Changes

This adds a new `check` subcommand to the runner CLI with the following options:

- `check skeleton`: Print a skeleton tolerance JSON file.
- `check tolerances [NAME]`: List tolerance kinds, or describe one by name. This is similar to `inputs [NAME]`.
- `check verify --tolerances <FILE> --input-file <FILE>`: Validate a tolerance file against an input file. This runs up to step 3 of the checklist above and serves as a pre-flight check before any CI jobs are run. Errors with the setup that can be caught early will be, and can thus save CI time.
- `check run --tolerances <FILE> --input-file <FILE> --before <FILE> --after <FILE> [--output-file <FILE>]`: Run regression checks.
Regression checks are opt-in. Benchmarks that wish to opt-in implement
benchmark::Regressionand the singularcheckmethod. All logic for the before and after comparison lives in thecheckmethod. Such benchmarks also need to useregistry::Benchmarks::register_regressionto be correctly tracked as regression compatible. No independent registration of theToleranceassociated type is needed.That is it.
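For illustration only, opting in might look roughly like the following. The actual `Regression` trait and `PassFail` type live in `benchmark.rs` and are not quoted in this description, so every signature, name, and number below is an assumption, not the PR's API:

```rust
// Hypothetical shapes; the real trait in `benchmark.rs` may differ.
enum PassFail {
    Pass,
    Fail(String),
}

trait Regression {
    /// The benchmark's output type, as deserialized from before/after runs.
    type Output;
    /// The tolerance input type parsed from `tolerances.json`.
    type Tolerance;

    /// Compare a before/after pair under the given tolerance.
    fn check(tol: &Self::Tolerance, before: &Self::Output, after: &Self::Output) -> PassFail;
}

struct LatencyBench;

struct LatencyTolerance {
    /// Maximum allowed relative increase in latency, e.g. 0.10 for 10%.
    max_relative_increase: f64,
}

impl Regression for LatencyBench {
    type Output = f64; // mean latency, for illustration
    type Tolerance = LatencyTolerance;

    fn check(tol: &LatencyTolerance, before: &f64, after: &f64) -> PassFail {
        let change = (after - before) / before;
        if change <= tol.max_relative_increase {
            PassFail::Pass
        } else {
            PassFail::Fail(format!("latency regressed by {:.1}%", change * 100.0))
        }
    }
}

fn main() {
    let tol = LatencyTolerance { max_relative_increase: 0.10 };
    // 5% slower: within tolerance.
    assert!(matches!(
        <LatencyBench as Regression>::check(&tol, &100.0, &105.0),
        PassFail::Pass
    ));
    // 50% slower: a regression.
    assert!(matches!(
        <LatencyBench as Regression>::check(&tol, &100.0, &150.0),
        PassFail::Fail(_)
    ));
}
```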
Note that `check` should not print anything to `stdout`, and should instead communicate success/failure solely through its return type to avoid spamming the output.

## Example
You can see this in action by running the example in `diskann-benchmark-simd`. Depending on the noise in your system, results will vary (note that the `ci` profile should be used for more reliable measurements). This particular implementation has decided that a regression cannot be meaningfully detected if an execution time got rounded to zero.
## Suggested Reviewing Order

In `diskann-benchmark-runner`:

- `benchmark.rs`: This contains the new `Regression` trait and the internal type-erasure machinery that is used inside the registry to mark a benchmark as regression-capable.
- `registry.rs`: The new `register_regression` API as well as methods for retrieving and interrogating regression-capable benchmarks.
- `internal/regression.rs`: This is the meat and potatoes of this PR. It contains all the logic to perform the outlined steps and has pretty good module-level documentation that outlines the approach taken.
- `app.rs`: The new `Check` subcommand and routing. This is pretty straightforward save for the changes to tests (see below).
- `utils/num.rs`: I added two new opinionated utilities to help with writing checks:
  - `relative_change`: Compute the relative change between two values, handling corner cases.
  - `NonNegativeFinite`: Useful for serde-compatible assertions that tolerance entries have the obvious properties the name implies.
- `jobs.rs`: The changes here mainly make it easier to interact with the JSON patterns used by the input files, for more uniform handling of said files.

Changes to the testing infrastructure:
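As a hedged sketch of the kind of corner-case handling `relative_change` needs (the actual implementation in `utils/num.rs` is not quoted here and may differ, including its return type):

```rust
/// Illustrative relative-change computation: `None` signals "no meaningful
/// change can be computed" rather than producing a misleading number.
fn relative_change(before: f64, after: f64) -> Option<f64> {
    // NaN or infinite inputs cannot produce a meaningful change.
    if !before.is_finite() || !after.is_finite() {
        return None;
    }
    // A zero baseline (e.g. an execution time rounded to zero) makes the
    // relative change undefined, mirroring the PR's note that a regression
    // cannot be meaningfully detected in that case.
    if before == 0.0 {
        return None;
    }
    Some((after - before) / before)
}

fn main() {
    assert_eq!(relative_change(100.0, 110.0), Some(0.1)); // 10% slower
    assert_eq!(relative_change(100.0, 50.0), Some(-0.5)); // 50% faster
    assert_eq!(relative_change(0.0, 5.0), None); // zero baseline
    assert_eq!(relative_change(f64::NAN, 1.0), None); // garbage in
}
```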
- The UX tests in `app.rs` have been upgraded to handle tolerance checks as well. This involves having more magically escaped patterns for input/output files and supporting multi-line `stdin.txt` files to allow regression tests to set up their respective environment.
- Much of the logic in `internal/regression.rs` is tested via the UX tests because (1) setting up a proper environment for this functionality is challenging and (2) the UX tests provide much better visual information on the expected behavior. A lot of new UX tests have been added.

In `diskann-benchmark-simd`: the changes here are largely meant to work as an example of adding regression tests. Since `diskann-benchmark-simd` is not a production-critical crate, feel free to just skim this or ignore it entirely.

## Future Ideas
There are parts of this that are not perfect:

- One could keep a history of `before` files and build a statistical picture of trends. That opens a whole can of worms like schema stability, machine stability (if comparing run times), etc. that are way beyond the scope of what's needed to get basic regression tests running.
- A `--verbose` flag could print out all diagnostics rather than using the opinionated triage of errors, then failures, then successes.

Disclaimer: This PR (though not the PR description) was written with the help of AI but has been heavily edited, and is something that I wouldn't be annoyed at reviewing outside of it being +3500 lines (I am sorry).