fix(explore): integrate nanoregex for correct regex matching (#454) #458
Merged
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
Replace the homegrown regex matcher with nanoregex (justrach/nanoregex), a pure-Zig Thompson-NFA/DFA engine with Python-re-compatible semantics.

Key correctness fix: the old matcher silently treated \b as literal 'b' instead of a word-boundary assertion, causing false matches (issue #454). nanoregex correctly handles \b, \B, and {n,m} quantifiers, and is immune to catastrophic backtracking on patterns like (a+)+b.

Also patches a false-negative bug in nanoregex's extractLiteralPrefix prefilter: patterns like hel+o incorrectly computed "helo" as the literal prefix (skipping matches in haystacks like "helllo"). Fixed by making collectPrefix return a stop-signal bool so the concat loop halts after any quantified node even when that node extended the prefix.

Performance: nanoregex is 4-6x faster than the homegrown matcher on common codedb_search shapes (literal, alternation, dot-star, char-class) due to the DFA table-lookup hot path after warmup.

Changes:
- build.zig.zon: add nanoregex dependency
- build.zig: wire nanoregex module into exe, tests, adversarial_tests
- src/explore.zig: replace regexMatch + 7 helper functions (~300 lines) with nanoregex-backed implementations; swap two call sites to compile once per file rather than per line
- src/tests.zig: add failing test for issue-454 word-boundary behaviour
- zig-pkg/nanoregex-*/src/prefilter.zig: fix extractLiteralPrefix bug

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
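The prefilter fix described in this commit message can be illustrated with a toy model. The sketch below is Python, not the Zig code in prefilter.zig, and it assumes a much simpler pattern language (literal characters with optional `+`, `*`, `?`). The names `parse`, `collect_prefix`, `literal_prefix_buggy` and `literal_prefix_fixed` are hypothetical; only the stop-signal idea is taken from the commit message.

```python
from typing import List, Tuple

# A toy pattern is a concatenation of (literal_char, quantifier) nodes,
# where quantifier is "" (exactly once), "+", "*", or "?".
Node = Tuple[str, str]


def parse(pattern: str) -> List[Node]:
    """Split a toy pattern into (char, quantifier) nodes."""
    nodes: List[Node] = []
    i = 0
    while i < len(pattern):
        ch, quant = pattern[i], ""
        if i + 1 < len(pattern) and pattern[i + 1] in "+*?":
            quant = pattern[i + 1]
            i += 1
        nodes.append((ch, quant))
        i += 1
    return nodes


def literal_prefix_buggy(pattern: str) -> str:
    """Models the reported bug: a '+' node extends the prefix and the loop keeps going."""
    out: List[str] = []
    for ch, quant in parse(pattern):
        if quant in ("*", "?"):
            break
        out.append(ch)  # for "l+" this appends "l" and then continues to "o"
    return "".join(out)


def collect_prefix(node: Node, out: List[str]) -> bool:
    """Append what this node guarantees; return True if the caller must stop."""
    ch, quant = node
    if quant == "":
        out.append(ch)
        return False  # plain literal: the prefix may keep growing
    if quant == "+":
        out.append(ch)  # one occurrence is guaranteed, more may follow
    return True  # any quantified node ends the prefix, even if it extended it


def literal_prefix_fixed(pattern: str) -> str:
    """Stops at the first quantified node, as the commit message describes."""
    out: List[str] = []
    for node in parse(pattern):
        if collect_prefix(node, out):
            break
    return "".join(out)
```

With the buggy version, `hel+o` yields the prefix `helo`, which is not a substring of `helllo`, so a substring prefilter would skip a haystack that the pattern does match. The fixed version yields `hel`, which every match must start with.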
src/bench.zig and src/benchmark.zig both import Explorer from explore.zig, which now requires the nanoregex module. CI's bench-regression workflow caught this.

Also wire nanoregex into the wasm target for completeness (the wasm build is broken on main anyway due to unrelated std API drift, but this keeps the import consistent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
76aac71 to 86db0a4
Summary
Fixes #454. Replaces the ~300-line homegrown regex matcher with justrach/nanoregex, a pure-Zig Thompson-NFA/DFA engine.

Correctness wins
- `\b` (word boundary) now works. Previously it was silently treated as the escaped literal `b`, so `\bfoo\b` matched nothing.
- `{n,m}` bounded quantifiers: previously unsupported.
- Lazy quantifiers (`*?`, `+?`): previously parsed as greedy followed by literal `?`.
- Patterns like `(a+)+b` can't catastrophically backtrack.

Performance
End-to-end in-process benchmark on codedb's own ~1.2MB source corpus:
The anchored case is sub-millisecond either way and isn't a typical workload. All non-anchored shapes are 2.7-4.3x faster.
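One structural change in this PR is that the search paths compile the pattern once per file rather than once per line. A minimal sketch of that shape follows, using Python's `re` as a stand-in. This is not nanoregex's API, and `search_in_content` is a hypothetical name loosely modelled on the `searchInContent*Regex` paths.

```python
import re
from typing import List, Tuple


def search_in_content(pattern: str, content: str) -> List[Tuple[int, str]]:
    """Return (1-based line number, line) for every line that matches pattern."""
    rx = re.compile(pattern)  # compiled once for the whole file, outside the loop
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(content.splitlines(), start=1):
        if rx.search(line):  # per-line work is matching only, no compilation
            hits.append((lineno, line))
    return hits
```

Python's `re` caches compiled patterns internally, so the difference is less visible there. For an engine that builds its automaton at compile time, hoisting compilation out of the per-line loop avoids repeating that work for every line.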
Changes
- `build.zig.zon`: add nanoregex dependency
- `build.zig`: wire nanoregex module into `exe`, `tests`, `adversarial_tests`
- `src/explore.zig`: remove 7 homegrown matcher functions (~300 lines), replace `regexMatch` with a thin nanoregex-backed shim, rewrite both `searchInContent*Regex` paths to compile once per file rather than per line
- `src/tests.zig`: add `issue-454` failing test for `\b`
- `zig-pkg/nanoregex-*/src/prefilter.zig`: fix `extractLiteralPrefix` (`hel+o` was computing prefix `"helo"` and skipping `"helllo"`). Worth pushing upstream to justrach/nanoregex.

Test plan
- `zig build test` passes (520/520).
- `regexMatch:` tests still pass: full semantic parity for the previously-supported subset.
- `issue-454` test now passes.

Notes
- `fn [a-zA-Z_]+` (855 vs 847): nanoregex is being stricter. A separate recall/precision audit against Python's `re` (in progress) will confirm which is more correct.
- `src/index.zig:2259` (`decomposeRegex`) is independent and stays: it parses regex syntax to extract trigrams, not to match.

🤖 Generated with Claude Code
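Since the audit mentioned in the notes uses Python's `re` as the reference, the behaviours listed under the correctness wins can be pinned down there first. The sketch below is one possible starting point, not the audit itself. `reference_behaviour` and `count_matching_lines` are hypothetical helpers, and counting per line is an assumption about how the 855 vs 847 figures were measured.

```python
import re


def reference_behaviour() -> dict:
    """Evaluate a few patterns with Python's re and return the observed results."""
    return {
        # \b is a zero-width word-boundary assertion, not the letter "b".
        "boundary_hit": bool(re.search(r"\bfoo\b", "a foo b")),
        "boundary_inside_word": bool(re.search(r"\bfoo\b", "foobar")),
        "boundary_as_literal_b": bool(re.search(r"\bfoo\b", "bfoob")),
        # {n,m} bounds the repetition count.
        "bounded_ok": bool(re.fullmatch(r"a{2,3}", "aaa")),
        "bounded_too_many": bool(re.fullmatch(r"a{2,3}", "aaaa")),
        # +? is lazy: it stops at the first ">" rather than the last.
        "lazy": re.search(r"<.+?>", "<a><b>").group(),
        "greedy": re.search(r"<.+>", "<a><b>").group(),
    }


def count_matching_lines(pattern: str, text: str) -> int:
    """Count lines with at least one match, as a line-oriented search would."""
    rx = re.compile(pattern)
    return sum(1 for line in text.splitlines() if rx.search(line))
```

`(a+)+b` is deliberately left out: Python's `re` is a backtracking engine, so that pattern can take exponential time on a long run of `a` with no `b`, which is the failure mode an NFA/DFA engine avoids.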