fix(explore): integrate nanoregex for correct regex matching (#454) by justrach · Pull Request #458 · justrach/codedb

justrach · 2026-05-11T17:20:42Z

Summary

Fixes #454. Replaces the ~300-line homegrown regex matcher with justrach/nanoregex, a pure-Zig Thompson-NFA/DFA engine.

Correctness wins

\b (word boundary) now works — previously it was silently treated as the escaped literal b, so \bfoo\b matched nothing.
{n,m} bounded quantifiers — previously unsupported.
Lazy quantifiers (*?, +?) — previously parsed as greedy followed by literal ?.
ReDoS-safe: patterns like (a+)+b can't catastrophically backtrack.
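
For reference, the first three points can be checked against Python's re, which the commit message names as the compatibility target. This is only a sketch of the expected semantics using the standard library; it does not call nanoregex, and the sample strings are made up for illustration.

```python
import re

text = "foo bfoob food {foo}"

# \b is a zero-width word-boundary assertion: only standalone "foo" matches.
word_hits = [m.start() for m in re.finditer(r"\bfoo\b", text)]

# Reading \b as an escaped literal 'b' turns the pattern into "bfoob",
# which only hits text that happens to contain that exact string.
literal_hits = [m.start() for m in re.finditer(r"bfoob", text)]

# {n,m} bounds the repetition count; the match is greedy up to m.
bounded = re.search(r"a{2,3}", "aaaa").group()

# A lazy quantifier stops at the first position that lets the rest match.
greedy = re.search(r"<.+>", "<a><b>").group()
lazy = re.search(r"<.+?>", "<a><b>").group()

print(word_hits, literal_hits, bounded, greedy, lazy)
```

The (a+)+b case is left out on purpose: Python's re is a backtracking engine, so it is not a good reference for the ReDoS claim.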

Performance

End-to-end in-process benchmark on codedb's own ~1.2MB source corpus:

pattern	main (ms)	nanoregex (ms)	speedup
literal-common	5.053	1.886	2.68x
literal-rare	2.287	0.531	4.31x
alternation	16.668	4.282	3.89x
dot-star	6.358	1.995	3.19x
char-class	5.634	1.309	4.30x
anchored	0.752	2.616	0.29x ⚠

The anchored case is slower (0.752 ms on main versus 2.616 ms with nanoregex), but it stays under 3 ms and isn't a typical workload. All non-anchored shapes are 2.7-4.3x faster.
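
The speedup column appears to be the main timing divided by the nanoregex timing, rounded to two places. A quick sketch that recomputes it from the table values (the dictionary and helper names are illustrative, not part of codedb):

```python
# Timings in milliseconds copied from the table: (main, nanoregex).
timings_ms = {
    "literal-common": (5.053, 1.886),
    "literal-rare": (2.287, 0.531),
    "alternation": (16.668, 4.282),
    "dot-star": (6.358, 1.995),
    "char-class": (5.634, 1.309),
    "anchored": (0.752, 2.616),
}


def speedup(main_ms, nanoregex_ms):
    """Ratio > 1 means nanoregex is faster, < 1 means it is slower."""
    return main_ms / nanoregex_ms


speedups = {name: round(speedup(*pair), 2) for name, pair in timings_ms.items()}
non_anchored = [v for name, v in speedups.items() if name != "anchored"]

print(speedups)
print(min(non_anchored), max(non_anchored))
```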

Changes

build.zig.zon: add nanoregex dependency
build.zig: wire nanoregex module into exe, tests, adversarial_tests
src/explore.zig: remove 7 homegrown matcher functions (~300 lines), replace regexMatch with a thin nanoregex-backed shim, rewrite both searchInContent*Regex paths to compile once per file rather than per line
src/tests.zig: add issue-454 failing test for \b
Bonus upstream fix: patched a false-negative in nanoregex's extractLiteralPrefix (hel+o was computing prefix "helo" and skipping "helllo"). Worth pushing upstream to justrach/nanoregex.
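
The "compile once per file rather than per line" change in src/explore.zig can be pictured with the sketch below. It uses Python's re instead of nanoregex, and both function names are hypothetical stand-ins for the searchInContent*Regex paths. Note that CPython caches compiled patterns internally, so the cost gap is smaller in Python than in an engine that rebuilds its NFA/DFA on every compile; the point is only the shape of the loop.

```python
import re


def search_compile_per_line(pattern, content):
    """Old shape: the pattern is compiled again for every line."""
    hits, compiles = [], 0
    for lineno, line in enumerate(content.splitlines(), start=1):
        regex = re.compile(pattern)
        compiles += 1
        if regex.search(line):
            hits.append(lineno)
    return hits, compiles


def search_compile_once(pattern, content):
    """New shape: compile once, then reuse the compiled regex per line."""
    regex = re.compile(pattern)
    hits = [
        lineno
        for lineno, line in enumerate(content.splitlines(), start=1)
        if regex.search(line)
    ]
    return hits, 1


content = "fn main() {}\nconst x = 1;\nfn helper() {}\n"
old_hits, old_compiles = search_compile_per_line(r"\bfn\b", content)
new_hits, new_compiles = search_compile_once(r"\bfn\b", content)
print(old_hits, old_compiles, new_hits, new_compiles)
```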

Test plan

zig build test passes (520/520).
All existing regexMatch: tests still pass — full semantic parity for the previously-supported subset.
issue-454 test now passes.

Notes

One match-count discrepancy with the old engine on fn [a-zA-Z_]+ (855 vs 847) — nanoregex is being stricter. A separate recall/precision audit against Python's re (in progress) will confirm which is more correct.
The trigram prefilter at src/index.zig:2259 (decomposeRegex) is independent and stays — it parses regex syntax to extract trigrams, not to match.
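
To illustrate why a trigram prefilter is separate from matching: it can only rule files out, so a real matcher still has to confirm each candidate. The sketch below is a generic trigram check on an already-extracted literal. It is not the decomposeRegex implementation, and the helper names are hypothetical.

```python
def trigrams(s):
    """All overlapping 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def may_contain(literal, file_trigrams):
    """True if every trigram of the literal occurs somewhere in the file.

    False means the literal cannot occur; True only means it might.
    """
    return trigrams(literal) <= file_trigrams


source = "pub fn regexMatch("
index = trigrams(source)
hit = may_contain("regexMatch", index)
miss = may_contain("nanoregex", index)

# False positive: all trigrams are present but never contiguously,
# so the regex engine must still run on the candidate file.
tricky = "abcd bcde"
false_positive = may_contain("abcde", trigrams(tricky)) and "abcde" not in tricky

print(hit, miss, false_positive)
```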

🤖 Generated with Claude Code

github-actions · 2026-05-11T17:38:26Z

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool	Base (ns)	Head (ns)	Delta	Abs Delta (ns)	Status
`codedb_bundle`	341709	342890	+0.35%	+1181	OK
`codedb_changes`	30985	31802	+2.64%	+817	OK
`codedb_deps`	4931	5038	+2.17%	+107	OK
`codedb_edit`	4314	4868	+12.84%	+554	NOISE
`codedb_find`	43504	40877	-6.04%	-2627	OK
`codedb_hot`	56655	56972	+0.56%	+317	OK
`codedb_outline`	186709	180941	-3.09%	-5768	OK
`codedb_read`	53718	56209	+4.64%	+2491	OK
`codedb_search`	159412	163790	+2.75%	+4378	OK
`codedb_snapshot`	209328	212808	+1.66%	+3480	OK
`codedb_status`	238170	239573	+0.59%	+1403	OK
`codedb_symbol`	36211	36529	+0.88%	+318	OK
`codedb_tree`	38253	44133	+15.37%	+5880	NOISE
`codedb_word`	40098	38855	-3.10%	-1243	OK
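
The Status column is consistent with the two-threshold rule stated in the report. Below is a minimal sketch of that rule as inferred from the rows shown. The "FAIL" label, the strict comparisons, and the function name are assumptions, since no row in these reports crosses both thresholds.

```python
PCT_THRESHOLD = 10.0        # percent, from the report header
ABS_THRESHOLD_NS = 50_000   # nanoseconds, from the report header


def classify(base_ns, head_ns):
    """Classify one benchmark row; only slowdowns can be flagged."""
    delta_ns = head_ns - base_ns
    pct = 100.0 * delta_ns / base_ns
    if pct <= PCT_THRESHOLD:
        return "OK"      # within the percentage budget, or an improvement
    if delta_ns <= ABS_THRESHOLD_NS:
        return "NOISE"   # large percentage, but too small in absolute terms
    return "FAIL"        # assumed label for a real regression


print(classify(4314, 4868))      # codedb_edit
print(classify(38253, 44133))    # codedb_tree
print(classify(341709, 342890))  # codedb_bundle
```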

Replace the homegrown regex matcher with nanoregex (justrach/nanoregex), a pure-Zig Thompson-NFA/DFA engine with Python-re-compatible semantics.

Key correctness fix: the old matcher silently treated \b as literal 'b' instead of a word-boundary assertion, causing false matches (issue #454). nanoregex correctly handles \b, \B, {n,m} quantifiers, and is immune to catastrophic backtracking on patterns like (a+)+b.

Also patches a false-negative bug in nanoregex's extractLiteralPrefix prefilter: patterns like hel+o incorrectly computed "helo" as the literal prefix (skipping matches in haystacks like "helllo"). Fixed by making collectPrefix return a stop-signal bool so the concat loop halts after any quantified node even when that node extended the prefix.

Performance: nanoregex is 4-6x faster than the homegrown matcher on common codedb_search shapes (literal, alternation, dot-star, char-class) due to DFA table-lookup hot path after warmup.

Changes:
- build.zig.zon: add nanoregex dependency
- build.zig: wire nanoregex module into exe, tests, adversarial_tests
- src/explore.zig: replace regexMatch + 7 helper functions (~300 lines) with nanoregex-backed implementations; swap two call sites to compile once per file rather than per line
- src/tests.zig: add failing test for issue-454 word-boundary behaviour
- zig-pkg/nanoregex-*/src/prefilter.zig: fix extractLiteralPrefix bug

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
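
The stop-signal idea from the prefilter fix can be sketched on a toy regex AST. This is not nanoregex's code: the node classes are invented for illustration, and collect_prefix / extract_literal_prefix only mirror the names used in the commit message.

```python
from dataclasses import dataclass


@dataclass
class Lit:
    ch: str


@dataclass
class Plus:        # one or more repetitions of inner
    inner: object


@dataclass
class Star:        # zero or more repetitions of inner
    inner: object


@dataclass
class Concat:
    parts: list


def collect_prefix(node, out):
    """Append guaranteed leading literals to out.

    Returns True when the caller must stop extending the prefix.
    """
    if isinstance(node, Lit):
        out.append(node.ch)
        return False
    if isinstance(node, Plus):
        # The first iteration is mandatory, so it may extend the prefix,
        # but the repeat count is unknown, so nothing after it is fixed.
        collect_prefix(node.inner, out)
        return True
    if isinstance(node, Concat):
        for part in node.parts:
            if collect_prefix(part, out):
                return True  # halt the concat loop on the stop signal
        return False
    return True  # Star and anything else: contribute nothing and stop


def extract_literal_prefix(node):
    out = []
    collect_prefix(node, out)
    return "".join(out)


# hel+o: continuing past the quantified 'l' would yield "helo".
hel_plus_o = Concat([Lit("h"), Lit("e"), Plus(Lit("l")), Lit("o")])
prefix = extract_literal_prefix(hel_plus_o)
print(prefix, "helllo".startswith(prefix), "helllo".startswith("helo"))
```

With the stop signal the prefix is "hel", which every match of hel+o starts with, so a haystack like "helllo" is no longer skipped.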

src/bench.zig and src/benchmark.zig both import Explorer from explore.zig, which now requires the nanoregex module. CI's bench-regression workflow caught this. Also wire nanoregex into the wasm target for completeness (wasm build is broken on main anyway due to unrelated std API drift, but the import is consistent). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-11T18:09:35Z

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool	Base (ns)	Head (ns)	Delta	Abs Delta (ns)	Status
`codedb_bundle`	577221	518029	-10.25%	-59192	OK
`codedb_changes`	59508	61870	+3.97%	+2362	OK
`codedb_deps`	10265	10689	+4.13%	+424	OK
`codedb_edit`	6655	6494	-2.42%	-161	OK
`codedb_find`	67012	67496	+0.72%	+484	OK
`codedb_hot`	107909	106255	-1.53%	-1654	OK
`codedb_outline`	322137	326487	+1.35%	+4350	OK
`codedb_read`	97760	103522	+5.89%	+5762	OK
`codedb_search`	205656	158640	-22.86%	-47016	OK
`codedb_snapshot`	314618	316021	+0.45%	+1403	OK
`codedb_status`	138395	96961	-29.94%	-41434	OK
`codedb_symbol`	65602	73415	+11.91%	+7813	NOISE
`codedb_tree`	72452	81184	+12.05%	+8732	NOISE
`codedb_word`	77371	75584	-2.31%	-1787	OK

justrach and others added 2 commits May 12, 2026 02:01

justrach force-pushed the fix/454-nanoregex-integration branch from 76aac71 to 86db0a4 Compare May 11, 2026 18:07

justrach changed the base branch from main to release/v0.2.5813 May 11, 2026 18:07

justrach merged commit 3d7381b into release/v0.2.5813 May 11, 2026
1 check passed

justrach deleted the fix/454-nanoregex-integration branch May 11, 2026 18:09
