fix(explore): integrate nanoregex for correct regex matching (#454) #458

Merged: justrach merged 2 commits into release/v0.2.5813 from fix/454-nanoregex-integration on May 11, 2026

@justrach
Owner

Summary

Fixes #454. Replaces the ~300-line homegrown regex matcher with justrach/nanoregex, a pure-Zig Thompson-NFA/DFA engine.

Correctness wins

  • \b (word boundary) now works. Previously it was silently treated as the escaped literal b, so \bfoo\b searched for the literal text bfoob and missed real word-boundary matches such as a standalone foo.
  • {n,m} bounded quantifiers — previously unsupported.
  • Lazy quantifiers (*?, +?) — previously parsed as greedy followed by literal ?.
  • ReDoS-safe: patterns like (a+)+b can't catastrophically backtrack.
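
The first three behaviors can be checked against Python's `re`, which the commit message names as the semantics nanoregex aims to be compatible with. This is a minimal sketch of the expected results only: it does not call nanoregex, and the sample strings are made up for illustration. The ReDoS case is left out on purpose, because Python's `re` is itself a backtracking engine and `(a+)+b` on a long run of `a` is the kind of input it handles badly.

```python
import re

# \b is a zero-width word boundary, not the letter "b".
assert re.search(r"\bfoo\b", "call foo()") is not None
assert re.search(r"\bfoo\b", "foobar") is None

# What the old matcher effectively searched for: literal "b", "foo", literal "b".
assert re.search(r"bfoob", "call foo()") is None

# {n,m} bounds the repeat count: two to three digits here.
assert re.fullmatch(r"\d{2,3}", "42") is not None
assert re.fullmatch(r"\d{2,3}", "4") is None
assert re.fullmatch(r"\d{2,3}", "4242") is None

# Lazy quantifiers stop at the first possible end, greedy ones at the last.
lazy = re.search(r"<.+?>", "<a><b>").group(0)
greedy = re.search(r"<.+>", "<a><b>").group(0)
print(lazy, greedy)
```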

Performance

End-to-end in-process benchmark on codedb's own ~1.2MB source corpus:

| pattern | main (ms) | nanoregex (ms) | speedup |
|---|---|---|---|
| literal-common | 5.053 | 1.886 | 2.68x |
| literal-rare | 2.287 | 0.531 | 4.31x |
| alternation | 16.668 | 4.282 | 3.89x |
| dot-star | 6.358 | 1.995 | 3.19x |
| char-class | 5.634 | 1.309 | 4.30x |
| anchored | 0.752 | 2.616 | 0.29x ⚠ |

The anchored case regresses from 0.752 ms to 2.616 ms (about 3.5x slower), but it stays under 3 ms and isn't a typical workload. All non-anchored shapes are 2.7-4.3x faster.
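
The speedup column is the main timing divided by the nanoregex timing, so a value below 1.0 means nanoregex is slower. A small sketch that recomputes the column from the timings in the table above (the dictionary layout and function name are illustrative only):

```python
# Timings in milliseconds, copied from the table above: (main, nanoregex).
timings = {
    "literal-common": (5.053, 1.886),
    "literal-rare": (2.287, 0.531),
    "alternation": (16.668, 4.282),
    "dot-star": (6.358, 1.995),
    "char-class": (5.634, 1.309),
    "anchored": (0.752, 2.616),
}


def speedup(main_ms: float, nano_ms: float) -> float:
    """Return how many times faster nanoregex is; below 1.0 means slower."""
    return main_ms / nano_ms


ratios = {name: speedup(*pair) for name, pair in timings.items()}
for name, ratio in ratios.items():
    print(f"{name:15s} {ratio:.2f}x")
```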

Changes

  • build.zig.zon: add nanoregex dependency
  • build.zig: wire nanoregex module into exe, tests, adversarial_tests
  • src/explore.zig: remove 7 homegrown matcher functions (~300 lines), replace regexMatch with a thin nanoregex-backed shim, rewrite both searchInContent*Regex paths to compile once per file rather than per line
  • src/tests.zig: add issue-454 failing test for \b
  • Bonus upstream fix: patched a false-negative in nanoregex's extractLiteralPrefix (hel+o was computing prefix "helo" and skipping "helllo"). Worth pushing upstream to justrach/nanoregex.
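
The "compile once per file rather than per line" change in src/explore.zig can be pictured with Python's `re` standing in for nanoregex. This is a sketch of the call shape only, not the Zig code from this PR; both function names are hypothetical. Python caches compiled patterns internally, so the snippet shows the structural difference rather than the size of the saving.

```python
import re


def search_per_line(pattern: str, content: str) -> list[int]:
    """Old shape: the pattern is handed to the compiler again for every line."""
    hits = []
    for lineno, line in enumerate(content.splitlines(), start=1):
        if re.compile(pattern).search(line):  # compile inside the loop
            hits.append(lineno)
    return hits


def search_compile_once(pattern: str, content: str) -> list[int]:
    """New shape: compile once per file, then reuse the program for each line."""
    compiled = re.compile(pattern)  # compile outside the loop
    return [
        lineno
        for lineno, line in enumerate(content.splitlines(), start=1)
        if compiled.search(line)
    ]


sample = "fn main() {}\nconst x = 1;\nfn helper() {}\n"
print(search_compile_once(r"\bfn\b", sample))
```

Both functions return the same line numbers; only the amount of repeated compilation work differs.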

Test plan

  • zig build test passes (520/520).
  • All existing regexMatch: tests still pass — full semantic parity for the previously-supported subset.
  • issue-454 test now passes.

Notes

  • One match-count discrepancy with the old engine on fn [a-zA-Z_]+ (855 vs 847) — nanoregex is being stricter. A separate recall/precision audit against Python's re (in progress) will confirm which is more correct.
  • The trigram prefilter at src/index.zig:2259 (decomposeRegex) is independent and stays — it parses regex syntax to extract trigrams, not to match.
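
A count comparison like the one behind the 855 vs 847 note could be reproduced against Python's `re` along these lines. This is a hedged sketch: it assumes the count is the number of matching lines (the PR does not say whether lines or individual matches were counted), and the sample text and function name are invented for illustration.

```python
import re


def count_matching_lines(pattern: str, content: str) -> int:
    """Count lines that contain at least one match (assumed counting unit)."""
    compiled = re.compile(pattern)
    return sum(1 for line in content.splitlines() if compiled.search(line))


# Made-up sample; a real audit would run over the same corpus as the benchmark.
sample = (
    "fn main() void {}\n"
    "pub fn add_one(x: u32) u32 {}\n"
    "const fn_ptr = 1;\n"
    "fn 123() {}\n"
)
reference_count = count_matching_lines(r"fn [a-zA-Z_]+", sample)
print(reference_count)
```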

🤖 Generated with Claude Code

@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
|---|---|---|---|---|---|
| codedb_bundle | 341709 | 342890 | +0.35% | +1181 | OK |
| codedb_changes | 30985 | 31802 | +2.64% | +817 | OK |
| codedb_deps | 4931 | 5038 | +2.17% | +107 | OK |
| codedb_edit | 4314 | 4868 | +12.84% | +554 | NOISE |
| codedb_find | 43504 | 40877 | -6.04% | -2627 | OK |
| codedb_hot | 56655 | 56972 | +0.56% | +317 | OK |
| codedb_outline | 186709 | 180941 | -3.09% | -5768 | OK |
| codedb_read | 53718 | 56209 | +4.64% | +2491 | OK |
| codedb_search | 159412 | 163790 | +2.75% | +4378 | OK |
| codedb_snapshot | 209328 | 212808 | +1.66% | +3480 | OK |
| codedb_status | 238170 | 239573 | +0.59% | +1403 | OK |
| codedb_symbol | 36211 | 36529 | +0.88% | +318 | OK |
| codedb_tree | 38253 | 44133 | +15.37% | +5880 | NOISE |
| codedb_word | 40098 | 38855 | -3.10% | -1243 | OK |
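
The Status column follows from the two thresholds in the report header. A minimal sketch of that rule, assuming that only slowdowns count against the thresholds and that a failing row would be labelled `FAIL` (the report only ever shows OK and NOISE, so that label and the exact boundary comparisons are assumptions):

```python
PCT_THRESHOLD = 10.0        # percent, from the report header
ABS_THRESHOLD_NS = 50_000   # nanoseconds, from the report header


def classify(base_ns: int, head_ns: int) -> str:
    """Classify one benchmark row as OK, NOISE, or (hypothetically) FAIL."""
    delta_ns = head_ns - base_ns
    delta_pct = 100.0 * delta_ns / base_ns
    if delta_pct <= PCT_THRESHOLD:
        return "OK"      # small slowdowns and all improvements
    if delta_ns < ABS_THRESHOLD_NS:
        return "NOISE"   # percentage exceeded, absolute delta too small
    return "FAIL"        # hypothetical label: both thresholds exceeded


print(classify(4314, 4868))      # codedb_edit row
print(classify(341709, 342890))  # codedb_bundle row
```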

justrach and others added 2 commits May 12, 2026 02:01
Replace the homegrown regex matcher with nanoregex (justrach/nanoregex),
a pure-Zig Thompson-NFA/DFA engine with Python-re-compatible semantics.

Key correctness fix: the old matcher silently treated \b as literal 'b'
instead of a word-boundary assertion, causing false matches (issue #454).
nanoregex correctly handles \b, \B, {n,m} quantifiers, and is immune to
catastrophic backtracking on patterns like (a+)+b.

Also patches a false-negative bug in nanoregex's extractLiteralPrefix
prefilter: patterns like hel+o incorrectly computed "helo" as the literal
prefix (skipping matches in haystacks like "helllo"). Fixed by making
collectPrefix return a stop-signal bool so the concat loop halts after any
quantified node even when that node extended the prefix.
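
The prefix bug can be modelled on a toy pattern language of literal characters with optional `+`, `*` or `?`. This is an illustrative Python model, not nanoregex's Zig code: `collect_prefix` borrows its name from the message above, but its body, the `parse` helper and the `fixed` flag are assumptions made for the sketch.

```python
def parse(pattern: str) -> list[tuple[str, str]]:
    """Split a toy pattern into (char, quantifier) nodes; quantifier may be ''."""
    nodes = []
    i = 0
    while i < len(pattern):
        has_quant = i + 1 < len(pattern) and pattern[i + 1] in "+*?"
        quant = pattern[i + 1] if has_quant else ""
        nodes.append((pattern[i], quant))
        i += 2 if quant else 1
    return nodes


def collect_prefix(node: tuple[str, str], out: list[str]) -> bool:
    """Append what this node guarantees; return True if the caller must stop."""
    char, quant = node
    if quant == "":
        out.append(char)
        return False
    if quant == "+":
        out.append(char)  # at least one copy is guaranteed
    return True           # any quantified node ends the literal prefix


def literal_prefix(pattern: str, fixed: bool = True) -> str:
    out: list[str] = []
    for node in parse(pattern):
        before = len(out)
        must_stop = collect_prefix(node, out)
        extended = len(out) > before
        if fixed and must_stop:
            break         # fixed: honor the stop signal
        if not fixed and not extended:
            break         # modelled bug: halt only when nothing was added
    return "".join(out)


print(literal_prefix("hel+o", fixed=False), literal_prefix("hel+o"))
```

With the modelled bug the prefix is `helo`, which does not occur in `helllo`, so a prefilter would skip a haystack that the full pattern matches. With the stop signal honored the prefix is `hel`, which does occur.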

Performance: nanoregex is 4-6x faster than the homegrown matcher on common
codedb_search shapes (literal, alternation, dot-star, char-class) due to
DFA table-lookup hot path after warmup.

Changes:
- build.zig.zon: add nanoregex dependency
- build.zig: wire nanoregex module into exe, tests, adversarial_tests
- src/explore.zig: replace regexMatch + 7 helper functions (~300 lines)
  with nanoregex-backed implementations; swap two call sites to compile
  once per file rather than per line
- src/tests.zig: add failing test for issue-454 word-boundary behaviour
- zig-pkg/nanoregex-*/src/prefilter.zig: fix extractLiteralPrefix bug

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
src/bench.zig and src/benchmark.zig both import Explorer from
explore.zig, which now requires the nanoregex module. CI's
bench-regression workflow caught this. Also wire nanoregex into the
wasm target for completeness (wasm build is broken on main anyway
due to unrelated std API drift, but the import is consistent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@justrach justrach force-pushed the fix/454-nanoregex-integration branch from 76aac71 to 86db0a4 Compare May 11, 2026 18:07
@justrach justrach changed the base branch from main to release/v0.2.5813 May 11, 2026 18:07
@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
|---|---|---|---|---|---|
| codedb_bundle | 577221 | 518029 | -10.25% | -59192 | OK |
| codedb_changes | 59508 | 61870 | +3.97% | +2362 | OK |
| codedb_deps | 10265 | 10689 | +4.13% | +424 | OK |
| codedb_edit | 6655 | 6494 | -2.42% | -161 | OK |
| codedb_find | 67012 | 67496 | +0.72% | +484 | OK |
| codedb_hot | 107909 | 106255 | -1.53% | -1654 | OK |
| codedb_outline | 322137 | 326487 | +1.35% | +4350 | OK |
| codedb_read | 97760 | 103522 | +5.89% | +5762 | OK |
| codedb_search | 205656 | 158640 | -22.86% | -47016 | OK |
| codedb_snapshot | 314618 | 316021 | +0.45% | +1403 | OK |
| codedb_status | 138395 | 96961 | -29.94% | -41434 | OK |
| codedb_symbol | 65602 | 73415 | +11.91% | +7813 | NOISE |
| codedb_tree | 72452 | 81184 | +12.05% | +8732 | NOISE |
| codedb_word | 77371 | 75584 | -2.31% | -1787 | OK |

@justrach justrach merged commit 3d7381b into release/v0.2.5813 May 11, 2026
1 check passed
@justrach justrach deleted the fix/454-nanoregex-integration branch May 11, 2026 18:09

Development

Successfully merging this pull request may close these issues.

regex: integrate nanoregex to add \b, {n,m}, lazy quants, and ReDoS-safe matching