fix(explore): surface skip-trigram files >64KB for common identifiers (#447, #451) by justrach · Pull Request #456 · justrach/codedb

justrach · 2026-05-11T17:19:48Z

Summary

Fixes #447: searchContent could not see canonical definition sites in files >64KB. Those files land in skip_trigram_files (per watcher.zig:446) and were reachable only via Tier 3, which was scheduled after Tier 1 and so never ran once Tier 1 had filled max_results.
Fixes #451: the same bug, mirrored in searchContentWithScope. codedb_callers also uses this path, so call sites in large files benefit too.

Approach

Build a per-file word-hit count from word_index.search(query), then merge skip_trigram_files paths that appear in that count map into the Tier 1 candidate pool. The existing sort (word-hit count desc) places definition-dense large files ahead of small files with incidental mentions. Skip-trigram files with zero word hits still fall through to Tier 3 (unchanged).
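
The merge-and-sort step described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the PR: buildCandidates and moreHits are hypothetical names, both maps are assumed to be std.StringHashMap instances keyed by path, and the `.empty` list initializer assumes Zig 0.14 or newer.

```zig
const std = @import("std");

/// Hypothetical helper (not from the PR): start from the trigram candidates,
/// add every skip-trigram path that has at least one word-index hit, then
/// sort the pool by word-hit count, highest first.
fn buildCandidates(
    allocator: std.mem.Allocator,
    trigram_candidates: []const []const u8,
    skip_trigram_files: *const std.StringHashMap(void),
    hits_per_file: *const std.StringHashMap(u32),
) ![][]const u8 {
    var combined: std.ArrayListUnmanaged([]const u8) = .empty;
    errdefer combined.deinit(allocator);

    try combined.appendSlice(allocator, trigram_candidates);

    // Large files with zero word hits are left out here; in the PR's
    // description they still fall through to Tier 3.
    var it = skip_trigram_files.keyIterator();
    while (it.next()) |key_ptr| {
        if (hits_per_file.contains(key_ptr.*)) {
            try combined.append(allocator, key_ptr.*);
        }
    }

    std.mem.sort([]const u8, combined.items, hits_per_file, moreHits);
    return combined.toOwnedSlice(allocator);
}

/// Orders paths by descending hit count; paths missing from the map count as 0.
fn moreHits(hits: *const std.StringHashMap(u32), a: []const u8, b: []const u8) bool {
    return (hits.get(a) orelse 0) > (hits.get(b) orelse 0);
}
```

With a hit map like the one in the test plan (85 hits in a large file against a couple in a small one), the large file sorts to the front of the pool, so it is visited before max_results fills.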

Test plan

zig build test passes (520/520 including new issue-447 and issue-451 tests).
Searching for Explorer with codedb_search now surfaces src/explore.zig (233KB, 85 occurrences).

Commits

test: failing tests for #447 and #451 (skip-trigram invisibility)
fix(explore): merge skip_trigram_files into Tier 1 candidate pool (#447, #451)

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…, #451)

Files >64KB skip trigram indexing and land in skip_trigram_files. Pre-fix, searchContent deferred these to Tier 3, which never ran when Tier 1 (trigram candidates) already filled max_results, making the canonical definition site invisible. Same bug existed in searchContentWithScope.

Fix: build word-hit counts per file and merge skip_trigram_files paths that have word-index hits into the Tier 1 candidate pool, sorted by hit count desc alongside trigram candidates. This ensures definition-dense files (high word hit counts) surface even when max_results fills during Tier 1 traversal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-11T17:22:12Z

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.
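
One plausible reading of the two thresholds is sketched below. The CI script itself is not shown in this thread, so classify and the Status names are hypothetical, and the rule that a failure needs both thresholds exceeded is an assumption inferred from the NOISE definition above. Speedups are treated as OK, which matches the rows reported as OK despite deltas below -10%.

```zig
const std = @import("std");

// Thresholds as stated in the report.
const pct_threshold: f64 = 10.0;
const abs_threshold_ns: u64 = 50_000;

// Hypothetical status names; the report itself prints OK and NOISE.
const Status = enum { ok, noise, regression };

/// Assumed rule: only slowdowns count, and a regression must exceed both the
/// percentage threshold and the absolute threshold to fail CI.
fn classify(base_ns: u64, head_ns: u64) Status {
    if (head_ns <= base_ns) return .ok; // faster or unchanged
    const abs_delta = head_ns - base_ns;
    const pct = @as(f64, @floatFromInt(abs_delta)) * 100.0 /
        @as(f64, @floatFromInt(base_ns));
    if (pct <= pct_threshold) return .ok;
    return if (abs_delta > abs_threshold_ns) .regression else .noise;
}
```

Under this reading, a tool that goes from 1,000 ns to 1,500 ns is NOISE (+50%, but only 500 ns), while one that goes from 100,000 ns to 200,000 ns fails.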

Tool	Base (ns)	Head (ns)	Delta	Abs Delta (ns)	Status
`codedb_bundle`	586460	611011	+4.19%	+24551	OK
`codedb_changes`	59910	64049	+6.91%	+4139	OK
`codedb_deps`	11313	9955	-12.00%	-1358	OK
`codedb_edit`	7824	6221	-20.49%	-1603	OK
`codedb_find`	67447	68656	+1.79%	+1209	OK
`codedb_hot`	111887	109066	-2.52%	-2821	OK
`codedb_outline`	325470	312754	-3.91%	-12716	OK
`codedb_read`	108679	99892	-8.09%	-8787	OK
`codedb_search`	211073	213570	+1.18%	+2497	OK
`codedb_snapshot`	322528	319893	-0.82%	-2635	OK
`codedb_status`	151376	123272	-18.57%	-28104	OK
`codedb_symbol`	67901	63741	-6.13%	-4160	OK
`codedb_tree`	75055	72526	-3.37%	-2529	OK
`codedb_word`	82399	79817	-3.13%	-2582	OK

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d3fa0e325


chatgpt-codex-connector · 2026-05-11T17:22:25Z

+            var skip_iter_t1 = self.skip_trigram_files.keyIterator();
+            while (skip_iter_t1.next()) |key_ptr| {
+                if (hits_per_file.contains(key_ptr.*)) {
+                    try combined.append(allocator, key_ptr.*);


Avoid scanning every skipped file per search

In repositories with many files above the trigram limit, this loop makes every searchContent call walk the entire skip_trigram_files map just to find the few paths that also have word hits, even when the normal trigram candidates are sufficient. searchContent is the benchmarked query path, so this adds O(number of large skipped files) work before the fast path can return; consider iterating the word_hits paths and checking membership in skip_trigram_files instead.
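
The reviewer's suggestion, inverting the loop so that it walks the word-hit paths and probes skip_trigram_files, might look roughly like this. It is a sketch only: appendSkippedWithHits is a hypothetical name, both maps are assumed to be std.StringHashMap instances keyed by path, and the `.empty` initializer used by callers assumes Zig 0.14 or newer.

```zig
const std = @import("std");

/// Hypothetical helper (not from the PR): append every path that has word
/// hits AND is in skip_trigram_files. The loop runs once per file with word
/// hits instead of once per skipped large file.
fn appendSkippedWithHits(
    allocator: std.mem.Allocator,
    combined: *std.ArrayListUnmanaged([]const u8),
    skip_trigram_files: *const std.StringHashMap(void),
    hits_per_file: *const std.StringHashMap(u32),
) !void {
    var it = hits_per_file.keyIterator();
    while (it.next()) |path_ptr| {
        // One hash lookup per word-hit path.
        if (skip_trigram_files.contains(path_ptr.*)) {
            try combined.append(allocator, path_ptr.*);
        }
    }
}
```

One difference from the original loop is that the appended slices are the keys of hits_per_file rather than of skip_trigram_files, so the hit map has to outlive the candidate list unless the paths are copied.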


…ces)

ab8f7cd's Tier 0 rewrite already fixes #447 implicitly: Tier 0 now builds candidates directly from word_index.search, which captures hits in skip_trigram_files alongside fully-indexed files. The new test pins this behavior so a future Tier 0 refactor cannot silently regress. (PR #456's structural fix on top of the old code is obviated; closing.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

justrach · 2026-05-11T18:55:15Z

Closing — #447 and #451 are already fixed implicitly by ab8f7cd on release/v0.2.5813. That rewrite restructured Tier 0 to build candidates directly from word_index.search, which captures hits in skip_trigram_files alongside fully-indexed files (see the comment at src/explore.zig:1573). A regression test for #447 was added on the release branch in c75b574; the #451 test was already added in ab8f7cd.

justrach and others added 2 commits May 12, 2026 00:40

test: failing tests for #447 and #451 (skip-trigram invisibility)

930fc77

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chatgpt-codex-connector Bot reviewed May 11, 2026


justrach closed this May 11, 2026

justrach wants to merge 2 commits into main from fix/447-451-skip-trigram-canonical
