fix(explore): surface skip-trigram files >64KB for common identifiers (#447, #451) #456

Closed
justrach wants to merge 2 commits into main from fix/447-451-skip-trigram-canonical

@justrach
Owner

Summary

  • Fixes #447: canonical definition sites in files >64KB were invisible to searchContent. Those files land in skip_trigram_files (per watcher.zig:446) and were only reachable via Tier 3, which never ran once Tier 1 had already filled max_results.
  • Fixes #451: same bug, mirrored in searchContentWithScope. codedb_callers also uses this path so call sites in large files benefit.

Approach

Build a per-file word-hit count from word_index.search(query), then merge skip_trigram_files paths that appear in that count map into the Tier 1 candidate pool. The existing sort (word-hit count desc) places definition-dense large files ahead of small files with incidental mentions. Skip-trigram files with zero word hits still fall through to Tier 3 (unchanged).
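The merge described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/explore.zig: Candidate, moreHits, mergeCandidates, the slice parameters, and the fixed output buffer are hypothetical names introduced here, and hits_per_file is assumed to be a std.StringHashMap(u32) keyed by path.

```zig
const std = @import("std");

/// Hypothetical candidate record; the real types in explore.zig are not shown in this PR.
const Candidate = struct {
    path: []const u8,
    hits: u32,
};

/// Orders candidates by word-hit count, highest first.
fn moreHits(_: void, a: Candidate, b: Candidate) bool {
    return a.hits > b.hits;
}

/// Builds the Tier 1 pool: every trigram candidate, plus only those
/// skip-trigram paths that have at least one word-index hit.
/// Writes into `out` (assumed large enough for this sketch) and returns the filled, sorted prefix.
fn mergeCandidates(
    trigram_paths: []const []const u8,
    skip_paths: []const []const u8,
    hits_per_file: *const std.StringHashMap(u32),
    out: []Candidate,
) []Candidate {
    var n: usize = 0;
    for (trigram_paths) |p| {
        if (n == out.len) break;
        out[n] = .{ .path = p, .hits = hits_per_file.get(p) orelse 0 };
        n += 1;
    }
    for (skip_paths) |p| {
        if (n == out.len) break;
        // Skip-trigram files with zero word hits are left for Tier 3.
        if (hits_per_file.get(p)) |h| {
            out[n] = .{ .path = p, .hits = h };
            n += 1;
        }
    }
    std.mem.sort(Candidate, out[0..n], {}, moreHits);
    return out[0..n];
}

test "definition-dense large file ranks ahead of a small file" {
    var hits = std.StringHashMap(u32).init(std.testing.allocator);
    defer hits.deinit();
    try hits.put("src/small.zig", 2);
    try hits.put("src/explore.zig", 85);

    var buf: [8]Candidate = undefined;
    const ranked = mergeCandidates(
        &.{"src/small.zig"},
        &.{ "src/explore.zig", "src/big_unrelated.zig" },
        &hits,
        &buf,
    );
    try std.testing.expectEqual(@as(usize, 2), ranked.len);
    try std.testing.expectEqualStrings("src/explore.zig", ranked[0].path);
}
```

Because the large file enters the pool before sorting, truncating the sorted pool to max_results no longer drops it in favor of small files with incidental mentions.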

Test plan

  • zig build test passes (520/520 including new issue-447 and issue-451 tests).
  • Searching for Explorer with codedb_search now surfaces src/explore.zig (233KB, 85 occurrences).

Commits

  1. test: failing tests for #447 and #451 (skip-trigram invisibility)
  2. fix(explore): merge skip_trigram_files into Tier 1 candidate pool (#447, #451)

🤖 Generated with Claude Code

justrach and others added 2 commits May 12, 2026 00:40
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, #451)

Files >64KB skip trigram indexing and land in skip_trigram_files. Pre-fix,
searchContent deferred these to Tier 3 which never ran when Tier 1 (trigram
candidates) already filled max_results — making the canonical definition site
invisible. Same bug existed in searchContentWithScope.

Fix: build word-hit counts per file and merge skip_trigram_files paths that
have word-index hits into the Tier 1 candidate pool, sorted by hit count desc
alongside trigram candidates. This ensures definition-dense files (high word
hit counts) surface even when max_results fills during Tier 1 traversal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
|---|---|---|---|---|---|
| codedb_bundle | 586460 | 611011 | +4.19% | +24551 | OK |
| codedb_changes | 59910 | 64049 | +6.91% | +4139 | OK |
| codedb_deps | 11313 | 9955 | -12.00% | -1358 | OK |
| codedb_edit | 7824 | 6221 | -20.49% | -1603 | OK |
| codedb_find | 67447 | 68656 | +1.79% | +1209 | OK |
| codedb_hot | 111887 | 109066 | -2.52% | -2821 | OK |
| codedb_outline | 325470 | 312754 | -3.91% | -12716 | OK |
| codedb_read | 108679 | 99892 | -8.09% | -8787 | OK |
| codedb_search | 211073 | 213570 | +1.18% | +2497 | OK |
| codedb_snapshot | 322528 | 319893 | -0.82% | -2635 | OK |
| codedb_status | 151376 | 123272 | -18.57% | -28104 | OK |
| codedb_symbol | 67901 | 63741 | -6.13% | -4160 | OK |
| codedb_tree | 75055 | 72526 | -3.37% | -2529 | OK |
| codedb_word | 82399 | 79817 | -3.13% | -2582 | OK |
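
For readers unfamiliar with the report, the two-threshold rule can be sketched as below. This is an assumption-laden illustration, not the CI script itself: the Status enum, the classify function, the name of the failing status, the strictness of the comparisons, and the treatment of speedups as OK (inferred from the negative-delta rows above) are all guesses made for the example.

```zig
const std = @import("std");

/// Hypothetical status names; only OK and NOISE appear in the report text.
const Status = enum { ok, noise, regression };

const pct_threshold: f64 = 10.0; // "10.00%" from the report header
const abs_threshold_ns: i64 = 50_000; // "50,000 ns absolute delta"

/// Classifies one benchmark row. Assumes base_ns > 0.
fn classify(base_ns: i64, head_ns: i64) Status {
    std.debug.assert(base_ns > 0);
    const delta = head_ns - base_ns;
    // Speedups and unchanged timings never fail (assumed from the table rows).
    if (delta <= 0) return .ok;
    const pct = @as(f64, @floatFromInt(delta)) / @as(f64, @floatFromInt(base_ns)) * 100.0;
    if (pct <= pct_threshold) return .ok;
    // Percentage threshold exceeded, but the absolute slowdown is too small to fail CI.
    if (delta <= abs_threshold_ns) return .noise;
    return .regression;
}

test "rows from the report classify as OK" {
    try std.testing.expectEqual(Status.ok, classify(586460, 611011)); // codedb_bundle, +4.19%
    try std.testing.expectEqual(Status.ok, classify(7824, 6221)); // codedb_edit, -20.49%
}
```

Under these assumptions, a tool going from 10,000 ns to 12,000 ns (+20%, +2,000 ns) would be reported as NOISE, while one going from 1,000,000 ns to 1,200,000 ns (+20%, +200,000 ns) would fail.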


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d3fa0e325

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/explore.zig
Comment on lines +1664 to +1667
var skip_iter_t1 = self.skip_trigram_files.keyIterator();
while (skip_iter_t1.next()) |key_ptr| {
    if (hits_per_file.contains(key_ptr.*)) {
        try combined.append(allocator, key_ptr.*);

P2: Avoid scanning every skipped file per search

In repositories with many files above the trigram limit, this loop makes every searchContent call walk the entire skip_trigram_files map just to find the few paths that also have word hits, even when the normal trigram candidates are sufficient. searchContent is the benchmarked query path, so this adds O(number of large skipped files) work before the fast path can return; consider iterating the word_hits paths and checking membership in skip_trigram_files instead.
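
The inverted loop the reviewer suggests might look like the sketch below. It is illustrative only: collectSkippedWithHits and the output buffer are hypothetical, and both maps are assumed to be std.StringHashMap instances (the real field types in explore.zig are not shown in this thread).

```zig
const std = @import("std");

/// Walks the (usually small) set of files with word hits and keeps those that are
/// also in the skip-trigram set, so the cost scales with the number of hit files
/// rather than with the number of large skipped files.
/// Writes paths into `out` and returns how many were written.
fn collectSkippedWithHits(
    hits_per_file: *const std.StringHashMap(u32),
    skip_trigram_files: *const std.StringHashMap(void),
    out: [][]const u8,
) usize {
    var n: usize = 0;
    var it = hits_per_file.keyIterator();
    while (it.next()) |key_ptr| {
        if (n == out.len) break;
        if (skip_trigram_files.contains(key_ptr.*)) {
            out[n] = key_ptr.*;
            n += 1;
        }
    }
    return n;
}

test "only hit files are visited" {
    const a = std.testing.allocator;
    var hits = std.StringHashMap(u32).init(a);
    defer hits.deinit();
    var skipped = std.StringHashMap(void).init(a);
    defer skipped.deinit();

    try hits.put("small.zig", 3);
    try hits.put("big.zig", 40);
    try skipped.put("big.zig", {});
    try skipped.put("huge1.zig", {});
    try skipped.put("huge2.zig", {});

    var buf: [4][]const u8 = undefined;
    const n = collectSkippedWithHits(&hits, &skipped, &buf);
    try std.testing.expectEqual(@as(usize, 1), n);
    try std.testing.expectEqualStrings("big.zig", buf[0]);
}
```

The result set is the same as in the reviewed loop; only the map being iterated changes, which matters when skip_trigram_files is much larger than the per-query hit map.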

justrach added a commit that referenced this pull request May 11, 2026
…ces)

ab8f7cd's Tier 0 rewrite already fixes #447 implicitly — Tier 0 now
builds candidates directly from word_index.search, which captures hits
in skip_trigram_files alongside fully-indexed files. The new test pins
this behavior so a future Tier 0 refactor cannot silently regress.

(PR #456's structural fix on top of the old code is obviated; closing.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@justrach
Owner Author

Closing — #447 and #451 are already fixed implicitly by ab8f7cd on release/v0.2.5813. That rewrite restructured Tier 0 to build candidates directly from word_index.search, which captures hits in skip_trigram_files alongside fully-indexed files (see the comment at src/explore.zig:1573). A regression test for #447 was added on the release branch in c75b574; the #451 test was already added in ab8f7cd.

@justrach justrach closed this May 11, 2026