fix(explore): surface skip-trigram files >64KB for common identifiers (#447, #451)#456
fix(explore): surface skip-trigram files >64KB for common identifiers (#447, #451)#456justrach wants to merge 2 commits into
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, #451) Files >64KB skip trigram indexing and land in skip_trigram_files. Pre-fix, searchContent deferred these to Tier 3 which never ran when Tier 1 (trigram candidates) already filled max_results — making the canonical definition site invisible. Same bug existed in searchContentWithScope. Fix: build word-hit counts per file and merge skip_trigram_files paths that have word-index hits into the Tier 1 candidate pool, sorted by hit count desc alongside trigram candidates. This ensures definition-dense files (high word hit counts) surface even when max_results fills during Tier 1 traversal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6d3fa0e325
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| var skip_iter_t1 = self.skip_trigram_files.keyIterator(); | ||
| while (skip_iter_t1.next()) |key_ptr| { | ||
| if (hits_per_file.contains(key_ptr.*)) { | ||
| try combined.append(allocator, key_ptr.*); |
There was a problem hiding this comment.
Avoid scanning every skipped file per search
In repositories with many files above the trigram limit, this loop makes every searchContent call walk the entire skip_trigram_files map just to find the few paths that also have word hits, even when the normal trigram candidates are sufficient. searchContent is the benchmarked query path, so this adds O(number of large skipped files) work before the fast path can return; consider iterating the word_hits paths and checking membership in skip_trigram_files instead.
Useful? React with 👍 / 👎.
…ces) ab8f7cd's Tier 0 rewrite already fixes #447 implicitly — Tier 0 now builds candidates directly from word_index.search, which captures hits in skip_trigram_files alongside fully-indexed files. The new test pins this behavior so a future Tier 0 refactor cannot silently regress. (PR #456's structural fix on top of the old code is obviated; closing.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Closing — #447 and #451 are already fixed implicitly by |
Summary
searchContentwas invisible to canonical definition sites in files >64KB. Those files land inskip_trigram_files(perwatcher.zig:446) and were only reachable via Tier 3, which ran AFTER Tier 1 had already filledmax_results.searchContentWithScope.codedb_callersalso uses this path so call sites in large files benefit.Approach
Build a per-file word-hit count from
word_index.search(query), then mergeskip_trigram_filespaths that appear in that count map into the Tier 1 candidate pool. The existing sort (word-hit count desc) places definition-dense large files ahead of small files with incidental mentions. Skip-trigram files with zero word hits still fall through to Tier 3 (unchanged).Test plan
zig build testpasses (520/520 including newissue-447andissue-451tests).Explorernow surfacessrc/explore.zig(233KB, 85 occurrences) incodedb_search Explorer.Commits
test: failing tests for #447 and #451 (skip-trigram invisibility)fix(explore): merge skip_trigram_files into Tier 1 candidate pool (#447, #451)🤖 Generated with Claude Code