fix(explore): keep Tier 0 code-first diversity for popular identifiers (#449) #457
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…449) Gate Tier 0 on code-language hit count instead of total posting-list length so queries where doc files dominate the word index still get the code-first pass, while all-code popular queries (issue-427) still fall through to Tier 1's hit-count sort. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d1ea27adf4
    var code_hit_count: usize = 0;
    for (word_hits) |hit| {
        const hp = self.word_index.hitPath(hit);
        if (hp.len > 0 and !isDocLanguage(detectLanguage(hp))) code_hit_count += 1;
    }
Short-circuit code-hit counting once over the Tier 0 gate
For popular identifiers that have more than max_results * 2 code hits, this loop still walks the entire posting list just to decide Tier 0 should be skipped, and Tier 1 immediately walks the same word_hits again to build hits_per_file. The old total-hit gate was an O(1) length check in this path, so large all-code or mostly-code queries now pay an extra full posting-list traversal before taking the same Tier 1 path; break as soon as code_hit_count exceeds the threshold to avoid regressing common searchContent calls.
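The early exit suggested above could look roughly like the sketch below. It is not the PR's code: `isDocPath`, `GateScan`, and `scanCodeHits` are hypothetical names, hits are modeled as plain path strings, and the `.md` check stands in for the real `isDocLanguage(detectLanguage(hp))` call. The `visited` field exists only to make the early stop observable.

```zig
const std = @import("std");

// Hypothetical stand-in for isDocLanguage(detectLanguage(path)):
// this sketch treats any ".md" path as a doc file.
fn isDocPath(path: []const u8) bool {
    return std.mem.endsWith(u8, path, ".md");
}

// Result of the gate scan; `visited` records how many hits were examined.
const GateScan = struct { exceeded: bool, visited: usize };

// Counts code-language hits, but stops as soon as the count exceeds `limit`
// instead of walking the rest of the posting list.
fn scanCodeHits(hit_paths: []const []const u8, limit: usize) GateScan {
    var code_hit_count: usize = 0;
    var visited: usize = 0;
    for (hit_paths) |path| {
        visited += 1;
        if (path.len == 0 or isDocPath(path)) continue;
        code_hit_count += 1;
        if (code_hit_count > limit) {
            return GateScan{ .exceeded = true, .visited = visited };
        }
    }
    return GateScan{ .exceeded = false, .visited = visited };
}
```

With `limit` set to `max_results * 2`, Tier 0 would run only when `exceeded` is false. For an all-code posting list the scan then stops after `limit + 1` code hits rather than traversing every entry before Tier 1 walks the list again.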
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
Summary
Fixes #449. Tier 0 of `searchContent` had a `word_hits.len <= max_results * 2` gate that skipped the whole Tier 0 code-first/doc-second diversity pass when a posting list got large. For popular identifiers like `fooBar`, that meant markdown files with many incidental mentions could fill `max_results` before any code file was scanned.
Approach
Replace the total-hit-count gate with a code-language-only gate. The new check counts hits in code-language files specifically; when code hits stay within bounds, Tier 0's two-pass (code, then doc) runs even if total hits are large. When the population is all-code (the #427 scenario), Tier 1's existing hit-count sort takes over as before.
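The difference between the two gates can be illustrated with a minimal sketch. The names `isDocPath`, `oldGate`, and `newGate` are hypothetical, hits are modeled as path strings, and the sketch assumes the new gate keeps the `max_results * 2` threshold of the old one, which the description implies but does not state outright.

```zig
const std = @import("std");

// Hypothetical stand-in for the real language detection:
// this sketch treats any ".md" path as a doc file.
fn isDocPath(path: []const u8) bool {
    return std.mem.endsWith(u8, path, ".md");
}

// Old gate: Tier 0 runs only when the total posting list is small.
fn oldGate(hit_paths: []const []const u8, max_results: usize) bool {
    return hit_paths.len <= max_results * 2;
}

// New gate (assumed threshold): Tier 0 runs when the number of
// code-language hits is small, regardless of how many doc hits exist.
fn newGate(hit_paths: []const []const u8, max_results: usize) bool {
    var code_hit_count: usize = 0;
    for (hit_paths) |path| {
        if (path.len > 0 and !isDocPath(path)) code_hit_count += 1;
    }
    return code_hit_count <= max_results * 2;
}
```

For a posting list of five markdown paths and one code path with `max_results = 1`, `oldGate` returns false (6 > 2) while `newGate` returns true (1 <= 2), so the code-first pass still runs. For an all-code list of the same size both gates return false and Tier 1 takes over.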
Test plan
- `zig build test` passes (519/519 including the new `issue-449` test).
- `issue-427` regression scenario still passes (verified manually).
Commits
- test: failing test for #449 (Tier 0 gate bypass)
- fix(explore): keep Tier 0 code/doc diversity for popular identifiers (#449)
🤖 Generated with Claude Code