[Feature] Introduce Knowledge Compiling Module #160
Code Review
This pull request introduces a 'Knowledge Compile' module, enabling offline document processing into hierarchical tree indices and knowledge clusters. It adds a new CLI command, a linting utility for health checks, and integrates these artifacts into the search pipeline for improved retrieval precision. Feedback focuses on addressing performance bottlenecks in cross-reference building, eliminating hardcoded model names and processing limits, and optimizing I/O by reusing file hashes throughout the pipeline.
for i in range(len(cluster_ids)):
    for j in range(i + 1, len(cluster_ids)):
        cid_a, cid_b = cluster_ids[i], cluster_ids[j]
        shared = cluster_to_files[cid_a] & cluster_to_files[cid_b]
        if not shared:
            continue

        pair_key = (min(cid_a, cid_b), max(cid_a, cid_b))
        if pair_key in pairs_seen:
            continue
        pairs_seen.add(pair_key)

        weight = min(len(shared) * 0.25, 1.0)
        c_a = await self._storage.get(cid_a)
        c_b = await self._storage.get(cid_b)
        if c_a and c_b:
            self._add_edge(c_a, cid_b, "co_occur", weight)
            self._add_edge(c_b, cid_a, "co_occur", weight)
            await self._storage.update(c_a)
            await self._storage.update(c_b)
            edges_created += 1
The _build_cross_references method implements an O(N^2) loop over cluster pairs, performing multiple asynchronous database operations (get, update) within the inner loop. This will lead to severe performance issues as the knowledge base grows. Consider batching these updates or using a more efficient graph construction strategy.
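One possible shape for the batching suggested here, as a minimal sketch rather than the PR's implementation: count shared files per cluster pair in memory through a file-to-clusters index, then read and write each affected cluster once. `InMemoryStorage`, the dict-based cluster shape, and the tuple edge format are hypothetical stand-ins; only `get`, `update`, the `"co_occur"` label, and the `0.25` weighting come from the reviewed code.

```python
import asyncio
from collections import defaultdict
from itertools import combinations


class InMemoryStorage:
    """Hypothetical stand-in for the PR's storage; counts calls for illustration."""

    def __init__(self, clusters):
        self._clusters = clusters
        self.gets = 0
        self.updates = 0

    async def get(self, cid):
        self.gets += 1
        return self._clusters.get(cid)

    async def update(self, cluster):
        self.updates += 1
        self._clusters[cluster["id"]] = cluster


async def build_cross_references(storage, cluster_to_files):
    # Invert the mapping so clusters with no file in common are never paired.
    file_to_clusters = defaultdict(set)
    for cid, files in cluster_to_files.items():
        for path in files:
            file_to_clusters[path].add(cid)

    # Count shared files per unordered pair, purely in memory.
    shared_counts = defaultdict(int)
    for cids in file_to_clusters.values():
        for a, b in combinations(sorted(cids), 2):
            shared_counts[(a, b)] += 1

    # Load every cluster that appears in at least one pair exactly once.
    touched = sorted({cid for pair in shared_counts for cid in pair})
    loaded = await asyncio.gather(*(storage.get(cid) for cid in touched))
    clusters = {cid: c for cid, c in zip(touched, loaded) if c}

    edges_created = 0
    dirty = set()
    for (a, b), n in shared_counts.items():
        if a not in clusters or b not in clusters:
            continue  # same guard as `if c_a and c_b` in the reviewed code
        weight = min(n * 0.25, 1.0)  # same weighting as the reviewed code
        # Stand-in for self._add_edge(...); the real edge format is not shown in the PR hunk.
        clusters[a]["edges"].append((b, "co_occur", weight))
        clusters[b]["edges"].append((a, "co_occur", weight))
        dirty.update((a, b))
        edges_created += 1

    # Write each modified cluster once instead of once per pair.
    await asyncio.gather(*(storage.update(clusters[cid]) for cid in sorted(dirty)))
    return edges_created
```

With this layout the number of storage calls grows with the number of clusters that share files, not with the number of pairs. The in-memory pairing is still quadratic in the worst case where a single file belongs to every cluster.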
llm = OpenAIChat(
    base_url=os.getenv("LLM_BASE_URL", "https://api.openai.com/v1"),
    api_key=llm_api_key,
    model=os.getenv("LLM_MODEL_NAME", "gpt-5.2"),

"""Result of compiling a single file."""

path: str
tree: Optional[DocumentTree] = None

When *shallow* is True (or file is ineligible for tree indexing),
the pipeline skips tree building and summarises via a direct LLM call.
"""
result = FileCompileResult(path=entry.path)

report.trees_built += 1
# Update manifest
manifest.files[result.path] = FileManifestEntry(
    file_hash=get_fast_hash(result.path) or "",

from sirchmunk.learnings.tree_indexer import DocumentTree

trees: List[DocumentTree] = []
for tree_file in sorted(tree_cache.glob("*.json"))[:50]:
The number of tree indices processed during probing is hardcoded to 50. Similar hardcoded limits exist in _probe_compile_hints (50 clusters, 100 trees). In large environments, these limits significantly restrict the effectiveness of the knowledge network. Consider making these thresholds configurable.
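A minimal sketch of how these thresholds could be made configurable, assuming a small settings object read from environment variables. `ProbeLimits`, `select_tree_files`, and the `SIRCHMUNK_*` variable names are hypothetical and do not appear in the PR; the defaults simply mirror the hardcoded values mentioned above.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class ProbeLimits:
    """Hypothetical settings object; defaults mirror the values noted in the review."""

    max_tree_files: int = 50
    max_hint_clusters: int = 50
    max_hint_trees: int = 100

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "ProbeLimits":
        env = os.environ if env is None else env

        def read(name: str, default: int) -> int:
            # Fall back to the default when the variable is unset or empty.
            raw = env.get(name)
            return int(raw) if raw else default

        # The environment variable names below are assumptions for illustration.
        return cls(
            max_tree_files=read("SIRCHMUNK_MAX_TREE_FILES", 50),
            max_hint_clusters=read("SIRCHMUNK_MAX_HINT_CLUSTERS", 50),
            max_hint_trees=read("SIRCHMUNK_MAX_HINT_TREES", 100),
        )


def select_tree_files(tree_cache: Path, limits: ProbeLimits) -> List[Path]:
    # Same selection as the reviewed loop, with the slice bound taken from config.
    return sorted(tree_cache.glob("*.json"))[: limits.max_tree_files]
```

Passing the limits object explicitly keeps the defaults in one place and lets tests or large deployments override them without editing the probing code.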
async def _check_clusters(self, report: LintReport, auto_fix: bool) -> None:
    """Validate each knowledge cluster."""
    all_clusters = await self._storage.find("", limit=10000)
- Add narrow quick ratio formula (Cash+STI+Receivables)/CL instead of broad (CA-Inventories)/CL, matching standard financial analysis
- Add interest coverage ratio rule: negative EBIT → ratio = 0
- Strengthen yes/no answer format enforcement (MUST begin with Yes/No)
- Add nature/composition guidance (describe proportions, not just totals)
- Add listing completeness instruction
- Refine rounding: 1dp for %, whole number for $≥10 in target unit, 2dp for $<10 in target unit
- Add "use query formula if provided" precedence in data requirements

Stable +3 improvements verified: Verizon quick ratio, AMCOR restructuring, Netflix unit conversion (all 3/4 correct across 4 benchmark runs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
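The two quick ratio formulas and the interest coverage rule from this commit message can be written out as a short sketch. The function names and parameter names are illustrative only and are not taken from the repository; the commit itself describes answer-generation guidance, not these functions.

```python
def narrow_quick_ratio(cash, short_term_investments, receivables, current_liabilities):
    """(Cash + STI + Receivables) / CL, the narrow form the commit adopts."""
    return (cash + short_term_investments + receivables) / current_liabilities


def broad_quick_ratio(current_assets, inventories, current_liabilities):
    """(CA - Inventories) / CL, the broad form the commit moves away from."""
    return (current_assets - inventories) / current_liabilities


def interest_coverage(ebit, interest_expense):
    """EBIT / interest expense, reported as 0 when EBIT is negative (rule from the commit)."""
    if ebit < 0:
        return 0.0
    return ebit / interest_expense
```

The two quick ratios differ whenever current assets include items other than cash, short-term investments, receivables, and inventories (for example prepaid expenses), which is why the choice of formula changes the answer.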
🚀 New Features & Capabilities
🛠️ Improvements & Optimizations
🐛 Bug Fixes & Performance Tuning