[Feature] Introduce Knowledge Compiling Module by wangxingjun778 · Pull Request #160 · modelscope/sirchmunk

wangxingjun778 · 2026-04-14T17:44:02Z

🚀 New Features & Capabilities

Knowledge Compile Module: Introduced a new module for offline document processing.
Hierarchical Indexing: Converts documents into hierarchical tree indices and knowledge clusters to structure data effectively.
New CLI Command: Added a command-line interface entry point to trigger the knowledge compilation process.
Search Pipeline Integration: Integrated the generated artifacts (tree indices/clusters) into the existing search pipeline to enhance retrieval precision.

🛠️ Improvements & Optimizations

Health Check Utility: Added a linting utility to perform system health checks.
I/O Optimization: Implemented file hash reuse throughout the pipeline to reduce redundant I/O operations.
Configuration Flexibility: Removed hardcoded model names and processing limits, allowing for dynamic configuration.

🐛 Bug Fixes & Performance Tuning

Cross-Reference Performance: Addressed performance bottlenecks identified in the cross-reference building process.

gemini-code-assist

Code Review

This pull request introduces a 'Knowledge Compile' module, enabling offline document processing into hierarchical tree indices and knowledge clusters. It adds a new CLI command, a linting utility for health checks, and integrates these artifacts into the search pipeline for improved retrieval precision. Feedback focuses on addressing performance bottlenecks in cross-reference building, eliminating hardcoded model names and processing limits, and optimizing I/O by reusing file hashes throughout the pipeline.

gemini-code-assist · 2026-04-14T17:46:28Z

+        for i in range(len(cluster_ids)):
+            for j in range(i + 1, len(cluster_ids)):
+                cid_a, cid_b = cluster_ids[i], cluster_ids[j]
+                shared = cluster_to_files[cid_a] & cluster_to_files[cid_b]
+                if not shared:
+                    continue
+
+                pair_key = (min(cid_a, cid_b), max(cid_a, cid_b))
+                if pair_key in pairs_seen:
+                    continue
+                pairs_seen.add(pair_key)
+
+                weight = min(len(shared) * 0.25, 1.0)
+                c_a = await self._storage.get(cid_a)
+                c_b = await self._storage.get(cid_b)
+                if c_a and c_b:
+                    self._add_edge(c_a, cid_b, "co_occur", weight)
+                    self._add_edge(c_b, cid_a, "co_occur", weight)
+                    await self._storage.update(c_a)
+                    await self._storage.update(c_b)
+                    edges_created += 1


The _build_cross_references method runs an O(N^2) loop over cluster pairs and performs multiple asynchronous database operations (get, update) inside the inner loop, so each cluster is fetched and written again for every pair it belongs to. This will lead to severe performance issues as the knowledge base grows. Consider batching these updates or using a more efficient graph construction strategy.
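One possible shape for the batched approach is sketched below. It is not the PR's implementation: `InMemoryStorage`, the dict-based cluster records, and the `edges` list are hypothetical stand-ins for the real storage and `_add_edge`. The idea is to invert the cluster-to-files mapping so that only clusters sharing a file are paired, load each involved cluster once, and write each modified cluster once.

```python
import asyncio
from collections import defaultdict
from itertools import combinations


class InMemoryStorage:
    """Hypothetical stand-in for the PR's storage; counts calls for illustration."""

    def __init__(self, clusters):
        self._clusters = clusters
        self.get_calls = 0
        self.update_calls = 0

    async def get(self, cid):
        self.get_calls += 1
        return self._clusters.get(cid)

    async def update(self, cluster):
        self.update_calls += 1
        self._clusters[cluster["id"]] = cluster


async def build_cross_references(storage, cluster_to_files):
    # Invert the mapping: file -> clusters that contain it.
    file_to_clusters = defaultdict(set)
    for cid, files in cluster_to_files.items():
        for path in files:
            file_to_clusters[path].add(cid)

    # Count shared files per unordered pair; unrelated clusters are never paired.
    shared_counts = defaultdict(int)
    for cids in file_to_clusters.values():
        for a, b in combinations(sorted(cids), 2):
            shared_counts[(a, b)] += 1

    # Load every involved cluster exactly once.
    involved = sorted({cid for pair in shared_counts for cid in pair})
    loaded = await asyncio.gather(*(storage.get(cid) for cid in involved))
    clusters = {cid: c for cid, c in zip(involved, loaded) if c}

    edges_created = 0
    touched = set()
    for (a, b), n_shared in shared_counts.items():
        if a in clusters and b in clusters:
            weight = min(n_shared * 0.25, 1.0)  # same weighting as the hunk above
            clusters[a]["edges"].append((b, "co_occur", weight))
            clusters[b]["edges"].append((a, "co_occur", weight))
            touched.update((a, b))
            edges_created += 1

    # Write each modified cluster exactly once.
    await asyncio.gather(*(storage.update(clusters[cid]) for cid in sorted(touched)))
    return edges_created
```

The worst case is still quadratic when a single file belongs to every cluster, but storage traffic drops to one get and one update per involved cluster instead of two of each per pair.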

gemini-code-assist · 2026-04-14T17:46:29Z

+    llm = OpenAIChat(
+        base_url=os.getenv("LLM_BASE_URL", "https://api.openai.com/v1"),
+        api_key=llm_api_key,
+        model=os.getenv("LLM_MODEL_NAME", "gpt-5.2"),


The default model name "gpt-5.2" is hardcoded here and repeated in other command functions (_compile_status, _compile_lint). Centralize it in a single constant or configuration value so the default is defined once, which avoids duplication and makes future updates easier.
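A minimal sketch of one way to centralize the defaults follows. `LLMSettings`, `load_llm_settings`, and the `LLM_API_KEY` variable name are assumptions made for illustration; only `LLM_BASE_URL`, `LLM_MODEL_NAME`, and the two default values appear in the hunk above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Single source of truth for defaults, shared by every command function.
DEFAULT_LLM_BASE_URL = "https://api.openai.com/v1"
DEFAULT_LLM_MODEL_NAME = "gpt-5.2"


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    api_key: str
    model: str


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> LLMSettings:
    """Resolve LLM settings from the environment, falling back to the defaults."""
    env = os.environ if env is None else env
    return LLMSettings(
        base_url=env.get("LLM_BASE_URL", DEFAULT_LLM_BASE_URL),
        api_key=env.get("LLM_API_KEY", ""),  # hypothetical variable name
        model=env.get("LLM_MODEL_NAME", DEFAULT_LLM_MODEL_NAME),
    )
```

Each command could then build its client from `load_llm_settings()` rather than repeating the `os.getenv` calls and default strings.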

gemini-code-assist · 2026-04-14T17:46:29Z

+    """Result of compiling a single file."""
+
+    path: str
+    tree: Optional[DocumentTree] = None


Add a file_hash field to FileCompileResult to allow passing the already-computed hash through the pipeline, avoiding redundant I/O.

Suggested change

- tree: Optional[DocumentTree] = None
+ path: str
+ file_hash: str = ""

gemini-code-assist · 2026-04-14T17:46:29Z

+        When *shallow* is True (or file is ineligible for tree indexing),
+        the pipeline skips tree building and summarises via a direct LLM call.
+        """
+        result = FileCompileResult(path=entry.path)


Populate the file_hash in the result object using the hash already available in FileEntry.

Suggested change

- result = FileCompileResult(path=entry.path)
+ result = FileCompileResult(path=entry.path, file_hash=entry.file_hash)

gemini-code-assist · 2026-04-14T17:46:29Z

+                    report.trees_built += 1
+                # Update manifest
+                manifest.files[result.path] = FileManifestEntry(
+                    file_hash=get_fast_hash(result.path) or "",


Use the file_hash from the result object instead of re-calculating it by reading the file again.

Suggested change

- file_hash=get_fast_hash(result.path) or "",
+ file_hash=result.file_hash,
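Taken together, the three hash-related suggestions amount to computing the hash once at scan time and carrying it through to the manifest. The sketch below illustrates that flow under simplifying assumptions: the dataclasses are reduced to the fields relevant here, and `CountingHasher`, `scan`, `compile_file`, and `update_manifest` are hypothetical helpers, not functions from the PR.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FileEntry:
    path: str
    file_hash: str


@dataclass
class FileCompileResult:
    path: str
    file_hash: str = ""


@dataclass
class FileManifestEntry:
    file_hash: str


class CountingHasher:
    """Hypothetical hasher over in-memory contents; counts how often it runs."""

    def __init__(self, contents: Dict[str, bytes]):
        self.contents = contents
        self.calls = 0

    def __call__(self, path: str) -> str:
        self.calls += 1
        return hashlib.sha256(self.contents[path]).hexdigest()


def scan(paths: List[str], hasher: CountingHasher) -> List[FileEntry]:
    # The only place where file contents are hashed.
    return [FileEntry(path=p, file_hash=hasher(p)) for p in paths]


def compile_file(entry: FileEntry) -> FileCompileResult:
    # Carry the already-computed hash forward instead of dropping it.
    return FileCompileResult(path=entry.path, file_hash=entry.file_hash)


def update_manifest(manifest: Dict[str, FileManifestEntry], result: FileCompileResult) -> None:
    # Reuse the hash from the result; no second read of the file.
    manifest[result.path] = FileManifestEntry(file_hash=result.file_hash)
```

With this shape the hasher runs once per file regardless of how many later stages need the value.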

gemini-code-assist · 2026-04-14T17:46:29Z

+            from sirchmunk.learnings.tree_indexer import DocumentTree
+
+            trees: List[DocumentTree] = []
+            for tree_file in sorted(tree_cache.glob("*.json"))[:50]:


The number of tree indices processed during probing is hardcoded to 50. Similar hardcoded limits exist in _probe_compile_hints (50 clusters, 100 trees). In large environments, these limits significantly restrict the effectiveness of the knowledge network. Consider making these thresholds configurable.
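One way to make these thresholds configurable is sketched below. The `ProbeLimits` class, its field names, and the `SIRCHMUNK_PROBE_*` environment variable names are all hypothetical; the defaults mirror the 50/50/100 values mentioned in the comment above.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


def _read_positive_int(env: Mapping[str, str], name: str, default: int) -> int:
    """Parse a positive integer from env, using the default when unset or blank."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {raw!r}")
    return value


@dataclass(frozen=True)
class ProbeLimits:
    max_trees: int = 50
    max_hint_clusters: int = 50
    max_hint_trees: int = 100

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "ProbeLimits":
        env = os.environ if env is None else env
        return cls(
            max_trees=_read_positive_int(env, "SIRCHMUNK_PROBE_MAX_TREES", 50),
            max_hint_clusters=_read_positive_int(env, "SIRCHMUNK_PROBE_MAX_HINT_CLUSTERS", 50),
            max_hint_trees=_read_positive_int(env, "SIRCHMUNK_PROBE_MAX_HINT_TREES", 100),
        )


def select_tree_files(names: List[str], limits: ProbeLimits) -> List[str]:
    # Same deterministic ordering as the hunk above, with a configurable cap.
    return sorted(names)[: limits.max_trees]
```

Keeping the limits in one frozen object makes it straightforward to pass them into the probing functions and to override them in tests.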

gemini-code-assist · 2026-04-14T17:46:29Z

+
+    async def _check_clusters(self, report: LintReport, auto_fix: bool) -> None:
+        """Validate each knowledge cluster."""
+        all_clusters = await self._storage.find("", limit=10000)


The limit of 10,000 clusters for linting might be insufficient for very large knowledge bases. Consider using pagination or making the limit configurable.
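A pagination loop along these lines could replace the fixed limit. This is a sketch only: it assumes the storage's `find` accepts an `offset` parameter, which the hunk above does not show, and `FakeStorage` is a hypothetical in-memory stand-in.

```python
import asyncio
from typing import AsyncIterator, List


class FakeStorage:
    """Hypothetical storage whose find() supports limit/offset paging."""

    def __init__(self, n: int):
        self._items = [f"cluster-{i}" for i in range(n)]

    async def find(self, query: str, limit: int, offset: int = 0) -> List[str]:
        return self._items[offset : offset + limit]


async def iter_clusters(storage, page_size: int = 1000) -> AsyncIterator[str]:
    """Yield every cluster, fetching page_size records at a time."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        page = await storage.find("", limit=page_size, offset=offset)
        for cluster in page:
            yield cluster
        if len(page) < page_size:  # a short (or empty) page means no more data
            return
        offset += len(page)


async def count_clusters(storage, page_size: int = 1000) -> int:
    total = 0
    async for _ in iter_clusters(storage, page_size):
        total += 1
    return total
```

Because clusters are streamed page by page, the lint check never needs to hold the whole knowledge base in memory at once.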

- Add narrow quick ratio formula (Cash+STI+Receivables)/CL instead of broad (CA-Inventories)/CL; matches standard financial analysis
- Add interest coverage ratio rule: negative EBIT → ratio = 0
- Strengthen yes/no answer format enforcement (MUST begin with Yes/No)
- Add nature/composition guidance (describe proportions, not just totals)
- Add listing completeness instruction
- Refine rounding: 1dp for %, whole number for $≥10 in target unit, 2dp for $<10 in target unit
- Add "use query formula if provided" precedence in data requirements

Stable +3 improvements verified: Verizon quick ratio, AMCOR restructuring, Netflix unit conversion (all 3/4 correct across 4 benchmark runs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
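The ratio and rounding rules in this commit message can be written out as plain arithmetic. The functions below are an illustrative sketch with hypothetical names; in the PR these rules are described as answer-generation guidance, so this is not code from the repository.

```python
def narrow_quick_ratio(cash, short_term_investments, receivables, current_liabilities):
    """Narrow form: (Cash + STI + Receivables) / CL."""
    return (cash + short_term_investments + receivables) / current_liabilities


def broad_quick_ratio(current_assets, inventories, current_liabilities):
    """Broad form: (CA - Inventories) / CL."""
    return (current_assets - inventories) / current_liabilities


def interest_coverage(ebit, interest_expense):
    """EBIT / interest expense, with negative EBIT mapped to 0 per the rule above."""
    if ebit < 0:
        return 0.0
    return ebit / interest_expense


def format_percent(value):
    """Percentages are reported to one decimal place."""
    return f"{value:.1f}%"


def format_amount(value):
    """Dollar amounts in the target unit: whole number if >= 10, else two decimals."""
    if abs(value) >= 10:
        return f"{value:.0f}"
    return f"{value:.2f}"
```

The two quick-ratio forms differ whenever current assets include items other than cash, short-term investments, receivables, and inventories (prepaid expenses, for example), which is why the choice of formula can change the answer.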

wangxingjun778 added 9 commits April 13, 2026 14:40

bump version

3eb5354

Introduce Sirchmunk Learnings (insights from pageindex and LLM wiki)

b72a878

improve compile infer

c4f4b16

improve search pipeline for compile mode

6458477

fix and enhance llm wiki and tree index for FAST search

1f6f799

fix _extract_catalog_keywords for llm wiki

077be35

add tree guided sampling

a602197

fix compile quality and large-file processing

8233c35

adopt the latest compile processing

1de1c98

gemini-code-assist Bot reviewed Apr 14, 2026

wangxingjun778 added 20 commits April 15, 2026 16:02

refactor tree indexing with toc

938ced1

ok Merge branch 'main' of github.com:modelscope/sirchmunk into feat/search_wiki

7b65b4b

enhance compile for excel and add embedding fallback for rga keywords retrieval

29c0909

fix storage

d1f1fd4

add financebench

caf8e05

add llm judge for financebench

4a0a017

Adapt older knowledge cluster data structure

613c099

update finance bench readme

6858418

refactor config for finbench

9441ef2

refactor financebench readme

f1f86fa

update readme for finbench

0e46ef5

enhance tree indexes usage for search pipeline

2cf5c37

fix issues

c0b0db5

update tree index

e8184d0

update finbench readme

dc27ed9

update finbench readme

8723b85

update should answer thres

ca9a609

fix eval for finbench in runner

34c181e

refactor metrics as LLM judge for finbench

2b4714e

update config

a184e86

wangxingjun778 and others added 30 commits April 26, 2026 23:05

fix review

bdd8bdc

improve kreuzberg table extraction

d3b91d6

enhance compiler table extraction

63ed047

fix table extraction

b760119

improve compile for summary and table

e55ada7

fix tree index

fe351a1

update compiler

cb1ba96

improve compile efficiency

93d4a1f

improve compile mem usage

929fbc5

improve extractor multi-processing

207fe59

fix ProcessPoolExecutor

bbc2bbd

clean methods for compiler

5af51df

improve all corpus

af5f7e1

tree index and rga fusion

7439521

fallback hybrid tree indexing

cec209d

improve search pipeline for hybrid

59beaea

Add compile tree index for DEEP mode

ec6f6b1

improve DEEP

ec08cd8

add tree navi for react loop in DEEP mode

e8edb2f

update deep

5c359b2

refactor deep mode

9b34478

enhance search deep

f81b24f

fallback

7a0adf7

refine deep mode

5a55aa8

fix pipeline deep

d4a6366

refactor deep mode for tree indexing loop

436d888

improve search and prompts

7484f68

refine search

629f27a

fallback and improve prompts

9c53bb6

wangxingjun778 wants to merge 71 commits into main from feat/search_wiki
