
Knowledge base ext #281

Draft
sroussey wants to merge 3 commits into main from knowledge-base-ext

Conversation

sroussey (Collaborator) commented Mar 5, 2026

No description provided.

sroussey and others added 3 commits March 4, 2026 23:12
…switch from the name datasets

- Replaced the deprecated `@workglow/dataset` with the new `@workglow/knowledge-base` package across various modules, including updates to package.json, README, and task schemas.
- Consolidated document and chunk management under the new KnowledgeBase architecture, enhancing the handling of documents and chunks in RAG workflows.
- Updated import paths and dependencies in multiple files to reflect the transition to the new package structure.
- Removed the old dataset package and its related files, streamlining the codebase.
- Added support for shared-table mode in the KnowledgeBase, allowing multiple knowledge bases to share the same underlying storage tables, reducing table proliferation.
- Implemented `ScopedTabularStorage` and `ScopedVectorStorage` wrappers to manage data partitioning by `kb_id`.
- Updated documentation to include new shared-table features and examples for setting up shared storage.
- Enhanced format annotations and task schemas to accommodate the new storage structure.

Copilot AI (Contributor) left a comment

Pull request overview

This PR renames the former @workglow/dataset package to @workglow/knowledge-base, updates imports/docs throughout the monorepo, and extends the knowledge base system with persistent registry metadata plus shared-table multi-tenant storage helpers.

Changes:

  • Replace @workglow/dataset dependency/exports/usages with @workglow/knowledge-base across packages, examples, and docs.
  • Make registerKnowledgeBase() asynchronous and persist KB metadata via a new KnowledgeBaseRepository.
  • Add shared-table mode support via Shared* schemas and ScopedTabularStorage / ScopedVectorStorage wrappers.

Reviewed changes

Copilot reviewed 54 out of 69 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `packages/workglow/src/common.ts` | Re-export knowledge-base package instead of dataset. |
| `packages/workglow/package.json` | Dependency rename to @workglow/knowledge-base. |
| `packages/workglow/README.md` | Package list updated to knowledge-base. |
| `packages/test/src/test/util/Document.test.ts` | Test imports switched to knowledge-base. |
| `packages/test/src/test/task-graph/InputResolver.test.ts` | Test imports switched to knowledge-base. |
| `packages/test/src/test/rag/StructuralParser.test.ts` | Test imports switched to knowledge-base. |
| `packages/test/src/test/rag/RagWorkflow.integration.test.ts` | Test imports switched to knowledge-base. |
| `packages/test/src/test/rag/HybridSearchTask.test.ts` | Import rename + await registerKnowledgeBase. |
| `packages/test/src/test/rag/HierarchicalChunker.test.ts` | Test imports switched to knowledge-base. |
| `packages/test/src/test/rag/FullChain.test.ts` | Test imports switched to knowledge-base. |
| `packages/test/src/test/rag/EndToEnd.integration.test.ts` | Test imports switched to knowledge-base. |
| `packages/test/src/test/rag/DocumentRepository.test.ts` | Test imports switched to knowledge-base. |
| `packages/test/src/test/rag/DocumentChunkUpsertTask.test.ts` | Import rename + await registerKnowledgeBase. |
| `packages/test/src/test/rag/DocumentChunkSearchTask.test.ts` | Import rename + await registerKnowledgeBase. |
| `packages/test/src/test/rag/DocumentChunkRetrievalTask.test.ts` | Import rename + await registerKnowledgeBase. |
| `packages/test/src/test/rag/Document.test.ts` | Test imports switched to knowledge-base. |
| `packages/test/src/test/rag/ChunkToVector.test.ts` | Test imports switched to knowledge-base. |
| `packages/test/package.json` | Peer/dev deps renamed to knowledge-base. |
| `packages/task-graph/src/task/InputResolver.ts` | Removes outdated resolver-format comment. |
| `packages/storage/src/vector/README.md` | Docs updated to reference KnowledgeBase integration. |
| `packages/storage/README.md` | Removes outdated repository schema examples. |
| `packages/knowledge-base/tsconfig.json` | New tsconfig for the renamed package. |
| `packages/knowledge-base/src/util/DatasetSchema.ts` | Updates TypeKnowledgeBase format to knowledge-base. |
| `packages/knowledge-base/src/types.ts` | New entrypoint re-export. |
| `packages/knowledge-base/src/node.ts` | New entrypoint re-export. |
| `packages/knowledge-base/src/knowledge-base/createKnowledgeBase.ts` | Adds title/description options + awaits KB registration. |
| `packages/knowledge-base/src/knowledge-base/SharedTableSchemas.ts` | Adds shared-table schemas + index definitions. |
| `packages/knowledge-base/src/knowledge-base/ScopedVectorStorage.ts` | Adds KB-scoped wrapper for shared vector storage. |
| `packages/knowledge-base/src/knowledge-base/ScopedTabularStorage.ts` | Adds KB-scoped wrapper for shared tabular storage. |
| `packages/knowledge-base/src/knowledge-base/KnowledgeBaseSchema.ts` | Adds KB metadata schema + shared-table helpers + table naming. |
| `packages/knowledge-base/src/knowledge-base/KnowledgeBaseRepository.ts` | New repository abstraction for KB metadata persistence. |
| `packages/knowledge-base/src/knowledge-base/KnowledgeBaseRegistry.ts` | Async KB registration + persistent record creation + input resolver. |
| `packages/knowledge-base/src/knowledge-base/KnowledgeBase.ts` | Adds title/description fields to KB instances. |
| `packages/knowledge-base/src/knowledge-base/InMemoryKnowledgeBaseRepository.ts` | In-memory implementation of KB metadata repository. |
| `packages/knowledge-base/src/document/StructuralParser.ts` | New StructuralParser implementation in knowledge-base package. |
| `packages/knowledge-base/src/document/DocumentStorageSchema.ts` | New document tabular schema/types. |
| `packages/knowledge-base/src/document/DocumentSchema.ts` | New document node schemas/types. |
| `packages/knowledge-base/src/document/DocumentNode.ts` | New document-tree helpers (tokens, traversal, ranges). |
| `packages/knowledge-base/src/document/Document.ts` | New Document class implementation. |
| `packages/knowledge-base/src/common.ts` | Expands exports (repository, schemas, scoped/shared helpers). |
| `packages/knowledge-base/src/common-server.ts` | New server entrypoint re-export. |
| `packages/knowledge-base/src/chunk/ChunkVectorStorageSchema.ts` | New chunk vector storage schema/types. |
| `packages/knowledge-base/src/chunk/ChunkSchema.ts` | New unified ChunkRecord schema/type. |
| `packages/knowledge-base/src/bun.ts` | New Bun entrypoint. |
| `packages/knowledge-base/src/browser.ts` | New browser entrypoint. |
| `packages/knowledge-base/package.json` | Renames package to @workglow/knowledge-base. |
| `packages/knowledge-base/README.md` | Docs updated + adds shared-table mode documentation. |
| `packages/knowledge-base/LICENSE` | Adds Apache 2.0 license file to the package. |
| `packages/knowledge-base/CHANGELOG.md` | Renames changelog header to knowledge-base. |
| `packages/dataset/src/knowledge-base/KnowledgeBaseRegistry.ts` | Removes old dataset KB registry implementation. |
| `packages/ai/tsconfig.json` | Removes baseUrl configuration. |
| `packages/ai/src/task/StructuralParserTask.ts` | Imports switched to knowledge-base. |
| `packages/ai/src/task/HierarchyJoinTask.ts` | Imports switched to knowledge-base. |
| `packages/ai/src/task/HierarchicalChunkerTask.ts` | Imports switched to knowledge-base. |
| `packages/ai/src/task/DocumentEnricherTask.ts` | Imports switched to knowledge-base. |
| `packages/ai/src/task/ContextBuilderTask.ts` | Imports switched to knowledge-base. |
| `packages/ai/src/task/ChunkVectorUpsertTask.ts` | Imports switched to knowledge-base. |
| `packages/ai/src/task/ChunkVectorSearchTask.ts` | Imports switched + unused context renamed to _context. |
| `packages/ai/src/task/ChunkVectorHybridSearchTask.ts` | Imports switched to knowledge-base. |
| `packages/ai/src/task/ChunkToVectorTask.ts` | Imports switched to knowledge-base. |
| `packages/ai/src/task/ChunkRetrievalTask.ts` | Imports switched to knowledge-base. |
| `packages/ai/package.json` | Peer/dev deps renamed to knowledge-base. |
| `packages/ai/README.md` | Docs updated for knowledge-base + supported format list updated. |
| `examples/web/vite.config.js` | Workspace package allowlist updated to knowledge-base. |
| `examples/web/package.json` | Dependency renamed to knowledge-base. |
| `docs/developers/03_extending.md` | Docs updated to include knowledge-base format annotation and examples. |
| `bun.lock` | Updates workspace package references from dataset to knowledge-base. |
| `TODO.md` | Removes/updates dataset-related TODO entries. |
| `.claude/CLAUDE.md` | Updates schema conventions and package naming references. |

Comments suppressed due to low confidence (3)

packages/knowledge-base/README.md:433

  • registerKnowledgeBase is now async, but the README examples still call it without await. This can lead to flaky examples/tests (the KB may not be persisted/available before subsequent reads). Update the snippet to await registerKnowledgeBase(...) (and make the surrounding example async) to match the new API.

packages/knowledge-base/README.md:545

  • Shared-table mode docs call registerKnowledgeBase(...) without await, but registerKnowledgeBase is async in this PR. The example should await these calls (and be inside an async function) to ensure the KB record is persisted before any follow-up operations.

packages/knowledge-base/package.json:6

  • package.json was renamed to @workglow/knowledge-base, but the description still says "Dataset package for Workglow.". Update the description to reflect the new package name/purpose to avoid confusing consumers and generated docs.
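
The ordering problem behind the two README notes above can be shown with a small self-contained sketch. The names `registerKnowledgeBaseSketch` and `getKnowledgeBaseSketch` are hypothetical stand-ins for illustration, not the actual `@workglow/knowledge-base` exports, and the in-memory `Map` only simulates a persistence step.

```typescript
// Hypothetical stand-ins: these are NOT the real @workglow/knowledge-base exports.
interface KnowledgeBaseLike {
  name: string;
}

const registrySketch = new Map<string, KnowledgeBaseLike>();

// Simulates an async registration that finishes a write before the KB becomes visible.
async function registerKnowledgeBaseSketch(id: string, kb: KnowledgeBaseLike): Promise<void> {
  await Promise.resolve(); // stands in for the repository write
  registrySketch.set(id, kb);
}

function getKnowledgeBaseSketch(id: string): KnowledgeBaseLike | undefined {
  return registrySketch.get(id);
}

async function demo(): Promise<{ beforeAwait: boolean; afterAwait: boolean }> {
  const pending = registerKnowledgeBaseSketch("docs", { name: "docs" });
  // Without awaiting, this lookup runs before the registration has completed.
  const beforeAwait = getKnowledgeBaseSketch("docs") !== undefined;
  await pending;
  // After awaiting, the KB is reliably available.
  const afterAwait = getKnowledgeBaseSketch("docs") !== undefined;
  return { beforeAwait, afterAwait };
}
```

In this sketch the un-awaited lookup misses the KB and the awaited one finds it, which is the flakiness the notes describe.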


Comment on lines +21 to +26
/**
* Wrapper implementing `ITabularStorage` that delegates to an inner shared
* storage instance, injecting `kb_id` on writes and filtering by `kb_id` on
* reads. The outer interface does not include `kb_id` — it is transparent to
* the `KnowledgeBase` class.
*/

Copilot AI Mar 5, 2026

Shared-table mode introduces new scoping logic (ScopedTabularStorage / ScopedVectorStorage) but there are no accompanying tests validating isolation (e.g., writes inject kb_id, reads filter/strip, and deletes can’t affect other KBs). Adding targeted vitest coverage for these wrappers would help prevent multi-tenant data leaks/regressions.
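
As a rough illustration of the isolation properties such tests would check, here is a simplified in-memory model. `SharedTableSketch` and `ScopedTableSketch` are hypothetical classes written for this example; the real `ITabularStorage` interface and the PR's wrappers are richer and may behave differently.

```typescript
// Hypothetical, simplified shapes; not the PR's actual storage interfaces.
type PlainRow = Record<string, unknown>;
type TaggedRow = PlainRow & { kb_id: string };

class SharedTableSketch {
  readonly rows: TaggedRow[] = [];
}

class ScopedTableSketch {
  private readonly inner: SharedTableSketch;
  private readonly kbId: string;

  constructor(inner: SharedTableSketch, kbId: string) {
    this.inner = inner;
    this.kbId = kbId;
  }

  // Writes inject kb_id so callers never supply it themselves.
  put(row: PlainRow): void {
    this.inner.rows.push({ ...row, kb_id: this.kbId });
  }

  // Reads filter by kb_id and strip it from the returned rows.
  getAll(): PlainRow[] {
    return this.inner.rows
      .filter((row) => row.kb_id === this.kbId)
      .map(({ kb_id: _kbId, ...rest }) => rest);
  }
}

const sharedTable = new SharedTableSketch();
const kbA = new ScopedTableSketch(sharedTable, "kb-a");
const kbB = new ScopedTableSketch(sharedTable, "kb-b");

// Both KBs use the same doc_id; the shared table keeps them apart via kb_id.
kbA.put({ doc_id: "1", text: "alpha" });
kbB.put({ doc_id: "1", text: "beta" });
```

A vitest suite for the real wrappers could assert the same three properties: the inner row carries `kb_id`, a scoped read returns only its own rows, and the returned rows do not expose `kb_id`.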

Copilot uses AI. Check for mistakes.
Comment on lines +76 to +78
async delete(key: PrimaryKey | Entity): Promise<void> {
  return this.inner.delete(key as any);
}

Copilot AI Mar 5, 2026

ScopedTabularStorage.delete() delegates directly to the inner storage without scoping to kb_id. In shared-table mode this can delete rows belonging to other knowledge bases when primary keys are not globally unique (or when an Entity object is passed). Consider implementing delete as a scoped delete (e.g., fetch+verify kb_id first, or call inner.deleteSearch({ ...primaryKey, kb_id: this.kbId })).
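
A minimal sketch of the scoped-delete idea, under the assumption that the inner storage offers some criteria-based delete. `SharedDocsSketch.deleteWhere` is a hypothetical stand-in for that call and is not claimed to match the signature of `deleteSearch` in the repo.

```typescript
// Hypothetical row shape and storage; not the PR's actual types.
type DocRow = { kb_id: string; doc_id: string; text: string };

class SharedDocsSketch {
  rows: DocRow[] = [];

  // Deletes every row matching all provided fields and returns how many were removed.
  deleteWhere(criteria: Partial<DocRow>): number {
    const before = this.rows.length;
    const keys = Object.keys(criteria) as (keyof DocRow)[];
    this.rows = this.rows.filter((row) => !keys.every((k) => row[k] === criteria[k]));
    return before - this.rows.length;
  }
}

class ScopedDocsSketch {
  private readonly inner: SharedDocsSketch;
  private readonly kbId: string;

  constructor(inner: SharedDocsSketch, kbId: string) {
    this.inner = inner;
    this.kbId = kbId;
  }

  delete(key: { doc_id: string }): number {
    // kb_id is spread last so a caller-supplied value cannot override the scope.
    return this.inner.deleteWhere({ ...key, kb_id: this.kbId });
  }
}

const sharedDocs = new SharedDocsSketch();
sharedDocs.rows.push(
  { kb_id: "kb-a", doc_id: "1", text: "alpha" },
  { kb_id: "kb-b", doc_id: "1", text: "beta" }
);

// Deleting doc_id "1" through the kb-a scope must leave kb-b's row untouched.
const removedCount = new ScopedDocsSketch(sharedDocs, "kb-a").delete({ doc_id: "1" });
```

The alternative mentioned in the comment, fetching the row and verifying its `kb_id` before deleting, costs an extra read but works when the inner storage only supports delete by primary key.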

Comment on lines +43 to +56
async similaritySearch(
  query: TypedArray,
  options?: VectorSearchOptions<Metadata>
): Promise<(Entity & { score: number })[]> {
  const results = await this.inner.similaritySearch(query, {
    ...options,
    // Request extra results to account for post-filtering
    topK: options?.topK ? options.topK * 3 : undefined,
  } as any);

  const filtered = results
    .filter((r: any) => r.kb_id === this.kbId)
    .slice(0, options?.topK);

Copilot AI Mar 5, 2026

ScopedVectorStorage.similaritySearch() only requests extra results when options.topK is provided. When callers omit topK (letting the inner storage default to 10), post-filtering by kb_id can easily return far fewer than the expected default because the initial result set may be dominated by other KBs. Consider computing an explicit effective topK (e.g., (options?.topK ?? defaultTopK) * factor) and slicing to (options?.topK ?? defaultTopK) after filtering.
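
The suggested effective-topK computation might look like the sketch below. `scopedTopK`, `DEFAULT_TOP_K`, and `OVERFETCH_FACTOR` are illustrative names; the default of 10 is taken from the comment above and the factor of 3 from the quoted code, and the `search` callback is a hypothetical stand-in for the inner storage's similarity search.

```typescript
interface ScoredRow {
  kb_id: string;
  id: string;
  score: number;
}

const DEFAULT_TOP_K = 10; // assumption: the inner default mentioned in the review comment
const OVERFETCH_FACTOR = 3; // same multiplier as the quoted code

// `search` is a hypothetical stand-in for the inner storage's similarity search.
function scopedTopK(
  search: (topK: number) => ScoredRow[],
  kbId: string,
  requestedTopK?: number
): ScoredRow[] {
  // Resolve one limit up front and use it for both the over-fetch and the final slice.
  const limit = requestedTopK ?? DEFAULT_TOP_K;
  const candidates = search(limit * OVERFETCH_FACTOR);
  return candidates.filter((row) => row.kb_id === kbId).slice(0, limit);
}

// Fake shared index: 60 rows in descending score order; every third row belongs to "kb-a".
const sharedIndex: ScoredRow[] = Array.from({ length: 60 }, (_, i) => ({
  kb_id: i % 3 === 0 ? "kb-a" : "kb-b",
  id: `chunk-${i}`,
  score: 1 - i / 100,
}));
const fakeSearch = (topK: number): ScoredRow[] => sharedIndex.slice(0, topK);

const defaultResults = scopedTopK(fakeSearch, "kb-a"); // no topK supplied
const explicitResults = scopedTopK(fakeSearch, "kb-a", 2);
```

Over-fetching by a fixed factor still cannot guarantee a full result set when one KB holds only a small share of the shared table; pushing the `kb_id` filter into the inner search, where the backend supports it, would avoid that limitation.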

Comment on lines 295 to +296
  import { getChunkVectorRepository, getGlobalChunkVectorRepositories } from "@workglow/storage";
- import { registerChunkVectorRepository, getGlobalChunkVectorRepositories } from "@workglow/dataset";
+ import { registerChunkVectorRepository, getGlobalChunkVectorRepositories } from "@workglow/knowledge-base";

Copilot AI Mar 5, 2026

This README example imports getGlobalChunkVectorRepositories from both @workglow/storage and @workglow/knowledge-base, which will be a duplicate identifier. Also, registerChunkVectorRepository / getChunkVectorRepository / getGlobalChunkVectorRepositories don’t appear to exist anywhere in the repo, so this snippet likely won’t compile as written—please update it to the actual registry API (or remove the global-registry section if vector storages aren’t globally resolvable).

`KnowledgeBase` — unified class owning both document storage (tabular) and chunk storage (vector).

- `createKnowledgeBase({ name, vectorDimensions })` — factory (in-memory, auto-registers)
- `registerKnowledgeBase(id, kb)` / `getKnowledgeBase(id)` / `getGlobalKnowledgeBases()` — global registry

Copilot AI Mar 5, 2026

.claude/CLAUDE.md still documents registerKnowledgeBase(id, kb) as a synchronous call, but in this PR registerKnowledgeBase is async and should be awaited. Update the doc snippet/bullet to reflect the async API to avoid misleading contributors.

Suggested change
- `registerKnowledgeBase(id, kb)` / `getKnowledgeBase(id)` / `getGlobalKnowledgeBases()` — global registry
- `await registerKnowledgeBase(id, kb)` / `getKnowledgeBase(id)` / `getGlobalKnowledgeBases()` - async global registry registration + lookup
