…switch from the name datasets
- Replaced the deprecated `@workglow/dataset` with the new `@workglow/knowledge-base` package across various modules, including updates to package.json, README, and task schemas.
- Consolidated document and chunk management under the new KnowledgeBase architecture, enhancing the handling of documents and chunks in RAG workflows.
- Updated import paths and dependencies in multiple files to reflect the transition to the new package structure.
- Removed the old dataset package and its related files, streamlining the codebase.
…ADME imports and description (#280)
- Added support for shared-table mode in the KnowledgeBase, allowing multiple knowledge bases to share the same underlying storage tables, reducing table proliferation.
- Implemented `ScopedTabularStorage` and `ScopedVectorStorage` wrappers to manage data partitioning by `kb_id`.
- Updated documentation to include new shared-table features and examples for setting up shared storage.
- Enhanced format annotations and task schemas to accommodate the new storage structure.
Pull request overview
This PR renames the former @workglow/dataset package to @workglow/knowledge-base, updates imports/docs throughout the monorepo, and extends the knowledge base system with persistent registry metadata plus shared-table multi-tenant storage helpers.
Changes:
- Replace `@workglow/dataset` dependency/exports/usages with `@workglow/knowledge-base` across packages, examples, and docs.
- Make `registerKnowledgeBase()` asynchronous and persist KB metadata via a new `KnowledgeBaseRepository`.
- Add shared-table mode support via `Shared*` schemas and `ScopedTabularStorage`/`ScopedVectorStorage` wrappers.
Reviewed changes
Copilot reviewed 54 out of 69 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| packages/workglow/src/common.ts | Re-export knowledge-base package instead of dataset. |
| packages/workglow/package.json | Dependency rename to @workglow/knowledge-base. |
| packages/workglow/README.md | Package list updated to knowledge-base. |
| packages/test/src/test/util/Document.test.ts | Test imports switched to knowledge-base. |
| packages/test/src/test/task-graph/InputResolver.test.ts | Test imports switched to knowledge-base. |
| packages/test/src/test/rag/StructuralParser.test.ts | Test imports switched to knowledge-base. |
| packages/test/src/test/rag/RagWorkflow.integration.test.ts | Test imports switched to knowledge-base. |
| packages/test/src/test/rag/HybridSearchTask.test.ts | Import rename + await registerKnowledgeBase. |
| packages/test/src/test/rag/HierarchicalChunker.test.ts | Test imports switched to knowledge-base. |
| packages/test/src/test/rag/FullChain.test.ts | Test imports switched to knowledge-base. |
| packages/test/src/test/rag/EndToEnd.integration.test.ts | Test imports switched to knowledge-base. |
| packages/test/src/test/rag/DocumentRepository.test.ts | Test imports switched to knowledge-base. |
| packages/test/src/test/rag/DocumentChunkUpsertTask.test.ts | Import rename + await registerKnowledgeBase. |
| packages/test/src/test/rag/DocumentChunkSearchTask.test.ts | Import rename + await registerKnowledgeBase. |
| packages/test/src/test/rag/DocumentChunkRetrievalTask.test.ts | Import rename + await registerKnowledgeBase. |
| packages/test/src/test/rag/Document.test.ts | Test imports switched to knowledge-base. |
| packages/test/src/test/rag/ChunkToVector.test.ts | Test imports switched to knowledge-base. |
| packages/test/package.json | Peer/dev deps renamed to knowledge-base. |
| packages/task-graph/src/task/InputResolver.ts | Removes outdated resolver-format comment. |
| packages/storage/src/vector/README.md | Docs updated to reference KnowledgeBase integration. |
| packages/storage/README.md | Removes outdated repository schema examples. |
| packages/knowledge-base/tsconfig.json | New tsconfig for the renamed package. |
| packages/knowledge-base/src/util/DatasetSchema.ts | Updates TypeKnowledgeBase format to knowledge-base. |
| packages/knowledge-base/src/types.ts | New entrypoint re-export. |
| packages/knowledge-base/src/node.ts | New entrypoint re-export. |
| packages/knowledge-base/src/knowledge-base/createKnowledgeBase.ts | Adds title/description options + awaits KB registration. |
| packages/knowledge-base/src/knowledge-base/SharedTableSchemas.ts | Adds shared-table schemas + index definitions. |
| packages/knowledge-base/src/knowledge-base/ScopedVectorStorage.ts | Adds KB-scoped wrapper for shared vector storage. |
| packages/knowledge-base/src/knowledge-base/ScopedTabularStorage.ts | Adds KB-scoped wrapper for shared tabular storage. |
| packages/knowledge-base/src/knowledge-base/KnowledgeBaseSchema.ts | Adds KB metadata schema + shared-table helpers + table naming. |
| packages/knowledge-base/src/knowledge-base/KnowledgeBaseRepository.ts | New repository abstraction for KB metadata persistence. |
| packages/knowledge-base/src/knowledge-base/KnowledgeBaseRegistry.ts | Async KB registration + persistent record creation + input resolver. |
| packages/knowledge-base/src/knowledge-base/KnowledgeBase.ts | Adds title/description fields to KB instances. |
| packages/knowledge-base/src/knowledge-base/InMemoryKnowledgeBaseRepository.ts | In-memory implementation of KB metadata repository. |
| packages/knowledge-base/src/document/StructuralParser.ts | New StructuralParser implementation in knowledge-base package. |
| packages/knowledge-base/src/document/DocumentStorageSchema.ts | New document tabular schema/types. |
| packages/knowledge-base/src/document/DocumentSchema.ts | New document node schemas/types. |
| packages/knowledge-base/src/document/DocumentNode.ts | New document-tree helpers (tokens, traversal, ranges). |
| packages/knowledge-base/src/document/Document.ts | New Document class implementation. |
| packages/knowledge-base/src/common.ts | Expands exports (repository, schemas, scoped/shared helpers). |
| packages/knowledge-base/src/common-server.ts | New server entrypoint re-export. |
| packages/knowledge-base/src/chunk/ChunkVectorStorageSchema.ts | New chunk vector storage schema/types. |
| packages/knowledge-base/src/chunk/ChunkSchema.ts | New unified ChunkRecord schema/type. |
| packages/knowledge-base/src/bun.ts | New Bun entrypoint. |
| packages/knowledge-base/src/browser.ts | New browser entrypoint. |
| packages/knowledge-base/package.json | Renames package to @workglow/knowledge-base. |
| packages/knowledge-base/README.md | Docs updated + adds shared-table mode documentation. |
| packages/knowledge-base/LICENSE | Adds Apache 2.0 license file to the package. |
| packages/knowledge-base/CHANGELOG.md | Renames changelog header to knowledge-base. |
| packages/dataset/src/knowledge-base/KnowledgeBaseRegistry.ts | Removes old dataset KB registry implementation. |
| packages/ai/tsconfig.json | Removes baseUrl configuration. |
| packages/ai/src/task/StructuralParserTask.ts | Imports switched to knowledge-base. |
| packages/ai/src/task/HierarchyJoinTask.ts | Imports switched to knowledge-base. |
| packages/ai/src/task/HierarchicalChunkerTask.ts | Imports switched to knowledge-base. |
| packages/ai/src/task/DocumentEnricherTask.ts | Imports switched to knowledge-base. |
| packages/ai/src/task/ContextBuilderTask.ts | Imports switched to knowledge-base. |
| packages/ai/src/task/ChunkVectorUpsertTask.ts | Imports switched to knowledge-base. |
| packages/ai/src/task/ChunkVectorSearchTask.ts | Imports switched + unused context renamed to _context. |
| packages/ai/src/task/ChunkVectorHybridSearchTask.ts | Imports switched to knowledge-base. |
| packages/ai/src/task/ChunkToVectorTask.ts | Imports switched to knowledge-base. |
| packages/ai/src/task/ChunkRetrievalTask.ts | Imports switched to knowledge-base. |
| packages/ai/package.json | Peer/dev deps renamed to knowledge-base. |
| packages/ai/README.md | Docs updated for knowledge-base + supported format list updated. |
| examples/web/vite.config.js | Workspace package allowlist updated to knowledge-base. |
| examples/web/package.json | Dependency renamed to knowledge-base. |
| docs/developers/03_extending.md | Docs updated to include knowledge-base format annotation and examples. |
| bun.lock | Updates workspace package references from dataset to knowledge-base. |
| TODO.md | Removes/updates dataset-related TODO entries. |
| .claude/CLAUDE.md | Updates schema conventions and package naming references. |
Comments suppressed due to low confidence (3)
packages/knowledge-base/README.md:433
- `registerKnowledgeBase` is now async, but the README examples still call it without `await`. This can lead to flaky examples/tests (the KB may not be persisted/available before subsequent reads). Update the snippet to `await registerKnowledgeBase(...)` (and make the surrounding example `async`) to match the new API.
packages/knowledge-base/README.md:545
- Shared-table mode docs call `registerKnowledgeBase(...)` without `await`, but `registerKnowledgeBase` is async in this PR. The example should `await` these calls (and be inside an async function) to ensure the KB record is persisted before any follow-up operations.
packages/knowledge-base/package.json:6
- `package.json` was renamed to `@workglow/knowledge-base`, but the `description` still says "Dataset package for Workglow.". Update the description to reflect the new package name/purpose to avoid confusing consumers and generated docs.
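
The two README comments above come down to ordering: a lookup that runs before an un-awaited registration settles can miss the entry. The sketch below illustrates that with a stand-in registry. `KnowledgeBaseStub`, `kbRegistry`, and the bodies of `registerKnowledgeBase`/`getKnowledgeBase` are hypothetical placeholders written for this illustration, not the `@workglow/knowledge-base` implementation.

```typescript
// Hypothetical stand-in types and registry; NOT the real @workglow/knowledge-base code.
type KnowledgeBaseStub = { label: string };

const kbRegistry = new Map<string, KnowledgeBaseStub>();

// Assumed behavior: the KB only becomes visible after an async persistence step.
async function registerKnowledgeBase(id: string, kb: KnowledgeBaseStub): Promise<void> {
  await Promise.resolve(); // stands in for the repository write
  kbRegistry.set(id, kb);
}

function getKnowledgeBase(id: string): KnowledgeBaseStub | undefined {
  return kbRegistry.get(id);
}

async function registrationDemo(): Promise<[boolean, boolean]> {
  // Without await, the lookup runs before the registration has completed.
  const pending = registerKnowledgeBase("docs", { label: "docs" });
  const visibleBeforeAwait = getKnowledgeBase("docs") !== undefined;

  // After awaiting, the entry is guaranteed to be present.
  await pending;
  const visibleAfterAwait = getKnowledgeBase("docs") !== undefined;

  return [visibleBeforeAwait, visibleAfterAwait];
}
```

In README snippets the fix is the one the comments ask for: wrap the example in an `async` function and `await` each `registerKnowledgeBase(...)` call before reading.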

    /**
     * Wrapper implementing `ITabularStorage` that delegates to an inner shared
     * storage instance, injecting `kb_id` on writes and filtering by `kb_id` on
     * reads. The outer interface does not include `kb_id` — it is transparent to
     * the `KnowledgeBase` class.
     */

Shared-table mode introduces new scoping logic (ScopedTabularStorage / ScopedVectorStorage) but there are no accompanying tests validating isolation (e.g., writes inject kb_id, reads filter/strip, and deletes can’t affect other KBs). Adding targeted vitest coverage for these wrappers would help prevent multi-tenant data leaks/regressions.
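
As a rough picture of what such isolation tests would check, here is a minimal sketch of the inject-on-write, filter-and-strip-on-read behavior described in the doc comment. It is synchronous and in-memory for brevity, whereas the real storage interfaces are asynchronous; `DocRow`, `SharedDocTable`, and `ScopedDocTable` are hypothetical names, not classes from the PR.

```typescript
// Hypothetical minimal shapes; the real ITabularStorage interface is larger and async.
type DocRow = { doc_id: string; title: string };
type SharedDocRow = DocRow & { kb_id: string };

class SharedDocTable {
  readonly rows: SharedDocRow[] = [];
}

class ScopedDocTable {
  private readonly inner: SharedDocTable;
  private readonly kbId: string;

  constructor(inner: SharedDocTable, kbId: string) {
    this.inner = inner;
    this.kbId = kbId;
  }

  put(row: DocRow): void {
    // Inject the tenant key on write.
    this.inner.rows.push({ ...row, kb_id: this.kbId });
  }

  getAll(): DocRow[] {
    // Filter by tenant on read, then strip kb_id so callers never see it.
    return this.inner.rows
      .filter((r) => r.kb_id === this.kbId)
      .map(({ kb_id: _kbId, ...rest }) => rest);
  }
}

// Two scoped views over the same shared table, using the same doc_id.
const sharedDocs = new SharedDocTable();
const scopeA = new ScopedDocTable(sharedDocs, "kb-a");
const scopeB = new ScopedDocTable(sharedDocs, "kb-b");
scopeA.put({ doc_id: "1", title: "from A" });
scopeB.put({ doc_id: "1", title: "from B" });
```

A vitest suite for the real wrappers could make the same three assertions: each scope sees only its own rows, returned rows carry no `kb_id`, and the shared table holds both rows.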

    async delete(key: PrimaryKey | Entity): Promise<void> {
      return this.inner.delete(key as any);
    }

ScopedTabularStorage.delete() delegates directly to the inner storage without scoping to kb_id. In shared-table mode this can delete rows belonging to other knowledge bases when primary keys are not globally unique (or when an Entity object is passed). Consider implementing delete as a scoped delete (e.g., fetch+verify kb_id first, or call inner.deleteSearch({ ...primaryKey, kb_id: this.kbId })).
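
One way to read the second suggestion is sketched below: the scoped wrapper always adds its own `kb_id` to the delete criteria, so a row with the same key in another knowledge base cannot match. `StoredChunk`, `SharedChunkTable`, `ScopedChunkTable`, and this `deleteSearch` signature are assumptions made for the illustration; the actual storage API may differ.

```typescript
// Hypothetical in-memory stand-in for a shared table keyed by (kb_id, chunk_id).
type StoredChunk = { kb_id: string; chunk_id: string; text: string };

class SharedChunkTable {
  rows: StoredChunk[] = [];

  // Assumed shape: removes rows matching BOTH fields of the criteria.
  async deleteSearch(criteria: { kb_id: string; chunk_id: string }): Promise<void> {
    this.rows = this.rows.filter(
      (r) => !(r.kb_id === criteria.kb_id && r.chunk_id === criteria.chunk_id)
    );
  }
}

class ScopedChunkTable {
  private readonly inner: SharedChunkTable;
  private readonly kbId: string;

  constructor(inner: SharedChunkTable, kbId: string) {
    this.inner = inner;
    this.kbId = kbId;
  }

  async delete(key: { chunk_id: string }): Promise<void> {
    // Always scope the delete to this KB so another tenant's row cannot match.
    await this.inner.deleteSearch({ chunk_id: key.chunk_id, kb_id: this.kbId });
  }
}

async function scopedDeleteDemo(): Promise<StoredChunk[]> {
  const table = new SharedChunkTable();
  table.rows.push(
    { kb_id: "kb-a", chunk_id: "c1", text: "A's chunk" },
    { kb_id: "kb-b", chunk_id: "c1", text: "B's chunk" }
  );
  // Deleting "c1" through kb-a's view must leave kb-b's "c1" untouched.
  await new ScopedChunkTable(table, "kb-a").delete({ chunk_id: "c1" });
  return table.rows;
}
```

The fetch-and-verify alternative mentioned in the comment would take two round trips instead of one, which is why the criteria-based form is shown here.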

    async similaritySearch(
      query: TypedArray,
      options?: VectorSearchOptions<Metadata>
    ): Promise<(Entity & { score: number })[]> {
      const results = await this.inner.similaritySearch(query, {
        ...options,
        // Request extra results to account for post-filtering
        topK: options?.topK ? options.topK * 3 : undefined,
      } as any);

      const filtered = results
        .filter((r: any) => r.kb_id === this.kbId)
        .slice(0, options?.topK);

ScopedVectorStorage.similaritySearch() only requests extra results when options.topK is provided. When callers omit topK (letting the inner storage default to 10), post-filtering by kb_id can easily return far fewer than the expected default because the initial result set may be dominated by other KBs. Consider computing an explicit effective topK (e.g., (options?.topK ?? defaultTopK) * factor) and slicing to (options?.topK ?? defaultTopK) after filtering.
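
The effective-`topK` idea can be sketched as follows, assuming a default of 10 (as the comment states) and the same over-fetch factor of 3 as the snippet under review. `ScoredChunk`, `innerSearch`, and `scopedSearch` are hypothetical helpers operating on an in-memory array, not the PR's storage classes.

```typescript
type ScoredChunk = { kb_id: string; chunk_id: string; score: number };

const DEFAULT_TOP_K = 10; // assumed inner default, per the review comment
const OVERFETCH_FACTOR = 3; // same factor as the snippet under review

// Hypothetical inner search: returns the best `topK` rows across ALL knowledge bases.
function innerSearch(all: ScoredChunk[], topK: number): ScoredChunk[] {
  return [...all].sort((a, b) => b.score - a.score).slice(0, topK);
}

function scopedSearch(all: ScoredChunk[], kbId: string, topK?: number): ScoredChunk[] {
  // Resolve the caller's topK once, then use it for both over-fetching and slicing.
  const effectiveTopK = topK ?? DEFAULT_TOP_K;
  return innerSearch(all, effectiveTopK * OVERFETCH_FACTOR)
    .filter((r) => r.kb_id === kbId)
    .slice(0, effectiveTopK);
}
```

A fixed factor remains a heuristic: if other knowledge bases occupy more than `effectiveTopK * OVERFETCH_FACTOR` of the top-ranked rows, the scoped result can still come back short. Pushing the `kb_id` filter down into the inner query, where the backend supports it, would avoid that.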

      import { getChunkVectorRepository, getGlobalChunkVectorRepositories } from "@workglow/storage";
    - import { registerChunkVectorRepository, getGlobalChunkVectorRepositories } from "@workglow/dataset";
    + import { registerChunkVectorRepository, getGlobalChunkVectorRepositories } from "@workglow/knowledge-base";

This README example imports getGlobalChunkVectorRepositories from both @workglow/storage and @workglow/knowledge-base, which will be a duplicate identifier. Also, registerChunkVectorRepository / getChunkVectorRepository / getGlobalChunkVectorRepositories don’t appear to exist anywhere in the repo, so this snippet likely won’t compile as written—please update it to the actual registry API (or remove the global-registry section if vector storages aren’t globally resolvable).

    `KnowledgeBase` — unified class owning both document storage (tabular) and chunk storage (vector).

    - `createKnowledgeBase({ name, vectorDimensions })` — factory (in-memory, auto-registers)
    - `registerKnowledgeBase(id, kb)` / `getKnowledgeBase(id)` / `getGlobalKnowledgeBases()` — global registry

.claude/CLAUDE.md still documents registerKnowledgeBase(id, kb) as a synchronous call, but in this PR registerKnowledgeBase is async and should be awaited. Update the doc snippet/bullet to reflect the async API to avoid misleading contributors.
Suggested change:

    - - `registerKnowledgeBase(id, kb)` / `getKnowledgeBase(id)` / `getGlobalKnowledgeBases()` — global registry
    + - `await registerKnowledgeBase(id, kb)` / `getKnowledgeBase(id)` / `getGlobalKnowledgeBases()` — async global registry registration + lookup