8 changes: 4 additions & 4 deletions .claude/CLAUDE.md
@@ -118,7 +118,7 @@ Required static properties: `type`, `category`, `title`, `description`, `cacheab
- `runReactive()` → `executeReactive()` — lightweight, UI previews only, keeps PENDING, must be <1ms
- Lifecycle: `PENDING → PROCESSING → COMPLETED | FAILED | ABORTED`

-**Schema conventions**: JSON Schema objects. Properties can have `format` annotations for runtime type resolution: `format: "model"`, `format: "model:EmbeddingTask"`, `format: "storage:tabular"`, `format: "dataset:knowledge-base"`. Properties with `x-ui-manual: true` are user-added ports.
+**Schema conventions**: JSON Schema objects. Properties can have `format` annotations for runtime type resolution: `format: "model"`, `format: "model:EmbeddingTask"`, `format: "storage:tabular"`, `format: "knowledge-base"`. Properties with `x-ui-manual: true` are user-added ports.

**TaskRegistry** — global class registry: `TaskRegistry.registerTask(MyTask)`.

@@ -132,13 +132,13 @@ Event-driven: storages emit `put`, `get`, `delete`, `deleteAll`.

Auto-generated PKs: `x-auto-generated: true` in schema — integers auto-increment, strings get UUID.
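The convention above can be sketched as follows (illustrative TypeScript assuming the described behavior, not the package's actual storage code):

```typescript
// Sketch of the auto-generated primary-key convention: when a schema sets
// `x-auto-generated: true`, integer PKs auto-increment and string PKs get a UUID.
import { randomUUID } from "node:crypto";

interface PKColumnSchema {
  type: "integer" | "string";
  "x-auto-generated"?: boolean;
}

function nextPrimaryKey(schema: PKColumnSchema, lastInt = 0): number | string {
  if (!schema["x-auto-generated"]) {
    throw new Error("primary key is not auto-generated");
  }
  // Integers auto-increment; strings receive a UUID.
  return schema.type === "integer" ? lastInt + 1 : randomUUID();
}
```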

-### `@workglow/dataset` — knowledge base & documents
+### `@workglow/knowledge-base` — knowledge base & documents

-`KnowledgeBase` — unified class owning both document storage (tabular) and chunk storage (vector). Replaces the old `DocumentDataset` + `DocumentChunkDataset` split.
+`KnowledgeBase` — unified class owning both document storage (tabular) and chunk storage (vector).

- `createKnowledgeBase({ name, vectorDimensions })` — factory (in-memory, auto-registers)
- `registerKnowledgeBase(id, kb)` / `getKnowledgeBase(id)` / `getGlobalKnowledgeBases()` — global registry
-- `TypeKnowledgeBase()` — JSON Schema helper for task inputs (format `"dataset:knowledge-base"`)
+- `TypeKnowledgeBase()` — JSON Schema helper for task inputs (format `"knowledge-base"`)
- `Document` — wraps a `DocumentRootNode` tree + metadata
- `ChunkRecord` — flat chunk with tree linkage (`nodePath`, `depth`)
- `ChunkVectorStorageSchema` / `ChunkVectorPrimaryKey` — vector storage schema for chunks
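The pieces above fit together roughly like this (a hypothetical sketch — the exact shape `TypeKnowledgeBase()` returns is an assumption; only the `"knowledge-base"` format string comes from the list above):

```typescript
// Hypothetical sketch: TypeKnowledgeBase() is assumed to emit a JSON Schema
// fragment carrying the "knowledge-base" format annotation, so the input
// resolver can look the instance up by id in the global registry.
const TypeKnowledgeBase = () => ({
  type: "string",
  format: "knowledge-base",
});

// A task input schema using the helper alongside plain properties.
const inputSchema = {
  type: "object",
  properties: {
    knowledgeBase: TypeKnowledgeBase(),
    query: { type: "string" },
  },
  required: ["knowledgeBase", "query"],
};
```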
10 changes: 1 addition & 9 deletions TODO.md
@@ -6,13 +6,9 @@ TODO.md
- [x] No fixed column names, use the schema to define the columns.
- [ ] Option for which column to use if there are multiple, default to the first one.
- [ ] Use @mceachen/sqlite-vec for sqlite storage.
-- [ ] Datasets Package
-- [x] Documents dataset (mabye rename to DocumentDataset)
-- [ ] Chunks Package (or part of DocumentDataset?)
-- [x] Move Model repository to datasets package.
-- [x] Chunks and nodes are not always the same.
-- [x] And we may need to save the chunk's node path. Or paths? or document range? Standard metadata?
-- [ ] Instead of passing doc_id around, pass a document key that is of type unknown (string or object)
+- [ ] Instead of passing doc_id around, pass a document key that is of type unknown (string or object)

- [ ] Get a better model for question answering.
- [ ] Get a better model for named entity recognition, the current one recognized everything as a token, not helpful.
@@ -27,10 +23,6 @@ TODO.md

- [ ] Consider different ways to connect tasks to queues. What is a task? What is a job?

-- [ ] Input and outputs are all scalar, arrays, or unions. But what about streams? Stream of items in an array, stream of content for a scalar like a string, etc.

-onnx-community/ModernBERT-finetuned-squad-ONNX - summarization

Rework the Document Dataset. Currently there is a Document storage of tabular storage type, and that should be registered as a "dataset:document:source" meaning the source material in node format. And there is already a "dataset:document-chunk" for the chunk/vector storage which should be registered as a "dataset:document:chunk" with a well defined metadata schema. The two combined should be registered as a "dataset:document" which is the complete document with its source and all its chunks and metadata. This is for convenience but not used by tasks or ai tasks.

The sqlitevectorstorage currently does not use a built in vector search. Use @mceachen/sqlite-vec for sqlite storage vector indexing.
40 changes: 20 additions & 20 deletions bun.lock

Some generated files are not rendered by default.

28 changes: 20 additions & 8 deletions docs/developers/03_extending.md
@@ -7,9 +7,20 @@ This document covers how to write your own tasks. For a more practical guide to
- [Define Inputs and Outputs](#define-inputs-and-outputs)
- [Register the Task](#register-the-task)
- [Schema Format Annotations](#schema-format-annotations)
- [Built-in Format Annotations](#built-in-format-annotations)
- [Example: Using Format Annotations](#example-using-format-annotations)
- [Creating Custom Format Resolvers](#creating-custom-format-resolvers)
- [Job Queues and LLM tasks](#job-queues-and-llm-tasks)
- [Write a new Compound Task](#write-a-new-compound-task)
- [Reactive Task UIs](#reactive-task-uis)
- [AI and RAG Tasks](#ai-and-rag-tasks)
- [Document Processing Tasks](#document-processing-tasks)
- [Vector and Embedding Tasks](#vector-and-embedding-tasks)
- [Retrieval and Generation Tasks](#retrieval-and-generation-tasks)
- [Chainable RAG Pipeline Example](#chainable-rag-pipeline-example)
- [Retrieval Pipeline Example](#retrieval-pipeline-example)
- [Hierarchical Document Structure](#hierarchical-document-structure)
- [Task Data Flow](#task-data-flow)

## Write a new Task

@@ -138,13 +149,14 @@ When defining task input schemas, you can use `format` annotations to enable aut

The system supports several format annotations out of the box:

-| Format | Description | Helper Function |
-| ------------------------------ | ----------------------------------- | ----------------------------- |
-| `model` | Any AI model configuration | `TypeModel()` |
-| `model:TaskName` | Model compatible with specific task | — |
-| `storage:tabular` | Tabular data dataset | `TypeTabularStorage()` |
-| `dataset:document-node-vector` | Vector storage dataset | `TypeChunkVectorRepository()` |
-| `dataset:document` | Document dataset | `TypeDocumentRepository()` |
+| Format | Description | Helper Function |
+| --------------------------------- | ----------------------------------- | ----------------------------- |
+| `model` | Any AI model configuration | `TypeModel()` |
+| `model:TaskName` | Model compatible with specific task | — |
+| `storage:tabular` | Tabular data storage | `TypeTabularStorage()` |
+| `knowledge-base` | Knowledge base instance | `TypeKnowledgeBase()` |
+| `credential` | Credential from credential store | — |
+| `tasks` | Task class from task registry | — |

### Example: Using Format Annotations

@@ -290,7 +302,7 @@ Tasks chain together through compatible input/output schemas:

```typescript
import { Workflow } from "@workglow/task-graph";
-import { createKnowledgeBase } from "@workglow/dataset";
+import { createKnowledgeBase } from "@workglow/knowledge-base";

// Create a KnowledgeBase (auto-registers globally as "my-kb")
const kb = await createKnowledgeBase({
2 changes: 1 addition & 1 deletion examples/web/package.json
@@ -16,7 +16,7 @@
"@codemirror/lang-json": "^6.0.2",
"@workglow/ai": "workspace:*",
"@workglow/debug": "workspace:*",
-"@workglow/dataset": "workspace:*",
+"@workglow/knowledge-base": "workspace:*",
"@workglow/ai-provider": "workspace:*",
"@workglow/storage": "workspace:*",
"@workglow/job-queue": "workspace:*",
2 changes: 1 addition & 1 deletion examples/web/vite.config.js
@@ -26,7 +26,7 @@ export default defineConfig({
"@workglow/storage",
"@workglow/task-graph",
"@workglow/debug",
-"@workglow/dataset",
+"@workglow/knowledge-base",
"@workglow/tasks",
"@workglow/util",
"@workglow/sqlite",
22 changes: 14 additions & 8 deletions packages/ai/README.md
@@ -445,14 +445,19 @@ The AI package provides a comprehensive set of tasks for building RAG pipelines.

```typescript
import { Workflow } from "@workglow/task-graph";
-import { createKnowledgeBase } from "@workglow/dataset";
+import { createKnowledgeBase } from "@workglow/knowledge-base";

// Create a KnowledgeBase (auto-registers globally as "my-kb")
const kb = await createKnowledgeBase({
name: "my-kb",
vectorDimensions: 384, // must match your embedding model
});

+// Or use shared-table mode for multi-tenant scenarios — see @workglow/knowledge-base docs
+// const scopedDocs = new ScopedTabularStorage(sharedDocStorage, "my-kb");
+// const scopedChunks = new ScopedVectorStorage(sharedChunkStorage, "my-kb");
+// const kb = new KnowledgeBase("my-kb", scopedDocs, scopedChunks);

// Document ingestion - fully chainable, no loops required
await new Workflow()
.structuralParser({
@@ -596,13 +601,14 @@ This resolution is handled by the input resolver system, which inspects schema `

### Supported Format Annotations

-| Format | Description | Resolver |
-| --------------------------------- | ---------------------------------------- | -------------------------- |
-| `model` | Any AI model configuration | ModelRepository |
-| `model:TaskName` | Model compatible with specific task type | ModelRepository |
-| `repository:tabular` | Tabular data repository | TabularStorageRegistry |
-| `repository:document-node-vector` | Vector storage repository | VectorRepositoryRegistry |
-| `repository:document` | Document repository | DocumentRepositoryRegistry |
+| Format | Description | Resolver |
+| ------------------ | ---------------------------------------- | -------------------------- |
+| `model` | Any AI model configuration | ModelRepository |
+| `model:TaskName` | Model compatible with specific task type | ModelRepository |
+| `storage:tabular` | Tabular data storage | TabularStorageRegistry |
+| `knowledge-base` | Knowledge base instance | KnowledgeBaseRegistry |
+| `credential` | Credential from credential store | CredentialStoreRegistry |
+| `tasks` | Task class from task registry | TaskRegistry |
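A resolver dispatching on these format strings might first split them into a base plus an optional qualifier, mirroring the `model:TaskName` row above. A sketch of that assumed logic (for illustration only, not the package's actual implementation):

```typescript
// Assumed dispatch helper: split a format annotation into its base and an
// optional qualifier, e.g. "model:TextEmbeddingTask" → base "model",
// qualifier "TextEmbeddingTask"; the base then selects a registry.
function parseFormat(format: string): { base: string; qualifier?: string } {
  const colon = format.indexOf(":");
  if (colon === -1) return { base: format };
  return { base: format.slice(0, colon), qualifier: format.slice(colon + 1) };
}
```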

### Custom Model Validation

6 changes: 3 additions & 3 deletions packages/ai/package.json
@@ -37,14 +37,14 @@
"access": "public"
},
"peerDependencies": {
-"@workglow/dataset": "workspace:*",
+"@workglow/knowledge-base": "workspace:*",
"@workglow/job-queue": "workspace:*",
"@workglow/storage": "workspace:*",
"@workglow/task-graph": "workspace:*",
"@workglow/util": "workspace:*"
},
"peerDependenciesMeta": {
-"@workglow/dataset": {
+"@workglow/knowledge-base": {
"optional": false
},
"@workglow/job-queue": {
@@ -61,7 +61,7 @@
}
},
"devDependencies": {
-"@workglow/dataset": "workspace:*",
+"@workglow/knowledge-base": "workspace:*",
"@workglow/job-queue": "workspace:*",
"@workglow/storage": "workspace:*",
"@workglow/task-graph": "workspace:*",
4 changes: 2 additions & 2 deletions packages/ai/src/task/ChunkRetrievalTask.ts
@@ -4,7 +4,7 @@
* SPDX-License-Identifier: Apache-2.0
*/

-import { KnowledgeBase, TypeKnowledgeBase, type ChunkRecord } from "@workglow/dataset";
+import { KnowledgeBase, TypeKnowledgeBase, type ChunkRecord } from "@workglow/knowledge-base";
import {
CreateWorkflow,
IExecuteContext,
@@ -22,7 +22,7 @@ import {
} from "@workglow/util";
import { TypeModel, TypeSingleOrArray } from "./base/AiTaskSchemas";
import { TextEmbeddingTask } from "./TextEmbeddingTask";
-import type { ChunkSearchResult } from "@workglow/dataset";
+import type { ChunkSearchResult } from "@workglow/knowledge-base";

const inputSchema = {
type: "object",
2 changes: 1 addition & 1 deletion packages/ai/src/task/ChunkToVectorTask.ts
@@ -4,7 +4,7 @@
* SPDX-License-Identifier: Apache-2.0
*/

-import { ChunkRecordSchema, type ChunkRecord } from "@workglow/dataset";
+import { ChunkRecordSchema, type ChunkRecord } from "@workglow/knowledge-base";
import {
CreateWorkflow,
IExecuteContext,
2 changes: 1 addition & 1 deletion packages/ai/src/task/ChunkVectorHybridSearchTask.ts
@@ -4,7 +4,7 @@
* SPDX-License-Identifier: Apache-2.0
*/

-import { KnowledgeBase, TypeKnowledgeBase, type ChunkRecord } from "@workglow/dataset";
+import { KnowledgeBase, TypeKnowledgeBase, type ChunkRecord } from "@workglow/knowledge-base";
import {
CreateWorkflow,
IExecuteContext,
4 changes: 2 additions & 2 deletions packages/ai/src/task/ChunkVectorSearchTask.ts
@@ -4,7 +4,7 @@
* SPDX-License-Identifier: Apache-2.0
*/

-import { KnowledgeBase, TypeKnowledgeBase } from "@workglow/dataset";
+import { KnowledgeBase, TypeKnowledgeBase } from "@workglow/knowledge-base";
import {
CreateWorkflow,
IExecuteContext,
@@ -127,7 +127,7 @@ export class ChunkVectorSearchTask extends Task<

async execute(
input: VectorStoreSearchTaskInput,
-context: IExecuteContext
+_context: IExecuteContext
): Promise<VectorStoreSearchTaskOutput> {
const { knowledgeBase, query, topK = 10, filter, scoreThreshold = 0 } = input;

2 changes: 1 addition & 1 deletion packages/ai/src/task/ChunkVectorUpsertTask.ts
@@ -4,7 +4,7 @@
* SPDX-License-Identifier: Apache-2.0
*/

-import { KnowledgeBase, TypeKnowledgeBase } from "@workglow/dataset";
+import { KnowledgeBase, TypeKnowledgeBase } from "@workglow/knowledge-base";
import {
CreateWorkflow,
IExecuteContext,
2 changes: 1 addition & 1 deletion packages/ai/src/task/ContextBuilderTask.ts
@@ -4,7 +4,7 @@
* SPDX-License-Identifier: Apache-2.0
*/

-import { estimateTokens } from "@workglow/dataset";
+import { estimateTokens } from "@workglow/knowledge-base";
import {
CreateWorkflow,
IExecuteReactiveContext,
2 changes: 1 addition & 1 deletion packages/ai/src/task/DocumentEnricherTask.ts
@@ -10,7 +10,7 @@ import {
type DocumentNode,
type Entity,
type NodeEnrichment,
-} from "@workglow/dataset";
+} from "@workglow/knowledge-base";
import {
CreateWorkflow,
IExecuteContext,
2 changes: 1 addition & 1 deletion packages/ai/src/task/HierarchicalChunkerTask.ts
@@ -12,7 +12,7 @@ import {
type ChunkRecord,
type DocumentNode,
type TokenBudget,
-} from "@workglow/dataset";
+} from "@workglow/knowledge-base";
import {
CreateWorkflow,
IExecuteContext,