- Phase 1: Synthetic Data Engineering (Complete)
- Phase 2: Dataset Publication (Complete)
- Phase 3: QLoRA Fine-Tuning (In Progress)
To replace massive, high-latency System Prompts in Agent architectures with a specialized, fine-tuned Adapter.
This project builds the "Brain" of an Autonomous Coding Agent. Instead of relying on a generic LLM to guess how to use tools, we are fine-tuning Llama-3.1-8B to function as a deterministic Intent Router that converts natural language queries into strict, executable JSON tool calls with < 200ms latency.
The training dataset is now publicly available on HuggingFace:
tai-tai-sama/semantic-router-dataset
- 502 high-fidelity examples (451 train / 51 test)
- Generated by GPT-5.2, Gemini 2.5, Llama-3.3 ensemble
- 60-70% validation pass rate (only production-ready examples survive)
- Stratified by intent distribution and status types
- MIT Licensed for commercial use
We moved beyond simple scripting to a modular Data Factory approach. The pipeline generates high-diversity training examples (Domains + Personas + Edge Cases) while enforcing strict logic constraints through a 3-layer validation system.
```
.
├── generate_data.py      # Main entry point for the data factory
├── src/
│   ├── schemas.py        # Pydantic definitions for all tools
│   ├── prompts.py        # Advanced prompt logic (CoT, Anti-Hallucination)
│   ├── generator.py      # Multi-model generation logic
│   ├── client.py         # Instructor client setup
│   ├── formatting.py     # ChatML Formatter: JSON → Llama-3 Training Format
│   ├── upload_data.py    # HuggingFace Dataset Uploader
│   └── utils.py          # Validation & Null-Tax removal
├── data/
│   ├── raw/              # JSONL files (User → Thought → Tool)
│   └── formatted/        # ChatML formatted training data
└── notebooks/            # Colab notebooks for Unsloth training
```
This is not just random synthetic data. The pipeline implements Logic-Aware Generation to prevent common agent failures:
- Constraint: The generator distinguishes between Discovery (Search) and Action (File Ops).
- Result: The agent never attempts to read a file (`file_manager`) unless the user explicitly provides the path or context.
- Constraint: Output schemas use `exclude_none=True`.
- Result: We save ~20% token usage per example by stripping empty fields (e.g., `content: null` during read operations), resulting in faster inference.
- Constraint: Enforced minimum 6-word reasoning and explicit tool-selection logic.
- Result: The model doesn't just parrot the user; it explains why it chose a tool (e.g., "User is asking for a concept, so I must use semantic search, not exact match").
- Strategy: We rotate between GPT-5.2 (high fidelity), Gemini 2.5 Pro (advanced reasoning), Gemini 2.5 Flash (speed), and Llama-3.3-70B (diversity) to prevent model collapse and distribute API rate-limit pressure.
- Result: Linguistic diversity across 40+ domains, 35+ personas, and 70+ query styles.
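The "null tax" removal can be illustrated with a minimal, dependency-free sketch. In the actual pipeline this is handled by Pydantic's `exclude_none=True`; the example payload and the exact savings below are illustrative only:

```python
import json

def strip_nulls(obj):
    """Recursively drop keys whose value is None ("null tax" removal)."""
    if isinstance(obj, dict):
        return {k: strip_nulls(v) for k, v in obj.items() if v is not None}
    if isinstance(obj, list):
        return [strip_nulls(v) for v in obj]
    return obj

example = {
    "status": "running",
    "thought": "User wants a file read, so file_manager is the right tool.",
    "tool_use": {
        "tool_name": "file_manager",
        "arguments": {"action": "read", "path": "src/app.py", "content": None},
    },
    "final_answer": None,
}

full = json.dumps(example)
lean = json.dumps(strip_nulls(example))
print(f"{len(full)} -> {len(lean)} chars ({1 - len(lean) / len(full):.0%} smaller)")
```

Fewer serialized characters means fewer tokens per training example and, downstream, faster inference.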
| Layer | Type | Purpose |
|---|---|---|
| 1. Structural | Pydantic Schema | JSON well-formedness, required keys, type correctness |
| 2. Quality | Heuristic Analysis | Anti-parroting, substantive reasoning (>6 words), specific outputs |
| 3. Domain Logic | Safety & Semantics | Unsafe code detection (e.g., rm -rf), content validation |
Only 60-70% of generated examples survive all gates.
The fine-tuned model is trained to route requests to these four deterministic tools:
| Tool | Capability | Logic Constraint |
|---|---|---|
| `codebase_search` | RAG / Semantic Search | Must choose exact vs. semantic vs. hybrid mode based on query type |
| `file_manager` | Read / Write / Patch / List | Requires explicit paths. No guessing. Strict validation rules |
| `sandbox_exec` | Python Interpreter | For calculation, verification, or logic testing only. 30s timeout |
| `ask_human` | Human-in-the-Loop | Triggered by ambiguity or high-risk actions (e.g., DB deletion) |
The model produces a flattened discriminated union with two variants:
Type A: Tool Invocation (`status="running"`)

```json
{
  "status": "running",
  "thought": "User needs to find authentication logic. Semantic search is appropriate.",
  "tool_use": {
    "tool_name": "codebase_search",
    "arguments": {
      "query": "authentication middleware JWT validation",
      "mode": "semantic"
    }
  },
  "final_answer": null
}
```

Type B: Direct Answer (`status="complete"`)
```json
{
  "status": "complete",
  "thought": null,
  "tool_use": null,
  "final_answer": "The server runs on port 8000 by default. Override with the PORT environment variable."
}
```

This project uses `uv` for modern, fast Python dependency management.
Create a .env file with your API keys:
```
GROQ_API_KEY=gsk_...
GEMINI_API_KEY=AIza...
HF_TOKEN=hf_...
```

Run the factory to create `data/raw/router_train.jsonl`:
```bash
uv run generate_data.py
```

Note: Check `src/config.py` to adjust batch size, domains, and personas.
Convert the raw JSONL into ChatML format:
```bash
uv run src/formatting.py
```

Version control your dataset:

```bash
uv run src/upload_data.py
```

Use the provided Colab notebook in `notebooks/` or follow the training guide in the dataset card.
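The raw-to-ChatML conversion performed by `src/formatting.py` might look roughly like the following sketch. The raw-record shape and system prompt are assumptions; the target is the Llama-3 chat template with its standard special tokens:

```python
import json

SYSTEM_PROMPT = "You are an intent router. Reply with a single JSON tool call."  # assumed

def to_llama3_chat(record: dict) -> str:
    """Render one raw example (user query + router output) as Llama-3 chat text."""
    # Strip null fields ("null tax") before serializing the assistant turn.
    assistant_payload = json.dumps(
        {k: v for k, v in record["output"].items() if v is not None}
    )
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{SYSTEM_PROMPT}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{record['user']}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n{assistant_payload}<|eot_id|>"
    )
```

Each rendered string becomes one line of the formatted training file consumed by the Unsloth notebook.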
- Total Examples: 502 (451 train / 51 test)
- Intent Distribution:
- 35% Search operations
- 24% Compute/execution tasks
- 18% File modifications
- 15% Direct answers
- 8% Human escalations
- Diversity Metrics:
- 40+ domains (E-commerce, Healthcare, Fintech, ML, Gaming, etc.)
- 35+ personas (SRE, CTO, QA, Data Scientist, Junior Dev, etc.)
- 70+ query styles (Fragmented, narrative, urgent, code-mixed, etc.)
Single-model synthetic data suffers from mode collapse and stylistic uniformity. By rotating between GPT-5.2, Gemini 2.5, and Llama-3.3, we achieved:
- Diverse linguistic patterns
- Robust generalization across domains
- API rate limit distribution
Common synthetic datasets accept malformed outputs that hurt model performance. Our 3-layer validation ensures:
- Only production-ready examples enter training
- Unsafe operations are filtered
- Generic or parroted responses are rejected
The status field acts as a type discriminator, allowing the model to learn:
- When to invoke tools vs. answer directly
- Proper null handling (avoiding the "null tax")
- Clean separation between reasoning and action
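The invariants implied by the `status` discriminator can be expressed as a small validator. This is a sketch mirroring the two example payloads above; the project's `schemas.py` presumably encodes the same rules in Pydantic:

```python
def check_invariants(output: dict) -> bool:
    """status='running' pairs with thought+tool_use; status='complete' with final_answer."""
    status = output.get("status")
    if status == "running":
        return (output.get("thought") is not None
                and output.get("tool_use") is not None
                and output.get("final_answer") is None)
    if status == "complete":
        return (output.get("thought") is None
                and output.get("tool_use") is None
                and output.get("final_answer") is not None)
    return False  # unknown discriminator value
```

Training only on examples that satisfy these invariants is what lets the model learn the tool-vs-answer decision as a clean binary.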
Check the docs/ folder for research notes on:
- LoRA / PEFT: Why we freeze base weights and only train adapters
- Unsloth: Optimization techniques for 2x faster training with lower memory
- Data Hygiene: Schema validation rules and quality filtering
- Prompt Engineering: Meta-prompting architecture for generation
- Phase 1: Synthetic Data Engineering
- Phase 2: Dataset Publication on HuggingFace
- Phase 3: QLoRA Fine-Tuning with Unsloth
- Phase 4: Evaluation Suite (Accuracy, Latency, Safety)
- Phase 5: Production Deployment & Benchmarking
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Unsloth for efficient fine-tuning infrastructure
- Instructor for structured output generation
- Anthropic, OpenAI, Google for frontier model access
- HuggingFace for dataset hosting
Riya Sangwan - @ria-19
Dataset: tai-tai-sama/semantic-router-dataset
"Synthetic data is only as good as the validation pipeline that produces it."