ria-19/semantic_router

The Semantic Router (Fine-Tuned Agent Brain)


Phase 1: Synthetic Data Engineering (Complete)
Phase 2: Dataset Publication (Complete)
Phase 3: QLoRA Fine-Tuning (In Progress)

Mission

To replace massive, high-latency System Prompts in Agent architectures with a specialized, fine-tuned Adapter.

This project builds the "Brain" of an Autonomous Coding Agent. Instead of relying on a generic LLM to guess how to use tools, we are fine-tuning Llama-3.1-8B to function as a deterministic Intent Router that converts natural language queries into strict, executable JSON tool calls with < 200ms latency.

Dataset Release

The training dataset is now publicly available on HuggingFace:

tai-tai-sama/semantic-router-dataset

  • 502 high-fidelity examples (451 train / 51 test)
  • Generated by an ensemble of GPT-5.2, Gemini 2.5, and Llama-3.3
  • 60-70% validation pass rate (only production-ready examples survive)
  • Stratified by intent distribution and status types
  • MIT Licensed for commercial use

Architecture

We moved beyond simple scripting to a modular Data Factory approach. The pipeline generates high-diversity training examples (Domains + Personas + Edge Cases) while enforcing strict logic constraints through a 3-layer validation system.

The File Structure

```
.
├── generate_data.py       # Main entry point for the data factory
├── src/
│   ├── schemas.py         # Pydantic definitions for all tools
│   ├── prompts.py         # Advanced prompt logic (CoT, Anti-Hallucination)
│   ├── generator.py       # Multi-model generation logic
│   ├── client.py          # Instructor client setup
│   ├── formatting.py      # ChatML Formatter: JSON → Llama-3 Training Format
│   ├── upload_data.py     # HuggingFace Dataset Uploader
│   └── utils.py           # Validation & Null-Tax removal
├── data/
│   ├── raw/               # JSONL files (User → Thought → Tool)
│   └── formatted/         # ChatML formatted training data
└── notebooks/             # Colab notebooks for Unsloth training
```

The "Golden" Data Strategy

This is not just random synthetic data. The pipeline implements Logic-Aware Generation to prevent common agent failures:

1. Anti-Hallucination (The "Magic Path" Fix)

  • Constraint: The generator distinguishes between Discovery (Search) and Action (File Ops).
  • Result: The agent never attempts to read a file (file_manager) unless the user explicitly provides the path or context.
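As a sketch, this grounding rule can be expressed as a small check. The function name and argument shapes below are illustrative assumptions, not the project's actual validator:

```python
# Illustrative grounding check (hypothetical helper): only allow a
# file_manager call when the target path appears verbatim in the request.
def path_is_grounded(user_query: str, tool_name: str, arguments: dict) -> bool:
    """Reject 'magic path' hallucinations for file operations."""
    if tool_name != "file_manager":
        return True  # discovery tools (e.g., codebase_search) need no path grounding
    path = arguments.get("path", "")
    return bool(path) and path in user_query

# Grounded: the user named the file explicitly.
assert path_is_grounded("read src/utils.py for me",
                        "file_manager", {"operation": "read", "path": "src/utils.py"})
# Hallucinated: the user never mentioned auth.py, so discovery should come first.
assert not path_is_grounded("where is the login logic?",
                            "file_manager", {"operation": "read", "path": "auth.py"})
```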

2. Avoiding the "Null Tax"

  • Constraint: Output schemas use exclude_none=True.
  • Result: We save ~20% token usage per example by stripping empty fields (e.g., content: null during read operations), resulting in faster inference.
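The stripping behavior can be shown with a minimal plain-Python equivalent of Pydantic's `exclude_none=True` (the record below is a made-up example, not taken from the dataset):

```python
# Plain-Python equivalent of exclude_none=True: recursively drop None-valued
# fields before the record is serialized for training.
def strip_null_tax(value):
    if isinstance(value, dict):
        return {k: strip_null_tax(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_null_tax(v) for v in value]
    return value

# Hypothetical read operation: 'content' and 'final_answer' are unused,
# so they are stripped rather than serialized as nulls.
call = {
    "status": "running",
    "thought": "User asked to read a named file.",
    "tool_use": {
        "tool_name": "file_manager",
        "arguments": {"operation": "read", "path": "src/app.py", "content": None},
    },
    "final_answer": None,
}
lean = strip_null_tax(call)  # no 'content' or 'final_answer' keys remain
```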

3. Structured Chain-of-Thought (CoT)

  • Constraint: Enforced minimum 6-word reasoning and explicit tool selection logic.
  • Result: The model doesn't just parrot the user; it explains why it chose a tool (e.g., "User is asking for a concept, so I must use semantic search, not exact match").

4. "Model Roulette" Generation

  • Strategy: We rotate between GPT-5.2 (High-fidelity), Gemini 2.5 Pro (Advanced reasoning), Gemini 2.5 Flash (Speed), and Llama-3.3-70B (Diversity) to prevent "Model Collapse" and avoid API rate limits.
  • Result: Linguistic diversity across 40+ domains, 35+ personas, and 70+ query styles.
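A minimal sketch of the rotation strategy, assuming a simple round-robin over a hypothetical model pool (the real client setup lives in src/client.py):

```python
from itertools import cycle

# Hypothetical model identifiers; actual client names live in src/client.py.
MODEL_POOL = ["gpt-high-fidelity", "gemini-pro", "gemini-flash", "llama-70b"]
_rotation = cycle(MODEL_POOL)

def next_model() -> str:
    """Round-robin over the pool so no single model's style (or its
    rate limit) dominates the generated batches."""
    return next(_rotation)

# First four assignments cover the whole pool, then the cycle wraps around.
assignments = [next_model() for _ in range(6)]
```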

5. 3-Layer Validation Pipeline

| Layer | Type | Purpose |
|---|---|---|
| 1. Structural | Pydantic Schema | JSON well-formedness, required keys, type correctness |
| 2. Quality | Heuristic Analysis | Anti-parroting, substantive reasoning (≥ 6 words), specific outputs |
| 3. Domain Logic | Safety & Semantics | Unsafe code detection (e.g., `rm -rf`), content validation |

Only 60-70% of generated examples survive all gates.

The Toolset (Schema)

The fine-tuned model is trained to route requests to these four deterministic tools:

| Tool | Capability | Logic Constraint |
|---|---|---|
| codebase_search | RAG / Semantic Search | Must choose exact vs. semantic vs. hybrid mode based on query type |
| file_manager | Read / Write / Patch / List | Requires explicit paths; no guessing. Strict validation rules |
| sandbox_exec | Python Interpreter | For calculation, verification, or logic testing only; 30s timeout |
| ask_human | Human-in-the-Loop | Triggered by ambiguity or high-risk actions (e.g., DB deletion) |
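As a rough sketch, the four argument schemas might look like the dataclasses below. The field names are assumptions for illustration; the authoritative Pydantic models live in src/schemas.py (Pydantic also enforces the `Literal` constraints at runtime, which plain dataclasses do not):

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class CodebaseSearchArgs:
    query: str
    mode: Literal["exact", "semantic", "hybrid"]  # chosen from the query type

@dataclass
class FileManagerArgs:
    operation: Literal["read", "write", "patch", "list"]
    path: str                      # must be explicit; no guessing
    content: Optional[str] = None  # only used for write/patch

@dataclass
class SandboxExecArgs:
    code: str
    timeout_s: int = 30            # hard 30-second ceiling

@dataclass
class AskHumanArgs:
    question: str
    reason: Literal["ambiguity", "high_risk"]

TOOL_REGISTRY = {
    "codebase_search": CodebaseSearchArgs,
    "file_manager": FileManagerArgs,
    "sandbox_exec": SandboxExecArgs,
    "ask_human": AskHumanArgs,
}
```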

Output Format: Discriminated Union

The model produces a flattened discriminated union with two variants:

Type A: Tool Invocation (status="running")

```json
{
  "status": "running",
  "thought": "User needs to find authentication logic. Semantic search is appropriate.",
  "tool_use": {
    "tool_name": "codebase_search",
    "arguments": {
      "query": "authentication middleware JWT validation",
      "mode": "semantic"
    }
  },
  "final_answer": null
}
```

Type B: Direct Answer (status="complete")

```json
{
  "status": "complete",
  "thought": null,
  "tool_use": null,
  "final_answer": "The server runs on port 8000 by default. Override with the PORT environment variable."
}
```
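A minimal validator for this union can be sketched as follows, assuming only the invariants shown above (`status` discriminates the variant, and the unused arm must be null):

```python
import json

def parse_router_output(raw: str) -> dict:
    """Validate the flattened union: 'running' must carry tool_use (and a
    null final_answer); 'complete' must carry final_answer (and a null tool_use)."""
    obj = json.loads(raw)
    status = obj.get("status")
    if status == "running":
        if obj.get("tool_use") is None or obj.get("final_answer") is not None:
            raise ValueError("running output needs tool_use and no final_answer")
    elif status == "complete":
        if obj.get("final_answer") is None or obj.get("tool_use") is not None:
            raise ValueError("complete output needs final_answer and no tool_use")
    else:
        raise ValueError(f"unknown status: {status!r}")
    return obj
```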

Usage

This project uses uv for modern, fast Python dependency management.

1. Setup Environment

Create a .env file with your API keys:

GROQ_API_KEY=gsk_...
GEMINI_API_KEY=AIza...
HF_TOKEN=hf_...

2. Generate Synthetic Data

Run the factory to create data/raw/router_train.jsonl:

uv run generate_data.py

Note: Check src/config.py to adjust batch size, domains, and personas.

3. Format for Llama-3

Convert the raw JSONL into ChatML format:

uv run src/formatting.py
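Conceptually, the formatter renders each (user, assistant-JSON) pair into the Llama-3 chat template. The sketch below uses a placeholder system prompt and a hypothetical record layout; the real versions are defined in src/prompts.py and src/formatting.py:

```python
import json

# Llama-3 chat-template special tokens.
TURN = "<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

def to_llama3_chat(example: dict,
                   system_prompt: str = "You are an intent router.") -> str:
    """Render one {user, output} record into the Llama-3 training format.
    Compact JSON (no spaces) keeps the assistant turn token-lean."""
    assistant = json.dumps(example["output"], separators=(",", ":"))
    turns = [("system", system_prompt),
             ("user", example["user"]),
             ("assistant", assistant)]
    return "<|begin_of_text|>" + "".join(
        TURN.format(role=r, content=c) for r, c in turns)
```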

4. Upload to HuggingFace

Version control your dataset:

uv run src/upload_data.py

5. Fine-Tune with Unsloth

Use the provided Colab notebook in notebooks/ or follow the training guide in the dataset card.

Dataset Statistics

  • Total Examples: 502 (451 train / 51 test)
  • Intent Distribution:
    • 35% Search operations
    • 24% Compute/execution tasks
    • 18% File modifications
    • 15% Direct answers
    • 8% Human escalations
  • Diversity Metrics:
    • 40+ domains (E-commerce, Healthcare, Fintech, ML, Gaming, etc.)
    • 35+ personas (SRE, CTO, QA, Data Scientist, Junior Dev, etc.)
    • 70+ query styles (Fragmented, narrative, urgent, code-mixed, etc.)

Key Learnings & Design Decisions

Why Multi-Model Generation?

Single-model synthetic data suffers from mode collapse and stylistic uniformity. By rotating between GPT-5.2, Gemini 2.5, and Llama-3.3, we achieved:

  • Diverse linguistic patterns
  • Robust generalization across domains
  • API rate limit distribution

Why Strict Validation?

Common synthetic datasets accept malformed outputs that hurt model performance. Our 3-layer validation ensures:

  • Only production-ready examples enter training
  • Unsafe operations are filtered
  • Generic or parroted responses are rejected

Why Discriminated Union Schema?

The status field acts as a type discriminator, allowing the model to learn:

  • When to invoke tools vs. answer directly
  • Proper null handling (avoiding the "null tax")
  • Clean separation between reasoning and action

Knowledge Base

Check the docs/ folder for research notes on:

  • LoRA / PEFT: Why we freeze base weights and only train adapters
  • Unsloth: Optimization techniques for 2x faster training with lower memory
  • Data Hygiene: Schema validation rules and quality filtering
  • Prompt Engineering: Meta-prompting architecture for generation

Roadmap

  - [x] Phase 1: Synthetic Data Engineering
  - [x] Phase 2: Dataset Publication on HuggingFace
  - [ ] Phase 3: QLoRA Fine-Tuning with Unsloth (in progress)
  - [ ] Phase 4: Evaluation Suite (Accuracy, Latency, Safety)
  - [ ] Phase 5: Production Deployment & Benchmarking

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Unsloth for efficient fine-tuning infrastructure
  • Instructor for structured output generation
  • Anthropic, OpenAI, Google for frontier model access
  • HuggingFace for dataset hosting

Contact

Riya Sangwan - @ria-19

Dataset: tai-tai-sama/semantic-router-dataset


"Synthetic data is only as good as the validation pipeline that produces it."

About

A QLoRA fine-tuned Llama-3.1-8B-Instruct AI agent for structured outputs, secure Python sandbox execution, hybrid code search, web scraping, and safe human-in-the-loop automation. Perfect for developers, researchers, and AI-assisted coding.
