- Phase 1: Synthetic Data Engineering (Complete)
- Phase 2: Dataset Publication (Complete)
- Phase 3: QLoRA Fine-Tuning (In Progress)
To replace massive, high-latency System Prompts in Agent architectures with a specialized, fine-tuned Adapter.
This project builds the "Brain" of an Autonomous Coding Agent. Instead of relying on a generic LLM to guess how to use tools, we are fine-tuning Llama-3.1-8B to function as a deterministic Intent Router that converts natural language queries into strict, executable JSON tool calls with < 200ms latency.
The training dataset is now publicly available on HuggingFace:
tai-tai-sama/semantic-router-dataset
- 502 high-fidelity examples (451 train / 51 test)
- Generated by GPT-5.2, Gemini 2.5, Llama-3.3 ensemble
- 60-70% validation pass rate (only production-ready examples survive)
- Stratified by intent distribution and status types
- MIT Licensed for commercial use
We moved beyond simple scripting to a modular Data Factory approach. The pipeline generates high-diversity training examples (Domains + Personas + Edge Cases) while enforcing strict logic constraints through a 3-layer validation system.
```
.
├── generate_data.py      # Main entry point for the data factory
├── src/
│   ├── schemas.py        # Pydantic definitions for all tools
│   ├── prompts.py        # Advanced prompt logic (CoT, Anti-Hallucination)
│   ├── generator.py      # Multi-model generation logic
│   ├── client.py         # Instructor client setup
│   ├── formatting.py     # ChatML Formatter: JSON → Llama-3 Training Format
│   ├── upload_data.py    # HuggingFace Dataset Uploader
│   └── utils.py          # Validation & Null-Tax removal
├── data/
│   ├── raw/              # JSONL files (User → Thought → Tool)
│   └── formatted/        # ChatML formatted training data
└── notebooks/            # Colab notebooks for Unsloth training
```
This is not just random synthetic data. The pipeline implements Logic-Aware Generation to prevent common agent failures:
- Constraint: The generator distinguishes between Discovery (Search) and Action (File Ops).
- Result: The agent never attempts to read a file (`file_manager`) unless the user explicitly provides the path or context.
- Constraint: Output schemas use `exclude_none=True`.
- Result: We save ~20% token usage per example by stripping empty fields (e.g., `content: null` during read operations), resulting in faster inference.
- Constraint: Enforced minimum 6-word reasoning and explicit tool-selection logic.
- Result: The model doesn't just parrot the user; it explains why it chose a tool (e.g., "User is asking for a concept, so I must use semantic search, not exact match").
- Strategy: We rotate between GPT-5.2 (high fidelity), Gemini 2.5 Pro (advanced reasoning), Gemini 2.5 Flash (speed), and Llama-3.3-70B (diversity) to prevent model collapse and distribute API rate-limit pressure.
- Result: Linguistic diversity across 40+ domains, 35+ personas, and 70+ query styles.
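The "null tax" removal can be illustrated with a minimal, dependency-free sketch. In the actual pipeline this is handled by Pydantic's `exclude_none=True`; the example payload and the exact savings below are illustrative only:

```python
import json

def strip_nulls(obj):
    """Recursively drop keys whose value is None ("null tax" removal)."""
    if isinstance(obj, dict):
        return {k: strip_nulls(v) for k, v in obj.items() if v is not None}
    if isinstance(obj, list):
        return [strip_nulls(v) for v in obj]
    return obj

example = {
    "status": "running",
    "thought": "User wants a file read, so file_manager is the right tool.",
    "tool_use": {
        "tool_name": "file_manager",
        "arguments": {"action": "read", "path": "src/app.py", "content": None},
    },
    "final_answer": None,
}

full = json.dumps(example)
lean = json.dumps(strip_nulls(example))
print(f"{len(full)} -> {len(lean)} chars ({1 - len(lean) / len(full):.0%} smaller)")
```

Fewer serialized characters means fewer tokens per training example and, downstream, faster inference.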
| Layer | Type | Purpose |
|---|---|---|
| 1. Structural | Pydantic Schema | JSON well-formedness, required keys, type correctness |
| 2. Quality | Heuristic Analysis | Anti-parroting, substantive reasoning (>6 words), specific outputs |
| 3. Domain Logic | Safety & Semantics | Unsafe code detection (e.g., rm -rf), content validation |
Only 60-70% of generated examples survive all gates.
The fine-tuned model is trained to route requests to these four deterministic tools:
| Tool | Capability | Logic Constraint |
|---|---|---|
| `codebase_search` | RAG / Semantic Search | Must choose exact vs. semantic vs. hybrid mode based on query type |
| `file_manager` | Read / Write / Patch / List | Requires explicit paths. No guessing. Strict validation rules |
| `sandbox_exec` | Python Interpreter | For calculation, verification, or logic testing only. 30s timeout |
| `ask_human` | Human-in-the-Loop | Triggered by ambiguity or high-risk actions (e.g., DB deletion) |
The model produces a flattened discriminated union with two variants:
Type A: Tool Invocation (`status="running"`)

```json
{
  "status": "running",
  "thought": "User needs to find authentication logic. Semantic search is appropriate.",
  "tool_use": {
    "tool_name": "codebase_search",
    "arguments": {
      "query": "authentication middleware JWT validation",
      "mode": "semantic"
    }
  },
  "final_answer": null
}
```

Type B: Direct Answer (`status="complete"`)
```json
{
  "status": "complete",
  "thought": null,
  "tool_use": null,
  "final_answer": "The server runs on port 8000 by default. Override with the PORT environment variable."
}
```

This project uses `uv` for modern, fast Python dependency management.
Create a .env file with your API keys:
```
GROQ_API_KEY=gsk_...
GEMINI_API_KEY=AIza...
HF_TOKEN=hf_...
```

Run the factory to create `data/raw/router_train.jsonl`:
```bash
uv run generate_data.py
```

Note: Check `src/config.py` to adjust batch size, domains, and personas.
Convert the raw JSONL into ChatML format:
```bash
uv run src/formatting.py
```

Version control your dataset:

```bash
uv run src/upload_data.py
```

Use the provided Colab notebook in `notebooks/` or follow the training guide in the dataset card.
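The raw-to-ChatML conversion performed by `src/formatting.py` might look roughly like the following sketch. The raw-record shape and system prompt are assumptions; the target is the Llama-3 chat template with its standard special tokens:

```python
import json

SYSTEM_PROMPT = "You are an intent router. Reply with a single JSON tool call."  # assumed

def to_llama3_chat(record: dict) -> str:
    """Render one raw example (user query + router output) as Llama-3 chat text."""
    # Strip null fields ("null tax") before serializing the assistant turn.
    assistant_payload = json.dumps(
        {k: v for k, v in record["output"].items() if v is not None}
    )
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{SYSTEM_PROMPT}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{record['user']}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n{assistant_payload}<|eot_id|>"
    )
```

Each rendered string becomes one line of the formatted training file consumed by the Unsloth notebook.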
- Total Examples: 502 (451 train / 51 test)
- Intent Distribution:
- 35% Search operations
- 24% Compute/execution tasks
- 18% File modifications
- 15% Direct answers
- 8% Human escalations
- Diversity Metrics:
- 40+ domains (E-commerce, Healthcare, Fintech, ML, Gaming, etc.)
- 35+ personas (SRE, CTO, QA, Data Scientist, Junior Dev, etc.)
- 70+ query styles (Fragmented, narrative, urgent, code-mixed, etc.)
Single-model synthetic data suffers from mode collapse and stylistic uniformity. By rotating between GPT-5.2, Gemini 2.5, and Llama-3.3, we achieved:
- Diverse linguistic patterns
- Robust generalization across domains
- API rate limit distribution
Common synthetic datasets accept malformed outputs that hurt model performance. Our 3-layer validation ensures:
- Only production-ready examples enter training
- Unsafe operations are filtered
- Generic or parroted responses are rejected
The status field acts as a type discriminator, allowing the model to learn:
- When to invoke tools vs. answer directly
- Proper null handling (avoiding the "null tax")
- Clean separation between reasoning and action
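The invariants implied by the `status` discriminator can be expressed as a small validator. This is a sketch mirroring the two example payloads above; the project's `schemas.py` presumably encodes the same rules in Pydantic:

```python
def check_invariants(output: dict) -> bool:
    """status='running' pairs with thought+tool_use; status='complete' with final_answer."""
    status = output.get("status")
    if status == "running":
        return (output.get("thought") is not None
                and output.get("tool_use") is not None
                and output.get("final_answer") is None)
    if status == "complete":
        return (output.get("thought") is None
                and output.get("tool_use") is None
                and output.get("final_answer") is not None)
    return False  # unknown discriminator value
```

Training only on examples that satisfy these invariants is what lets the model learn the tool-vs-answer decision as a clean binary.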
Check the docs/ folder for research notes on:
- LoRA / PEFT: Why we freeze base weights and only train adapters
- Unsloth: Optimization techniques for 2x faster training with lower memory
- Data Hygiene: Schema validation rules and quality filtering
- Prompt Engineering: Meta-prompting architecture for generation
- Phase 1: Synthetic Data Engineering
- Phase 2: Dataset Publication on HuggingFace
- Phase 3: QLoRA Fine-Tuning with Unsloth
- Phase 4: Evaluation Suite (Accuracy, Latency, Safety)
- Phase 5: Production Deployment & Benchmarking
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Unsloth for efficient fine-tuning infrastructure
- Instructor for structured output generation
- Anthropic, OpenAI, Google for frontier model access
- HuggingFace for dataset hosting
Riya Sangwan - @ria-19
Dataset: tai-tai-sama/semantic-router-dataset
"Synthetic data is only as good as the validation pipeline that produces it."