QueryForge Analytics

Personal LLM analytics product by Nikola Pantelic
GitHub: NikolaPantelic

Overview

QueryForge turns natural-language analytics questions into safe SQL, executes it on SQLite, and returns a concise answer with full execution metadata.

Pipeline stages:

  1. SQL generation (LLM)
  2. SQL validation
  3. SQL execution
  4. Answer generation (LLM)
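The four stages above can be sketched as a single control flow. This is an illustrative, runnable sketch only: the two LLM stages are stubbed with fixed outputs, and all function names are assumptions, not the actual src/pipeline.py API.

```python
import sqlite3

def generate_sql(question: str) -> str:
    # Stage 1 (an OpenRouter LLM call in the real pipeline): stubbed here.
    return ("SELECT addiction_level, COUNT(*) AS n FROM respondents "
            "GROUP BY addiction_level ORDER BY n DESC LIMIT 1")

def validate_sql(sql: str) -> None:
    # Stage 2: minimal guardrail - SELECT-only and single statement.
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are allowed")
    if ";" in sql.rstrip().rstrip(";"):
        raise ValueError("multiple statements are not allowed")

def run_pipeline(question: str, con: sqlite3.Connection) -> dict:
    sql = generate_sql(question)
    validate_sql(sql)
    rows = con.execute(sql).fetchall()  # stage 3: execute on SQLite
    # Stage 4 (LLM answer synthesis in the real pipeline): stubbed here.
    answer = f"Top bucket: {rows[0][0]} with {rows[0][1]} respondents"
    return {"sql": sql, "rows": rows, "answer": answer}

# Tiny in-memory fixture so the flow runs end to end.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE respondents (addiction_level TEXT)")
con.executemany("INSERT INTO respondents VALUES (?)",
                [("high",), ("high",), ("low",)])
result = run_pipeline("Which addiction level has the most respondents?", con)
print(result["answer"])
```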

Core goals:

  • Reliable SQL generation for analytical prompts
  • Strong query-safety guardrails
  • Production-grade observability and error handling
  • Good latency and token efficiency

Product Features

  • OpenRouter-backed LLM integration for SQL and answer synthesis
  • Multi-layer SQL validation (SELECT-only, forbidden operations, multi-statement blocking, column checks)
  • Token accounting (prompt_tokens, completion_tokens, total_tokens, llm_calls)
  • Structured timing and stage-level outputs in PipelineOutput
  • Benchmark script for repeatable latency/success measurement
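The validation layers listed above (SELECT-only, forbidden operations, multi-statement blocking) can be approximated as below. This is a sketch, not the project's actual validator; the real checks in src/pipeline.py also include column verification and may differ in detail.

```python
import re

# Illustrative deny-list; the project's actual list may differ.
FORBIDDEN = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "ATTACH", "PRAGMA"}

def validate(sql: str) -> list[str]:
    """Return a list of guardrail violations (empty list means the query passed)."""
    errors = []
    stripped = sql.strip().rstrip(";")
    if not stripped.upper().startswith("SELECT"):
        errors.append("query must be SELECT-only")
    if ";" in stripped:
        errors.append("multiple statements are blocked")
    tokens = set(re.findall(r"[A-Za-z_]+", stripped.upper()))
    bad = tokens & FORBIDDEN
    if bad:
        errors.append(f"forbidden operations: {sorted(bad)}")
    return errors

print(validate("SELECT hours_played FROM gaming_mental_health"))  # []
print(validate("DROP TABLE gaming_mental_health; SELECT 1"))      # three violations
```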

Requirements

  • Python 3.13+
  • Dependencies in requirements.txt (openrouter, pandas)
  • OPENROUTER_API_KEY environment variable

Data Setup

This repository expects the Kaggle dataset CSV to be downloaded manually, then converted to SQLite:

  1. Open the Gaming and Mental Health dataset on Kaggle
  2. Download gaming_mental_health_10M_40features.csv
  3. Place it in data/
  4. Build the SQLite DB:
python3 -m pip install -r requirements.txt
python3 scripts/gaming_csv_to_db.py
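The ingestion step amounts to streaming the CSV into SQLite. A rough sketch, assuming pandas chunked reads (the actual scripts/gaming_csv_to_db.py may differ; chunking bounds memory on the 10M-row file):

```python
import sqlite3

import pandas as pd

def csv_to_sqlite(csv_path: str, db_path: str,
                  table: str = "gaming_mental_health",
                  chunksize: int = 100_000) -> int:
    """Stream the CSV into SQLite chunk by chunk; return rows written."""
    total = 0
    with sqlite3.connect(db_path) as con:
        for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=chunksize)):
            # First chunk replaces any stale table, later chunks append.
            chunk.to_sql(table, con,
                         if_exists="replace" if i == 0 else "append",
                         index=False)
            total += len(chunk)
    return total

# Small self-contained demo with a throwaway CSV.
import os, tempfile
tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, "sample.csv")
pd.DataFrame({"hours_played": [1, 2, 3],
              "anxiety_score": [5, 4, 3]}).to_csv(csv_path, index=False)
n = csv_to_sqlite(csv_path, os.path.join(tmp, "sample.db"))
print(n)  # 3
```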

Environment

Set your OpenRouter API key.

Windows (cmd):

set OPENROUTER_API_KEY=<your_key>

Linux/macOS:

export OPENROUTER_API_KEY=<your_key>

Optional env vars:

  • OPENROUTER_MODEL (default model override)
  • LOG_LEVEL (default: INFO)
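These optional variables are typically consumed like so (illustrative; the project's actual default model and logging setup may differ):

```python
import logging
import os

# None means "use the built-in default model" in this sketch.
model = os.getenv("OPENROUTER_MODEL")
# LOG_LEVEL falls back to INFO when unset, matching the documented default.
log_level = os.getenv("LOG_LEVEL", "INFO")
logging.basicConfig(level=getattr(logging, log_level.upper(), logging.INFO))
print(log_level)
```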

Run Tests

python3 -m unittest discover -s tests -p "test_public.py"
python3 -m unittest discover -s tests -p "test_validator.py"

Benchmark

python3 scripts/benchmark.py --runs 3

Outputs summary JSON including:

  • success_rate
  • avg_ms
  • p50_ms
  • p95_ms
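The summary fields above can be derived from raw per-run latencies with the standard library. A sketch, assuming this aggregation (scripts/benchmark.py may compute the percentiles differently):

```python
import statistics

def summarize(latencies_ms: list[float], failures: int) -> dict:
    """Aggregate successful-run latencies plus a failure count into summary stats."""
    runs = len(latencies_ms) + failures
    # quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile.
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "success_rate": len(latencies_ms) / runs,
        "avg_ms": statistics.mean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": qs[94],
    }

summary = summarize([120.0, 150.0, 300.0, 180.0], failures=1)
print(summary)
```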

Programmatic Usage

from src.pipeline import AnalyticsPipeline

pipeline = AnalyticsPipeline()
result = pipeline.run("Which addiction level bucket has the highest number of respondents?")

print(result.status)
print(result.sql)
print(result.answer)
print(result.total_llm_stats)

Repository Map

  • src/pipeline.py: orchestration, SQL validation, execution
  • src/llm_client.py: OpenRouter calls, prompting, token stats
  • src/types.py: dataclasses for stage and final outputs
  • scripts/gaming_csv_to_db.py: CSV to SQLite ingestion
  • scripts/benchmark.py: latency/success benchmark runner
  • tests/: integration and validation tests

Status

Current focus is single-turn analytics over one table (gaming_mental_health).
Planned extensions include multi-turn follow-up support, parser-based SQL validation, and metrics export integrations.

