growlf/ai-stack

ai-stack

A federated, observable AI cluster for your own hardware.

Run a full AI development stack — local LLMs, cloud augmentation, smart routing, and a live observability dashboard — on hardware you already own. No subscription, no token metering, no vendor lock-in.



What is this?

ai-stack turns one or more machines into a federated AI herd — local models where possible, cloud where needed, with a live dashboard so you can see exactly what's running where.

  • Local-first — models run on your GPU/CPU; Ollama handles the inference
  • Cloud-augmented — Claude, Gemini, OpenAI, or OpenCode Zen via LiteLLM when local isn't enough
  • Smart routing — an LLM-based classifier picks the right model for each request automatically
  • Federated — Olla discovers peers and load-balances across the herd; each node contributes what its hardware can serve
  • Observable — Shepherd watches every node and surfaces the herd state at a glance

The Herd Vision

Every node running ai-stack is a member of a federated AI herd. Models live where the hardware fits; routing decisions span the herd. A 24GB GPU box runs the 14B models; a smaller box runs the 7B workhorses; a Pi-class node can contribute a classifier slot. When you send a query, the Router picks the best model, and Olla picks the best healthy node serving it.

graph TB
    subgraph NodeA["Node A — GPU 24GB"]
        OL_A[Ollama A<br/>qwen2.5-coder 14B<br/>deepseek-r1 14B]
        SN_A[shepherd-node A]
    end

    subgraph NodeB["Node B — Desktop 32GB"]
        OL_B[Ollama B<br/>llama3.1 8B<br/>qwen2.5 7B]
        SN_B[shepherd-node B]
    end

    subgraph NodeC["Node C — Laptop 8GB"]
        OL_C[Ollama C<br/>qwen2.5 1.5B classifier]
        SN_C[shepherd-node C]
    end

    subgraph ControlPlane["Control plane (one designated node)"]
        SC[shepherd-control<br/>:40117/ui]
        OLLA[Olla :40114<br/>federation + load-balance]
        RTR[Router :40115<br/>smart auto-pick]
        LL[LiteLLM :4000<br/>cloud gateway]
    end

    subgraph Cloud
        CLAUDE[Claude / Gemini /<br/>OpenCode Zen]
    end

    RTR --> OLLA
    OLLA -.federates.-> OL_A
    OLLA -.federates.-> OL_B
    OLLA -.federates.-> OL_C
    OLLA --> LL
    LL --> CLAUDE

    SC -.polls metrics.-> SN_A
    SC -.polls metrics.-> SN_B
    SC -.polls metrics.-> SN_C

Each node serves whatever models its hardware fits. Olla federates discovery and routing. Shepherd observes the herd. The Router picks the best model per query.
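The two-stage decision described above (a model is chosen first, then a healthy node serving it) can be sketched in a few lines. This is only an illustration: the `Node` structure, the `active_requests` field, and the least-busy tie-break are assumptions made for the example, not Olla's actual load-balancing strategy.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical view of one herd member (not an Olla data structure)."""
    name: str
    healthy: bool
    models: Set[str]
    active_requests: int = 0


def pick_node(model: str, nodes: List[Node]) -> Optional[Node]:
    """Return the least-busy healthy node that serves `model`, or None."""
    candidates = [n for n in nodes if n.healthy and model in n.models]
    return min(candidates, key=lambda n: n.active_requests, default=None)


# Example herd loosely mirroring the diagram above.
herd = [
    Node("node-a", True, {"qwen2.5-coder:14b", "deepseek-r1:14b"}, active_requests=2),
    Node("node-b", True, {"llama3.1:8b", "qwen2.5:7b"}),
    Node("node-c", False, {"qwen2.5:1.5b"}),  # unhealthy, so never selected
]
```

With this herd, `pick_node("qwen2.5-coder:14b", herd)` returns node-a, while a request for the classifier model returns `None` because its only host is marked unhealthy.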


Live Dashboards

Once running, open these in your browser:

| Dashboard | URL | What you see |
| --- | --- | --- |
| Shepherd (herd observability) | http://<control-host>:40117/ | Live per-node CPU/RAM/GPU + resident models + federation peers across the herd |
| Router gestalt | http://<control-host>:40115/gestalt/ui | Live routing decisions, model usage, SSE feed |
| LiteLLM admin | http://<control-host>:4000/ui | Cloud model usage, costs, API keys |
| Olla endpoint health | http://<control-host>:40114/internal/status/endpoints | Federation peer health |

Shepherd herd dashboard

Shepherd is the herd's observability layer — a per-node sidecar (shepherd-node) reports system + hardware + Ollama state every few seconds, and a central control plane (shepherd-control) aggregates the herd into one live view. Each node appears as a card showing:

  • Hardware vendor + accelerator (NVIDIA GPU, Intel Arc/Iris, AMD ROCm, Apple Silicon, or CPU-only — auto-detected via probe adapters)
  • CPU / RAM / network gauges
  • Resident models + warm-state from Ollama
  • Federation peers visible via Olla

The dashboard refreshes via Server-Sent Events as state changes. When a model warms, when a node drops off the mesh, when a peer joins — the dashboard reflects it within seconds.
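For readers who want to consume the same kind of stream outside the browser, here is a minimal sketch of a Server-Sent Events parser. It follows the general SSE wire format (`event:` and `data:` fields, blank line as delimiter, lines starting with `:` as comments). The event name `node_update` and the JSON payload are invented for the example; the actual event names and payloads Shepherd emits are not documented here.

```python
import json
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event, if any data was collected.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # strip the single optional leading space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream fragment: one update followed by a keep-alive comment.
sample = [
    "event: node_update\n",
    'data: {"node": "node-a", "warm": true}\n',
    "\n",
    ": keep-alive\n",
    "\n",
]
events = list(parse_sse(sample))
payload = json.loads(events[0]["data"])
```

In practice the `lines` argument would be the decoded lines of a streaming HTTP response rather than a list.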

Router gestalt view

The router view (/gestalt/ui) shows a live feed of every routing decision — which model was selected, which node it was sent to, and why — as it happens. Send a query from OpenCode and watch the entry appear within a second.

To see these in action: docker compose up -d --build → open the Shepherd dashboard → send a query from OpenCode → watch the routing decision appear in the gestalt view simultaneously.


Architecture

 OpenCode (AI IDE + Obsidian plugin)
     │
     ├── tool ──▶  Retriever :42000      Obsidian vault RAG (hybrid BM25 + vector)
     │
     ├── provider ▶  Router :40115       LLM-based smart model classifier
     │                   │
     │               Olla :40114         Federation + load balancer
     │                   │
     │           ┌───────┴────────┐
     │         Ollama          LiteLLM
     │         :11434           :4000
     │       (per-node     (Claude / Gemini /
     │        local GPU)    OpenCode Zen)
     │
     └── provider ▶  LiteLLM :4000       Direct cloud access (optional)

 Shepherd (herd observability)
     │
     ├── shepherd-control :40117/        D3 dashboard + SSE stream to browser
     │                                   Polls each peer's /herd/metrics
     │
     └── shepherd-node :40116            Per-node sidecar — runs on every herd peer
         ├── /herd/metrics               System + hardware + Ollama + Olla snapshot
         ├── /herd/verify                Raw secondary-source data for divergence detection
         ├── /herd/schema                JSON Schema of the metrics document
         ├── /herd/capabilities          Orchestrator-facing resident-models summary
         └── /herd/healthz               Liveness check
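As a rough illustration of how a client might talk to the shepherd-node endpoints listed above, the sketch below builds the URLs and fetches the metrics document with the standard library. The port and paths come from the diagram; the helper names, the timeout, and the assumption that no authentication header is required are mine.

```python
import json
import urllib.request

# Endpoint names taken from the architecture diagram above.
HERD_ENDPOINTS = ("metrics", "verify", "schema", "capabilities", "healthz")


def herd_url(host: str, endpoint: str, port: int = 40116) -> str:
    """Build the URL of a shepherd-node endpoint on the given host."""
    if endpoint not in HERD_ENDPOINTS:
        raise ValueError(f"unknown herd endpoint: {endpoint!r}")
    return f"http://{host}:{port}/herd/{endpoint}"


def fetch_metrics(host: str, timeout: float = 3.0) -> dict:
    """Fetch and decode /herd/metrics (assumes an unauthenticated JSON response)."""
    with urllib.request.urlopen(herd_url(host, "metrics"), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_metrics("localhost")` on a machine running shepherd-node should return the snapshot as a dictionary; its exact fields are described by `/herd/schema`.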

Quick Start

git clone https://github.com/growlf/ai-stack.git
cd ai-stack
./install.sh

The installer auto-detects your GPU (NVIDIA, Intel Arc, or CPU-only), creates .env, generates your API key, and starts the stack. Open http://localhost:40117/ (Shepherd dashboard) when it completes.

Older Intel iGPU? (Iris Pro / Iris / UHD / Gen 9 etc., pre-Arc.) install.sh falls back to CPU-only for these — ipex-llm doesn't support pre-11th-Gen iGPUs. Use the Vulkan path instead:

./scripts/install-vulkan-ollama.sh   # native Ollama + Vulkan/Mesa ANV

See docs/hardware/intel-igpu-vulkan.md for the procedure and supported hardware list. The rest of ai-stack (Olla, LiteLLM, Router, Shepherd) still runs via docker compose — it just connects to the native Ollama on localhost:11434.

To add cloud models (Claude, Gemini, OpenCode Zen) after install:

echo 'ANTHROPIC_API_KEY=sk-ant-...' >> .env
echo 'GEMINI_API_KEY=AI...' >> .env
echo 'OPENCODE_ZEN_API_KEY=sk-...' >> .env
docker compose restart litellm

Full install guide → docs/install.md


Services

| Service | Port | Purpose |
| --- | --- | --- |
| Ollama | 11434 | Local LLM inference (CPU / CUDA / Arc / Iris / ROCm) |
| LiteLLM | 4000 | Cloud API gateway (Claude, Gemini, OpenAI, OpenCode Zen) |
| Olla | 40114 | Federation + load balancer across herd nodes |
| Router | 40115 | Smart model classifier + routing dashboard |
| Retriever | 42000 | Obsidian vault RAG — hybrid BM25 + vector search |
| Shepherd-node | 40116 | Per-node observability sidecar (runs on every herd peer) |
| Shepherd-control | 40117 | Central dashboard + herd aggregator (runs on one designated node) |

Smart Routing

The Router classifies every request and picks the right model automatically:

| Query type | Model selected |
| --- | --- |
| Code / scripting | qwen2.5-coder:14b |
| Deep reasoning | deepseek-r1:14b |
| Long documents | gemma3:12b |
| Tool calling | llama3.1:8b |
| General chat | qwen2.5:7b |
| Cloud / complex | Claude / Gemini / Big Pickle via LiteLLM |

The classifier uses a tiny local model (qwen2.5:1.5b) to categorize queries in ~100ms before routing. Cloud models are detected by name and passed through unchanged.
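The mapping step after classification can be pictured as a simple lookup. The sketch below assumes the classifier has already produced a category label. The label strings, the cloud-name prefixes, and the fallback to the general-chat model are illustrative assumptions, not the Router's actual configuration.

```python
# Local models from the table above, keyed by hypothetical category labels.
ROUTES = {
    "code": "qwen2.5-coder:14b",
    "reasoning": "deepseek-r1:14b",
    "long_document": "gemma3:12b",
    "tool_calling": "llama3.1:8b",
    "chat": "qwen2.5:7b",
}

# Assumed name prefixes that mark a request as targeting a cloud model.
CLOUD_PREFIXES = ("claude", "gemini", "gpt")


def resolve_model(requested: str, category: str) -> str:
    """Pass cloud model names through unchanged; otherwise route by category."""
    if requested.lower().startswith(CLOUD_PREFIXES):
        return requested
    # Unknown categories fall back to the general-chat model (an assumption).
    return ROUTES.get(category, ROUTES["chat"])
```

For example, `resolve_model("auto", "code")` yields `qwen2.5-coder:14b`, while a request that names a Claude model is returned as-is regardless of category.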


Shepherd — Herd Observability

Shepherd is ai-stack's observability layer — a lightweight pair of services that watches every node and surfaces the herd state at a glance. It supersedes the earlier monolithic Apostle agent with a smaller, honest-by-construction service: every claim the system makes about itself is cross-checked against an independent source.

What Shepherd does

  1. Introspects — each shepherd-node reads CPU, RAM, network, and hardware-accelerator state (NVIDIA via nvidia-smi, Intel Arc via SYCL/level-zero, AMD via rocm-smi, etc.) and exposes a unified /herd/metrics JSON.
  2. Aggregates — shepherd-control polls each peer's metrics every few seconds and streams the herd view to the browser via SSE.
  3. Cross-verifies (v0.2 / Plan v3 — in progress) — three-source check on each node: container probe + host probe + Ollama self-report. Divergence between them surfaces as an explicit alert.
  4. Auto-recovers (v0.2 / Plan v3 — in progress) — when divergence persists, attempts docker restart of the affected service with circuit-breaker; alerts only if recovery fails.
  5. Snapshots known-good state (v0.2 — in progress) — captures kernel + driver + container digests + warm-test timing whenever the canary stays green; gives the herd a baseline to restore from rather than re-derive working configs after regressions.
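Steps 3 and 4 above are still in progress, so the following is only a conceptual sketch of the two ideas, not Shepherd's implementation. It assumes each of the three sources can be reduced to a set of reported facts (for example, resident model names), and that recovery is a callable returning success or failure. The attempt limit of three and the returned status strings are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Set


def find_divergence(container: Set[str], host: Set[str], ollama: Set[str]) -> Set[str]:
    """Return items reported by at least one source but not by all three."""
    return (container | host | ollama) - (container & host & ollama)


@dataclass
class CircuitBreaker:
    """Stops retrying recovery after `max_attempts` consecutive failures."""
    max_attempts: int = 3
    attempts: int = 0

    @property
    def open(self) -> bool:
        return self.attempts >= self.max_attempts

    def try_recover(self, restart: Callable[[], bool]) -> str:
        if self.open:
            return "alert"  # breaker is open: do not restart again
        self.attempts += 1
        if restart():
            self.attempts = 0  # success resets the breaker
            return "recovered"
        return "alert" if self.open else "retry"
```

In a real deployment the `restart` callable would wrap something like a `docker restart` of the affected service; here it is left abstract so the control flow can be read on its own.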

Hardware probe support

Each shepherd-node loads a hardware-probe adapter matching the node's accelerator. The adapter pattern is vendor-pluggable: implementing a new hardware family is one file under shepherd/shepherd_node/probes/.

| Hardware | Probe status |
| --- | --- |
| NVIDIA | ✅ implemented |
| CPU-only | ✅ implemented |
| Intel Arc | 🟡 stub — metrics pending kernel/ipex-llm work |
| Intel Iris / UHD | 🟡 stub — metrics pending |
| AMD ROCm | 🟡 stub — awaits first contributor |
| Apple Silicon | 🟡 stub — awaits first contributor |

See shepherd/README.md for the full Shepherd service docs, endpoint reference, and contribution guide.
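To give a feel for what a vendor-pluggable probe adapter could look like, here is a hedged sketch. The class names, method names, and returned fields are hypothetical and do not reflect the real interface under shepherd/shepherd_node/probes/; consult shepherd/README.md for the actual contract. The `nvidia-smi` query flags used are standard options of that tool.

```python
import shutil
import subprocess
from abc import ABC, abstractmethod
from typing import List


class HardwareProbe(ABC):
    """Hypothetical adapter interface: one subclass per hardware family."""
    vendor: str = "unknown"

    @abstractmethod
    def available(self) -> bool:
        """Return True if this probe can run on the current machine."""

    @abstractmethod
    def read(self) -> dict:
        """Return a snapshot of accelerator state."""


class CpuOnlyProbe(HardwareProbe):
    vendor = "cpu"

    def available(self) -> bool:
        return True  # always usable as the fallback

    def read(self) -> dict:
        return {"vendor": self.vendor, "gpus": []}


class NvidiaProbe(HardwareProbe):
    vendor = "nvidia"

    def available(self) -> bool:
        return shutil.which("nvidia-smi") is not None

    def read(self) -> dict:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        gpus = []
        for line in out.strip().splitlines():
            # Split from the right so a comma in the GPU name stays intact.
            name, used, total = [p.strip() for p in line.rsplit(",", 2)]
            gpus.append({"name": name, "memory_used_mib": used, "memory_total_mib": total})
        return {"vendor": self.vendor, "gpus": gpus}


def select_probe(probes: List[HardwareProbe]) -> HardwareProbe:
    """Return the first probe that reports itself available."""
    for probe in probes:
        if probe.available():
            return probe
    raise RuntimeError("no hardware probe available")


# Order matters: specific vendors first, CPU-only last as the fallback.
probe = select_probe([NvidiaProbe(), CpuOnlyProbe()])
```

Under this shape, supporting a new hardware family means adding one subclass and listing it ahead of the CPU fallback.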


Multi-Machine Setup (Federation)

Add remote Ollama nodes to your .env:

OLLAMA_REMOTE_1=http://192.168.1.10:11434
OLLAMA_REMOTE_2=http://192.168.1.20:11434

Then regenerate the Olla config:

scripts/generate-olla-config.sh
docker compose restart olla
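To sanity-check the entries before regenerating the config, something like the following can list the remote nodes declared in .env. It is a sketch that assumes plain `KEY=value` lines, and it is not a description of what generate-olla-config.sh does internally.

```python
import re
from typing import List

REMOTE_KEY = re.compile(r"^OLLAMA_REMOTE_(\d+)$")


def remote_nodes(env_text: str) -> List[str]:
    """Return OLLAMA_REMOTE_<n> URLs from .env text, ordered by <n>."""
    found = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        match = REMOTE_KEY.match(key.strip())
        if match:
            found[int(match.group(1))] = value.strip().strip("'\"")
    return [found[i] for i in sorted(found)]


example_env = """\
# remote herd members
OLLAMA_REMOTE_2=http://192.168.1.20:11434
OLLAMA_REMOTE_1=http://192.168.1.10:11434
ANTHROPIC_API_KEY=sk-ant-...
"""
nodes = remote_nodes(example_env)
```

To run it against a real file, pass the contents of .env, for example `remote_nodes(open(".env").read())`.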

For each remote node, also deploy shepherd-node so it appears on the dashboard:

# On the remote node, after cloning ai-stack:
scripts/shepherd-auto-deploy.sh node

The shepherd-auto-deploy.sh script supports a daily cron for self-updating peers — see docs/multi-machine.md for the full pattern (each peer auto-pulls main + redeploys nightly, so only the canonical control node needs manual updates).

Multi-machine guide → docs/multi-machine.md


Roadmap

  • Phase 1 — Federation + smart routing: Olla peer discovery, LiteLLM cloud gateway, Router LLM-classifier + per-tool routing
  • Phase 2 — Shepherd v0.1: per-node sidecar + control-plane dashboard, vendor-pluggable hardware probes, federation-aware multi-node view
  • Phase 3 (in progress) — Integrity & self-healing (Plan v3): in-container GPU canary, three-source divergence detection (/herd/verify), auto-recover on divergence with circuit-breaker, working-state snapshots per hardware adapter, latency-regression cron
  • Phase 4 — Cross-model tuning: canonical skill format + per-tool translators so OpenCode / Claude Code / Cursor / Continue / etc. share one source of truth for skills and prompts
  • Phase 5 — Herd intelligence: load-aware routing weights, proactive model pre-loading based on cluster demand, mDNS peer discovery, browser/mobile/burst-cloud participants

Documentation

| Guide | Description |
| --- | --- |
| Install | Full setup walkthrough |
| Getting started | First steps after install |
| Multi-machine | Connecting multiple nodes + shepherd-auto-deploy pattern |
| Smart router | How model routing works |
| Model guide | Choosing and managing models |
| Cloud models | Configuring Claude, Gemini, OpenCode Zen, OpenAI |
| Hardware | GPU setup (Arc, NVIDIA, AMD, CPU) |
| Shepherd service | Observability layer — architecture + endpoints + probe contribution |
| Troubleshooting | Common issues and fixes |

Contributing

Issues and PRs welcome. See CONTRIBUTING.md for guidelines and SECURITY.md for responsible disclosure.

Especially welcome: hardware-probe implementations for non-NVIDIA accelerators. The Shepherd probe-adapter interface is the extension surface — one file per hardware family. See shepherd/README.md for the contribution pattern.


License

MIT — use freely, build freely.

About

A project to configure and install LLM resources (specifically for i9Ultra with XE iGPU) on Ubuntu
