Last update: 22 Sep 2025 • Owner: you • Scope: Backend + minimal UI hooks
NexusRust-MicroVM is a control plane and agent system that makes Firecracker microVMs feel as easy to operate as containers, while preserving VM isolation and boot speed. In 6 months you’ll ship a production-ready, multi-host manager with a clean API, basic UI, and strong operational guarantees (reconciliation, audit, metrics, snapshots, templates, and guarded passthrough to Firecracker).
North star: “Create, run, observe, and retire thousands of microVMs safely and repeatably.”
Business/UX
- <5 min time-to-first-VM on a fresh host (from agent install to VM running).
- <3 clicks / 1 API call to start/stop a VM from a saved template.
- Zero manual cleanup after crashes (agent/manager reconciliation cleans taps, sockets, scopes).
- P95 < 2s for CRUD calls (excluding image copies/snapshots).
Reliability
- Reconciler converges desired vs. actual state within 30s across hosts.
- SLO 99.5% manager API availability.
- <1% failed Firecracker calls not auto-retriable (e.g., permanent validation errors).
Security & governance
- Audit log coverage 100% for mutate ops.
- Role-based access control (RBAC) covering read, create, update, delete per project.
In-scope
- Multi-VM per host; multi-host fleet (N agents registering to 1 manager).
- Lifecycle: create → configure → start/stop/pause → delete.
- Storage: rootfs + additional drives (pre-boot), runtime rate-limit PATCH.
- Network: TAP on bridge, NAT via host; per-VM MAC auto; optional static IP via DHCP hook later.
- Snapshots (create/pause/resume/load) with local store paths.
- Templates (golden configs) and cloning.
- Metrics/logging wiring (Firecracker logger/metrics) + basic charts (API, later FE).
- Reconciler and GC (orphan scope/tap/socket detection).
- Audit log + RBAC + API tokens.
- Image registry abstraction (host paths allow-listed + named images table).
Out-of-scope (for this 6 months)
- Live migration; cross-host snapshot streaming.
- GPU/advanced devices, vsock guest-services.
- Full multi-tenant network overlays (we stick to bridge+NAT).
- Billing/cost engine (export raw metrics; no costing yet).
-
DevOps “Sarah”: Needs to run 50-500 ephemeral microVMs for jobs/tests.
- “Create 20 VMs from template
python-3.12and tear down in 2 hours.” - “See why a VM failed to start; get actionable error and logs.”
- “Create 20 VMs from template
-
Platform Eng “Mike”: Backend integration (CI, functions, worker fleet).
- “Programmatically clone template → run workload → snapshot or stop.”
- “Roll out a new kernel image across a pool with canarying.”
- Manager (Rust/Axum + Postgres/SQLx): Source of truth, REST API, reconciliation, audit, authz, orchestration to agents.
- Agent (Rust/Axum, root-capable): Host-local executor. Creates TAP/bridge, spawns Firecracker under transient
systemdscope, exposes native UDS → HTTP proxy to Firecracker (no socat), performs cleanup. - State: Postgres tracks VMs, hosts, templates, images, snapshots, audit events, API tokens.
- Traffic: Manager → Agent over HTTP(S). Agent → Firecracker via unix socket using
http+unix.
Below are the features you will finish in 6 months, grouped by capability. Each entry includes:
- What (user-visible behavior)
- Why (value)
- How (implementation outline)
- Acceptance criteria (tests you’ll run)
What
- Create a VM with: name, vCPU, mem MiB, kernel, rootfs, optional extra drives, network iface.
- Start/Stop/Pause/Delete VM; list and get details (state, host, socket path).
- Bulk: basic multi-select start/stop via API.
Why
- Core value: reliably spin up many isolated microVMs fast.
How
-
Manager creates a VM record (
requested→configured→running|stopped). -
Manager calls Agent:
POST /agent/v1/vms/:id/tap(ensure bridge; create TAP; idempotent).POST /agent/v1/vms/:id/spawn(systemd transient scopefc-<id>.scope, wait for UDS).PUTFCmachine-config,boot-source,drives,network-interfaces,logger,metricsvia Agent’s UDS proxy.PUTFCactions {InstanceStart}.
-
Stop: Agent stops scope + deletes TAP + removes UDS; Manager updates state.
-
Delete: Calls stop (best-effort) + removes DB row + log cleanup.
Acceptance
- Create VM returns
201withid, andGET /v1/vms/:idshowsrunningwithin 5s. - Stop sets state →
stoppingthenstopped(reconciler completes if agent crashes mid-stop). - Delete removes VM row; agent confirms TAP and UDS gone.
What
- Every VM has its own Firecracker socket (e.g.,
/srv/fc/vms/<id>/sock/fc.sock). - Manager calls Agent
/:id/proxy/*path?sock=<path>to relay HTTP to that socket.
Why
- Avoid per-VM TCP bridges; lower overhead and fewer moving parts.
How
- Agent uses an http+unix client to forward method/headers/body; strips
Host, sets timeouts. - Safety: Agent validates that
sockpath is within itsFC_RUN_DIRand under/srv/fc/vms/<id>/sock/.
Acceptance
- PUT to
/proxy/machine-configsucceeds with proper JSON schema, 4xx/5xx mapped through. - Invalid
sock(outside run dir) returns403.
What
- One bridge per host (default
fcbr0) with NAT via uplink; per-VM TAPtap-<uuid>. - Auto MAC generation; guest obtains L2 connectivity on the bridge.
Why
- Minimal, predictable networking for local workloads and simple clusters.
How
-
Agent:
ensure_bridge: create if missing,ip link set up,iptables -t nat MASQUERADE.create_tap: idempotent delete → create → attach to bridge →up.delete_tapon stop/delete.
-
Optional: later hook DHCP server (dnsmasq) for IP assignment (not required in 6 months).
Acceptance
- Bridge exists and is
UP; tap is visible and enslaved to bridge; teardown removes tap.
What
- Save a template (machine + boot + drives + NIC settings) with a human name.
- “Create from template” with minimal overrides (name, tags).
Why
- Reduces friction; standardization for teams.
How
- DB table
template {id, name, config_json, owner, created_at}. - Manager endpoint
POST /v1/templates,POST /v1/templates/:id/instantiate. - Manager hydrates template into VM create flow; persists link.
Acceptance
- Creating VM from template requires only
name+ optional tags. - Diff view shows overrides vs. base template (API returns both).
What
- Pause a running VM → create snapshot (state + drives per FC semantics) → resume or stop.
- Load snapshot into a new VM (not hot-swap in same process).
- Snapshot list per VM; metadata (size, paths, timestamp).
Why
- Fast warm-starts; stateful recovery; golden images for workloads.
How
-
Manager orchestrates:
PUT /actions {InstancePause}→PUT /snapshot/create(paths under/srv/fc/vms/<id>/snapshots/...) → optionalInstanceResume.- Load: create a fresh VM, configure boot/snapshot load according to FC API, then start.
-
Enforce disk path allow-list.
Acceptance
- Snapshot create returns metadata (size, files); load yields a running VM with same state.
- Errors surface from Firecracker with actionable
fault_message.
What
- Logger → file path; metrics → JSON file per VM.
- API to tail logs and fetch recent metrics; optional SSE stream.
Why
- Operators need quick visibility for failures/perf.
How
-
During create, Manager sets FC
/loggerand/metrics. -
API:
GET /v1/logs/tail?path=…(DEV); later SSE endpointGET /v1/logs/stream?id=….GET /v1/metrics/recent?id=…reads metrics JSON.
Acceptance
- Log lines appear as VM boots; metrics JSON updates; endpoints respond within 200ms for small files.
What
- Periodic loop compares DB desired state vs. actual host reality.
- Fixes drift: restarts missing scopes if desired
running, cleans orphan taps/sockets, updates DB when VMs die.
Why
- Resilience to crashes, deploys, and manual host changes.
How
-
Agent “introspect” endpoint: returns live scopes, taps, and sockets.
-
Manager cron (every 15s): for each VM in
running, verify:- Agent sees
fc-<id>.scopeactive and socket present; if not, try restart; otherwise markstopped. - Remove orphan resources under
/srv/fc/vms/*not present in DB (grace policy).
- Agent sees
-
Idempotent, rate-limited retries; backoff.
Acceptance
- Kill agent/manager or firecracker randomly; within 30s, state converges (no resource leaks).
What
- API tokens scoped to a user or service; roles:
viewer,operator,admin. - Permission checks per project (namespace), per resource type.
Why
- Safe multi-user operation.
How
- Manager issues tokens (PBKDF2/Argon hashed); stores in
api_tokens. - Middleware resolves token → user → roles → allows or denies.
- Projects: simple label on VMs/templates for scoping.
Acceptance
- Requests without token → 401.
- Viewer cannot mutate; Operator can mutate VMs in their project; Admin full access.
What
- Every mutate call logs: who, what, when, before/after summary, result (allow/deny).
Why
- Accountability & debugging.
How
- Manager writes audit rows on entry (pending) and on exit (success/fail), with request_id.
Acceptance
- Export audit for a VM shows a complete history with timestamps and caller identity.
What
- Named images (kernel/rootfs) with validated host paths; prevent arbitrary path injection.
Why
- Hygiene & security; reproducibility.
How
- Tables:
image {id, kind(kernel|rootfs), name, host_path, sha256(optional), size}. - VM create accepts
kernel_image_id/rootfs_image_idor raw path (disabled in prod).
Acceptance
- VM create fails if image id not allowed for caller’s project; succeeds if valid.
All JSON; examples abbreviated.
POST /v1/vms→{id}body:{ name, vcpu, mem_mib, kernel_image_id|kernel_path, rootfs_image_id|rootfs_path, extra_drives[], nic? }GET /v1/vms→{ items: [...] }GET /v1/vms/:id→{ item: {...} }POST /v1/vms/:id/stop→{ ok: true }DELETE /v1/vms/:id→{ ok: true }
Snapshots
POST /v1/vms/:id/snapshots(pause→create) →{ snapshot_id, size, files }POST /v1/snapshots/:sid/instantiate→{ id }
Templates
POST /v1/templates/GET /v1/templates/POST /v1/templates/:id/instantiate
Images
POST /v1/images/GET /v1/images(kind=kernel|rootfs)
Logs & metrics
GET /v1/logs/tail?path=…(dev)GET /v1/metrics/recent?id=…(dev)- (later)
GET /v1/logs/stream?id=…(SSE)
Admin/ops
GET /v1/hosts(agents and their capacities)POST /v1/tokens(admin only)GET /v1/audit?vm_id=…
Error contract
{ error, fault_message?, status, suggestion?, request_id } for non-2xx.
- vm:
id, name, state, host_id, host_addr, api_sock, tap, log_path, http_port, fc_unit, template_id?, created_at, updated_at - vm_event:
id, vm_id, at, level, message - template:
id, name, owner, config_json, created_at - snapshot:
id, vm_id, meta_json, path, size, created_at - image:
id, kind, name, host_path, sha256, size, created_at - host:
id, name, addr, capacity_json, last_seen_at - audit:
id, actor, action, resource, resource_id, request_id, at, before_json?, after_json?, result - api_token:
id, owner, hashed, role, project, created_at, last_used_at
- Validate all host paths against an allow-list root (e.g.,
/srv/fc/images,/srv/fc/vms). - Agent validates
?sock=to live under its run dir and match VM id prefix. - CORS: exact allow-list in production.
- Tokens: bearer in
Authorization; rotation and revocation endpoints. - Least-privilege
sudoersfor agent:systemd-run,systemctl stop,ip,iptables,sysctl.
-
Structured JSON logs (manager & agent) with
request_id, latency, upstream status. -
Prometheus metrics (manager): request counts, durations, reconciler actions, errors.
-
Health endpoints:
- Manager:
/healthz(liveness),/readyz(DB + at least one healthy agent) - Agent:
/agent/v1/health,/agent/v1/capacity
- Manager:
- Manager p95 handler < 200 ms (excluding Firecracker operations).
- Reconciler: bounded parallelism per host; backoff on failures.
- Tested to 1,000 VMs across 10 agents (100 each) with basic churn (create/start/stop/delete).
- Kernel/rootfs drift → images abstraction + immutability via checksums.
- iptables vs nftables → detect and support nftables mode or document requirement.
- Agent privilege → constrained sudoers + path allow-lists + audit.
- API misuse → façade only forwards allow-listed Firecracker endpoints.
M1 (Weeks 1-3): A1–A3 foundation
- Single-host: multi-VM lifecycle, UDS proxy, logs/metrics wiring, Postgres schema, CLI/API.
- Demo: create→start→stop→delete; log tail; recovery after manager restart.
M2 (Weeks 4-6): Reconciler & GC + Host registry
- Agent inventory endpoints; manager reconciler loop; orphan cleanup.
- Demo: kill firecracker; reconciler heals or marks stopped; no leaks.
M3 (Weeks 7-9): Templates & Images
- Template CRUD; image registry (kernel/rootfs) with allow-list paths.
- Demo: create from template in 1 request; prevent raw arbitrary paths in prod.
M4 (Weeks 10-12): Snapshots (local)
- Pause/create/load; snapshot catalog.
- Demo: capture snapshot, instantiate new VM from snapshot.
M5 (Weeks 13-16): AuthN/RBAC + Audit
- API tokens, roles (viewer/operator/admin), project scoping; full audit trail.
- Demo: different tokens show/limit operations; audit export for a VM.
M6 (Weeks 17-20): Multi-host scale & perf hardening
- Register multiple agents; host selection strategy (round-robin/resources); perf tests.
- Demo: 10 agents, 200 VMs; reconciler stable; metrics dashboards.
M7 (Weeks 21-24): Polishing & Ops
- Rate limits/quotas, SSE logs, better errors, docs, upgrade notes; optional FE tidy pages.
- Release candidate.
- Create/start/stop/delete OK across 10 hosts, 1k VMs test load.
- Reconciler converges; no orphan taps/sockets after fault injection.
- Snapshot create/load OK; templates OK; images guard paths OK.
- RBAC enforced; all mutating calls audited.
- Health/readiness endpoints green; metrics exported.
- Docs: operator runbook (bridge setup, nftables note), API reference.
# DB
./scripts/dev-up.sh
# Host network
sudo ./scripts/fc-bridge-setup.sh fcbr0 <uplink>
# Agent
AGENT_BIND=127.0.0.1:9090 FC_RUN_DIR=/srv/fc FC_BRIDGE=fcbr0 \
(cd apps/agent && cargo run)
# Manager
DATABASE_URL=postgres://nexus:nexus@localhost:5432/nexus \
AGENT_BASE=http://127.0.0.1:9090 \
(cd apps/manager && sqlx migrate run && cargo run)
This plan lets you “use what Firecracker already has +++”: we orchestrate its official API safely (pre-boot PUTs, runtime PATCH where allowed), add reconcilers, templates, snapshots, and the control-plane concerns Firecracker intentionally leaves to you. If you want, I can translate this PRD into GitHub epics/issues next so you can sprint immediately.
PM Plan:
Below is the same project-management phase/board I outlined earlier, cleaned up and ready to drop into your tracker (GitHub Projects/Jira/Linear). It maps exactly to the PRD and the feature slices we designed.
Workstreams
- A. Agent (privileged host service)
- B. Manager (control plane API + DB)
- C. Networking (bridge/TAP/NAT)
- D. Orchestration (reconciler/GC, multi-host)
- E. Images & Templates
- F. Snapshots
- G. Security & Governance (AuthN/RBAC, Audit)
- H. Observability & Ops (metrics, logs/SSE, health, docs)
Cadence
- 12 × 2-week sprints (Weeks 1–24)
- Demo + retro every sprint; release candidate in Weeks 21–24
- A1 Agent minimal: systemd scope spawn, UDS proxy, health/capacity
- A2 Manager minimal: create→configure→start, list/get/stop/delete, Postgres migrations
- A3 Logs & metrics wiring: FC /logger and /metrics, dev log tail endpoint
Exit criteria
- Create→start→stop→delete works on one host
- No socat; per-VM unix sockets
- Operator quickstart doc
- D1 Agent inventory endpoint (live scopes/taps/sockets)
- D2 Manager reconciler loop (every 15s): heal or mark stopped; GC orphans
- B3 Host registry: register agent, heartbeat, basic host selection (single host OK)
Exit criteria
- Kill firecracker or agent; system converges in ≤30s
- No resource leaks after chaos tests
- E1 Templates CRUD; instantiate
- E2 Image registry (kernel/rootfs), allow-listed host paths; PRD passthrough disabled in prod
Exit criteria
- Create VM from template with 1 request
- Unsafe raw paths blocked unless dev mode
- F1 Pause→snapshot→resume
- F2 Load snapshot into new VM; snapshot catalog
Exit criteria
- Snapshot create/load works locally, with metadata surfaced via API
- G1 API tokens, roles (
viewer,operator,admin) - G2 Project scoping; policy checks on mutate ops
- G3 Audit log for every mutate (who/what/when/before/after/result)
Exit criteria
- Unauthorized actions blocked; audit trail complete for one VM lifecycle
- D3 Multi-agent scheduling: round-robin or capacity-aware
- H1 Prometheus metrics exposed; perf tests; quotas/rate-limits
Exit criteria
- 10 agents / 200 VMs churn test stable
- Manager p95 < 200 ms (excluding FC work)
- H2 SSE logs streaming; better errors; docs & runbooks
- B4 Upgrade notes & migration scripts; RC hardening
Exit criteria
- Release candidate tagged; operator docs complete
| Sprint | Focus | Key Issues (abbrev) |
|---|---|---|
| 1 | Agent & Manager skeletons | A1-001 agent scaffold, A2-001 manager scaffold, C-001 bridge script |
| 2 | Create/Start/Stop + UDS proxy | A1-010 systemd scope spawn, A1-020 UDS proxy, A2-020 orchestration |
| 3 | Logs & metrics wiring | A3-010 logger, A3-020 metrics, B-010 migrations harden |
| 4 | Host registry | D-010 agent register/heartbeat, B-020 host table |
| 5 | Reconciler/GC | D-020 reconciler, D-030 orphan GC, chaos tests |
| 6 | Reliability fixes | D-040 retries/backoff, A-040 startup reaping, docs v1 |
| 7 | Templates | E1-010 CRUD, E1-020 instantiate |
| 8 | Images | E2-010 registry, E2-020 path policy, E2-030 prod toggle |
| 9 | UX & perf | E-polish errors, B-060 perf pass (DB indexes) |
| 10 | Snapshots create | F1-010 pause/create/resume, F1-020 metadata |
| 11 | Snapshots load | F2-010 load new VM, F2-020 catalog |
| 12 | Snapshots cleanup | F-polish GC, F-tests |
| 13 | Auth tokens | G1-010 token CRUD, middleware |
| 14 | RBAC | G2-010 roles, G2-020 project scope |
| 15 | Audit | G3-010 audit writer, G3-020 export |
| 16 | Security pass | G-polish threat model, path allow-lists |
| 17 | Multi-host schedule | D3-010 selection policy |
| 18 | Metrics & perf | H1-010 Prom metrics, load tests |
| 19 | Quotas/limits | H1-020 rate limits, guardrails |
| 20 | Stabilization | H-bugfixes, scale test 10 agents |
| 21 | SSE logs | H2-010 log streaming |
| 22 | Errors/Docs | H-docs runbooks, error mapping |
| 23 | RC hardening | test matrix, DR drills |
| 24 | Release | tag RC, handover |
Example: A1-020 — Agent: http+unix UDS proxy
-
Type: Feature
-
Desc: Forward HTTP to Firecracker unix socket without socat
-
Acceptance:
PUT /proxy/machine-config?sock=<vm sock>returns FC status- Reject paths outside
FC_RUN_DIR(403) - Timeouts mapped to 504
-
Tasks:
- Build client; header filtering; path validation; tests
-
Risk: Path traversal → mitigate with canonicalize + prefix check
Example: D-020 — Reconciler
-
Type: Feature
-
Acceptance:
- After SIGKILL firecracker, state converges ≤30s
- No orphan TAP/UDS after GC
-
Tasks: Agent inventory; loop; retries/backoff; metrics
- Kernel/rootfs availability → blocks M1 demos
- Agent sudoers (systemd, ip, iptables, sysctl) → blocks A1 spawn & C networking
- DB schema → blocks Manager features (A2+)
- Agent inventory → blocks reconciler (M2)
- API behavior covered by integration tests
- Structured logs include
request_id, upstream status, duration - Error contract
{ error, fault_message?, status, suggestion?, request_id } - Docs updated (runbook, API)
- Security checks: path allow-lists; auth (once G done)
- Observability: Prom metrics for major operations
- Lifecycle: create→start→stop→delete; repeat 100× (leak check)
- Reconciler: kill agent, kill firecracker, delete tap/uds manually; system recovers
- Snapshots: stateful app resumes correctly (e.g., in-guest counter)
- RBAC: viewer blocked on mutate; operator restricted to project
- Scale: 10 agents × 100 VMs churn; manager p95 < 200 ms (excluding FC)
- Security: path traversal attempts on
?sock=and images; blocked
- Owner/PM/Tech lead: you (R/A)
- Agent: you or agent sub-lead (R)
- Manager/API/DB: you or manager sub-lead (R)
- Security: reviewer pass per milestone (C)
- Ops: infra scripts & docs (R)
- QA: integration + chaos tests (R)
- Burndown per sprint
- Build health: CI green, test counts, coverage (core paths)
- Perf dashboards: request latency histograms, reconciler actions/min
- Error budget: non-actionable 5xx < 0.5%/sprint
- M1: One-host happy path; operator quickstart
- M2: Convergence under faults; no orphans
- M3: Template/Images guardrails on; prod mode blocks raw paths
- M4: Snapshot create/load; catalog API
- M5: Tokens + RBAC + full audit; security review
- M6: Multi-host load test passes; metrics exported
- M7: SSE logs, docs/runbooks, RC tag
- Create project board with epics M1–M7 + initial issues (Sprints 1–2)
- Lock sudoers for dev; verify
firecrackerin PATH - Prepare two real images (kernel/rootfs) for M1 demo
- Wire CI (fmt, clippy, test) +
sqlxoffline schema cache (optional) - Start Sprint 1: agent/manager skeletons + bridge script
If you want, I can turn this into a ready-to-import GitHub Projects CSV or Jira epics with all the issue titles and acceptance criteria prefilled.