A local REST API that serves a fine-tuned Phi-2 model behind FastAPI and MLX. Supports both synchronous and streaming generation.
Stack:

- M4 MacBook Pro (Apple Silicon, arm64)
- Python 3.11
- FastAPI + uvicorn
- MLX (`mlx-lm`) for model loading and inference
- Pydantic for request/response validation
- Base model: `microsoft/phi-2`
Model weights:

- Phi-2 weights are cached locally via Hugging Face (`microsoft/phi-2`). The first `mlx_lm.load(...)` call will fetch them if missing.
- LoRA adapter trained in the companion `ml-experiment` project, located at `~/ml-experiment/experiments/2026-05-04-0014`. If the path is missing, the API falls back to the base model (see the sketch below).
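The fallback logic is small enough to sketch. A minimal version, assuming `mlx_lm.load(...)`'s `adapter_path` keyword; the `load_model` helper name and structure are illustrative, not taken from `app/model.py`:

```python
# Hypothetical sketch of the loader in app/model.py.
from pathlib import Path

from mlx_lm import load

ADAPTER_DIR = Path("~/ml-experiment/experiments/2026-05-04-0014").expanduser()

def load_model():
    # Use the LoRA adapter when present; otherwise fall back to base Phi-2.
    adapter = str(ADAPTER_DIR) if ADAPTER_DIR.exists() else None
    model, tokenizer = load("microsoft/phi-2", adapter_path=adapter)
    return model, tokenizer
```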
Setup:

```bash
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
```

This starts uvicorn on `0.0.0.0:8000` with `--reload`. The model is loaded once at startup via FastAPI's lifespan event and held in memory for the life of the process.
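The lifespan pattern looks roughly like this (a sketch, not the exact contents of `app/main.py`; `load_model` is the hypothetical helper from above):

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.model import load_model  # assumed import path

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once before the first request; the model stays resident
    # in memory until the process exits.
    app.state.model, app.state.tokenizer = load_model()
    yield

app = FastAPI(lifespan=lifespan)
```

Keeping the model on `app.state` avoids a per-request load, which would otherwise dominate latency.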
Build and run with Docker Compose:
```bash
docker compose up --build
```

The Hugging Face cache and LoRA adapter are mounted from your host so weights aren't re-downloaded inside the container.

To run detached:

```bash
docker compose up -d --build
```

To stop:

```bash
docker compose down
```

Note: this image targets arm64 (Apple Silicon) and uses MLX. For Linux deployment, use `docker-compose.linux.yml` instead.

Deployment targets:
| Target | Runtime | Dockerfile | Compose file |
|---|---|---|---|
| macOS (Apple Silicon) | MLX | Dockerfile | docker-compose.yml |
| Linux (arm64/x86_64) | PyTorch CPU | Dockerfile.linux | docker-compose.linux.yml |
To build and run the Linux target:

```bash
docker compose -f docker-compose.linux.yml up --build
```

MLX requires macOS and the Metal GPU framework, so it cannot run inside a Linux container. The Linux target swaps MLX for PyTorch CPU inference, which runs on any platform but is slower than MLX on Apple Silicon. The public API surface is identical between the two targets.

Performance comparison:
| Target | Runtime | Inference Time |
|---|---|---|
| macOS (Apple Silicon) | MLX | ~0.02s |
| Linux (PyTorch CPU) | PyTorch CPU | ~23.5s |
MLX runs on the M4's GPU via Metal. The Linux/PyTorch target runs on CPU only; the gap closes significantly with a CUDA GPU.
API endpoints:

| Method | Path | Description |
|---|---|---|
| GET | `/health` | Liveness probe + model-loaded flag |
| GET | `/model/info` | Model name, adapter path, load time in seconds |
| POST | `/generate` | Synchronous completion for a prompt |
| POST | `/generate/stream` | Streamed tokens, newline-separated |

A 503 is returned when the model isn't loaded; a 422 for invalid input.
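The request body for the two POST endpoints carries a prompt and a token budget. A sketch of the Pydantic models (only `prompt`, `max_tokens`, and `tokens_used` are confirmed by this README; the other field names and the default value are assumptions, and the real definitions live in `app/schemas.py`):

```python
from pydantic import BaseModel, Field

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = Field(default=64, ge=1)  # default is an assumption

class GenerateResponse(BaseModel):
    text: str          # assumed field name for the completion
    tokens_used: int   # approximate count; see Limitations below
```

Example requests: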
```bash
# Health
curl http://localhost:8000/health

# Model info
curl http://localhost:8000/model/info

# Generate
curl -X POST http://localhost:8000/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello, my name is", "max_tokens": 64}'

# Stream
curl -N -X POST http://localhost:8000/generate/stream \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello, my name is", "max_tokens": 64}'
```

Performance notes:

- Synchronous `/generate`: ~7–9 tokens/sec on M4. Total latency scales linearly with `max_tokens`.
- Streaming `/generate/stream`: returns immediately; tokens flow asynchronously as the model produces them. Time-to-first-token is a better latency metric here than total elapsed time.
- `/health` and `/model/info`: sub-2 ms; no inference is performed.
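The streaming endpoint plausibly pairs FastAPI's `StreamingResponse` with `mlx_lm.stream_generate`. A sketch under those assumptions (the chunk interface of `stream_generate` has varied across `mlx-lm` versions; a `.text` field is assumed here, and the real code is in `app/main.py`):

```python
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from mlx_lm import stream_generate
from pydantic import BaseModel

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 64

app = FastAPI()

@app.post("/generate/stream")
async def generate_stream(req: GenerateRequest, request: Request):
    model = request.app.state.model          # loaded in the lifespan handler
    tokenizer = request.app.state.tokenizer

    def token_lines():
        # Yield each generated chunk on its own line, matching the
        # newline-separated format the endpoint documents.
        for chunk in stream_generate(model, tokenizer, req.prompt,
                                     max_tokens=req.max_tokens):
            yield chunk.text + "\n"

    return StreamingResponse(token_lines(), media_type="text/plain")
```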
Every request is logged to `logs/requests.log` via FastAPI middleware. Each line records:

- timestamp
- endpoint path
- prompt, truncated to 100 characters (only for POSTs to `/generate*`)
- response time in milliseconds
- HTTP status code

The `logs/` directory is created on startup if it doesn't exist and is git-ignored.
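A minimal version of that middleware, for orientation (illustrative only; the real implementation in `app/main.py` also captures the truncated prompt for `/generate*` POSTs, which is omitted here because reading the request body inside middleware needs extra care):

```python
import logging
import time

from fastapi import FastAPI, Request

# Assumes logs/ already exists; the app creates it at startup.
logging.basicConfig(filename="logs/requests.log",
                    format="%(asctime)s %(message)s", level=logging.INFO)
app = FastAPI()

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # The timestamp comes from the logging formatter above.
    logging.info("%s %s %d %.1fms", request.method, request.url.path,
                 response.status_code, elapsed_ms)
    return response
```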
Known limitations:

- The base Phi-2 model is not instruction-tuned and tends to hallucinate URLs and loop on open-ended prompts. The LoRA adapter helps for in-domain prompts only.
- No authentication. Intended for local use; do not expose to the network without adding auth.
- `tokens_used` in the response is approximated by whitespace-splitting the output, not by re-tokenizing. It is indicative, not exact.
Next steps:

- Add API key authentication (e.g. a FastAPI dependency that checks an `Authorization` header; see the sketch after this list).
- Add rate limiting (e.g. `slowapi`).
- Add HTTPS via an nginx reverse proxy in front of uvicorn.
- Swap Phi-2 for an instruction-tuned model (e.g. Phi-3-mini-instruct, Llama-3-Instruct) to remove the looping/hallucination issues.
- Deploy with Docker; pin Python and MLX versions in the image.
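The auth dependency from the first bullet could look like this (a sketch; the bearer scheme and `EXPECTED_KEY` handling are assumptions):

```python
from fastapi import Depends, FastAPI, Header, HTTPException

EXPECTED_KEY = "change-me"  # hypothetical; load from an env var in practice

async def require_api_key(authorization: str = Header(default="")):
    # Reject any request that doesn't carry the expected bearer token.
    if authorization != f"Bearer {EXPECTED_KEY}":
        raise HTTPException(status_code=401, detail="invalid API key")

# Applying it app-wide guards every endpoint, including /health.
app = FastAPI(dependencies=[Depends(require_api_key)])
```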
Project layout:

```
phi2-serving/
├── app/
│   ├── __init__.py
│   ├── schemas.py            # Pydantic request/response models
│   ├── model.py              # MLX model singleton + generate/stream
│   ├── main.py               # FastAPI app, endpoints, middleware logging
│   ├── model_torch.py        # PyTorch backend for Linux
│   └── main_torch.py         # FastAPI app using PyTorch backend
├── logs/
│   └── requests.log          # auto-generated request log (gitignored)
├── tests/
│   └── test_api.py           # httpx-based smoke tests
├── Dockerfile                # macOS/MLX image
├── Dockerfile.linux          # Linux/PyTorch image
├── docker-compose.yml
├── docker-compose.linux.yml
├── run.sh                    # uvicorn launcher
├── requirements.txt
├── requirements-linux.txt
├── README.md
└── CLAUDE.md                 # architecture decisions & extension notes
```