Local mock server that speaks the same HTTP as OpenAI, Anthropic, and 8 other LLM providers.
Point your SDK at localhost, inject errors, and see if your retry logic actually works.
```shell
pip install llmock
llmock serve --error-rate 429=0.3
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="anything")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

# 70% of calls succeed. 30% raise RateLimitError.
# Your tenacity/backoff wrapper either handles it or it doesn't.
```

LLMock is built for failure-path work that tends to get skipped until it breaks in staging:
- Retry and backoff against realistic `429`, `503`, or any other `4xx`/`5xx` status
- Provider fallback chains where one upstream fails and the next one should take over
- CI suites that need deterministic HTTP responses instead of flaky live providers
- Batch pipelines that should exercise upload, polling, completion, and result download logic
- Framework integrations where you want the real SDK call path, not a monkeypatched client
You can mock `client.chat.completions.create` and return a fake object. That tests your business logic, and that's fine.
But it doesn't test the HTTP layer — retries, connection errors, status codes, Retry-After headers, provider-specific error payloads. If you use LangChain or LlamaIndex, the mock often gets bypassed entirely because the framework wraps the SDK call.
LLMock runs as a real server. Your SDK builds a real request, sends it over HTTP, and parses a real response. The error payloads match what OpenAI, Anthropic, or Gemini actually return. If your retry logic works against LLMock, it'll work against the real API.
- A local FastAPI server with OpenAI-compatible and provider-native routes
- A chaos layer for latency plus arbitrary `400`-`599` status injection
- A deterministic response generator for chat, embeddings, images, models, and batch payloads
- A lightweight way to test resilience locally, in demos, and in CI
- Not a proxy or traffic interceptor. You must point your client at LLMock explicitly.
- Not a streaming simulator. `stream=true` returns `501` so streaming callers fail fast instead of hanging.
- Not a model-quality emulator. The goal is transport realism and workflow realism, not semantic realism.
That boundary matters. It keeps the tool predictable and honest.
| Provider | LLMock base URL |
|---|---|
| OpenAI | http://127.0.0.1:8000/v1 |
| Anthropic | http://127.0.0.1:8000/anthropic |
| Mistral | http://127.0.0.1:8000/mistral/v1 |
| Cohere | http://127.0.0.1:8000/cohere/v2 |
| Google Gemini | http://127.0.0.1:8000/gemini/v1beta |
| Groq | http://127.0.0.1:8000/groq/openai/v1 |
| Together AI | http://127.0.0.1:8000/together/v1 |
| Perplexity | http://127.0.0.1:8000/perplexity/v1 |
| AI21 | http://127.0.0.1:8000/ai21/v1 |
| xAI (Grok) | http://127.0.0.1:8000/xai/v1 |
```shell
# Terminal 1
llmock serve --error-rate 429=0.3

# Terminal 2
pip install openai tenacity
python examples/retry_with_openai.py
```

You should see successful responses mixed with retry logging, all without calling a real provider.
```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "x-llmock-force-status: 503" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"test"}]}'
```

That header is useful when you want a test to hit a very specific branch immediately instead of waiting for probabilistic chaos.
```shell
llmock serve
```

Then use the OpenAI-style file and batch endpoints:

- `POST /v1/files`
- `POST /v1/batches`
- `GET /v1/batches/{id}`
- `GET /v1/files/{id}/content`
LLMock keeps the state in memory and moves batches through a realistic async lifecycle so your polling and result handling code gets exercised too.
| | Real API (gpt-4o) | LLMock |
|---|---|---|
| Cost per request | ~$0.001 | $0 |
| CI suite (3k calls/day) | ~$74/month | $0/month |
| Trigger a 429 on demand | No | `--error-rate 429=1.0` |
| Deterministic test output | No | Yes |
Under load: 5,000 requests at 200 concurrent connections with zero failures, which is more than enough for local testing and CI.
```shell
pipx install llmock   # recommended: keeps it isolated
pip install llmock    # also works
```

`llmock serve` starts on `127.0.0.1:8000` by default. Use `--host`, `--port`, env vars, or a config file to change that. Config precedence: CLI flags > env vars > config file > defaults.
```python
import anthropic

client = anthropic.Anthropic(
    base_url="http://127.0.0.1:8000/anthropic",
    api_key="mock-key",
)
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
```

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="mock-key",
    model="gpt-4o",
)
print(llm.invoke("Hello!").content)
```

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="http://127.0.0.1:8000/v1",
    api_key="mock-key",
    model="gpt-4o",
)
print(llm.complete("Hello!").text)
```

For complete scripts with retries, fallbacks, agents, and model comparisons, see `examples/README.md`.
Inject latency and HTTP failures to stress-test resilience logic:
```shell
llmock serve \
  --latency-ms 200 \
  --error-rate 429=0.25 \
  --error-rate 500=0.10 \
  --error-rate 503=0.10
```

Environment variables work too:

```shell
LLMOCK_LATENCY_MS=200 \
LLMOCK_ERROR_RATE_429=0.25 \
LLMOCK_ERROR_RATE_500=0.10 \
llmock serve
```

| Env var | CLI flag | Default | Description |
|---|---|---|---|
| `LLMOCK_HOST` | `--host` | `127.0.0.1` | Bind address |
| `LLMOCK_PORT` | `--port` | `8000` | Bind port |
| `LLMOCK_LATENCY_MS` | `--latency-ms` | `0` | Fixed delay before responses |
| `LLMOCK_ERROR_RATE_<STATUS>` | `--error-rate STATUS=RATE` | `0.0` | Probability for any 4xx or 5xx status |
| `LLMOCK_RESPONSE_STYLE` | `--response-style` | `varied` | Mock content style |
Any HTTP status from 400 to 599 can have its own probability. The combined probability must stay <= 1.0.
The `/health` endpoint always bypasses chaos so monitoring and smoke checks remain stable.
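That makes `/health` a safe readiness probe for CI jobs that start the server in the background. A stdlib-only sketch of a wait-for-ready helper (the function name and polling cadence are illustrative; host and port are the defaults):

```python
import time
import urllib.request

def wait_for_llmock(url: str = "http://127.0.0.1:8000/health", timeout: float = 30.0) -> bool:
    """Poll the health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused or timed out: server not up yet, keep polling.
            pass
        time.sleep(0.5)
    return False
```

Because `/health` never gets chaos injected, this check stays reliable even when you run with `--error-rate 503=1.0`.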
```yaml
server:
  host: 0.0.0.0
  port: 9001
chaos:
  latency_ms: 200
  error_rates:
    429: 0.25
    500: 0.10
    503: 0.10
responses:
  style: echo
```

```shell
llmock serve --response-style hello
```

| Style | Behavior |
|---|---|
| `static` | Always returns the same deterministic sentence |
| `hello` | Returns a short greeting from the requested model |
| `echo` | Echoes part of the incoming prompt |
| `varied` | Picks a deterministic variation based on model and prompt |
| Provider | Base path | Key endpoint |
|---|---|---|
| OpenAI | `/v1` | `/v1/chat/completions` |
| Anthropic | `/anthropic` | `/anthropic/v1/messages` |
| Mistral | `/mistral/v1` | `/mistral/v1/chat/completions` |
| Cohere | `/cohere/v2` | `/cohere/v2/chat` |
| Google Gemini | `/gemini/v1beta` | `/gemini/v1beta/models/{model}:generateContent` |
| Groq | `/groq/openai/v1` | `/groq/openai/v1/chat/completions` |
| Together AI | `/together/v1` | `/together/v1/chat/completions` |
| Perplexity | `/perplexity/v1` | `/perplexity/v1/chat/completions` |
| AI21 | `/ai21/v1` | `/ai21/v1/chat/completions` |
| xAI (Grok) | `/xai/v1` | `/xai/v1/chat/completions` |
All provider routes go through the same chaos middleware, so retry and failure behavior stays consistent while payload shape stays provider-specific.
LLMock works well anywhere you can swap the provider URL:
- LangChain -> `examples/langchain_retry.py`
- LlamaIndex -> `examples/llamaindex_pipeline.py`
- CrewAI -> `examples/crewai_resilient_agents.py`
- Raw OpenAI SDK + tenacity -> `examples/retry_with_openai.py`
```shell
pytest
```

The test suite covers provider endpoints, chaos injection, error payload shapes, CLI configuration precedence, and batch behavior.
MIT