Local mock server that speaks the same HTTP as OpenAI, Anthropic, and 8 other LLM providers.
Point your SDK at localhost, inject errors, and see if your retry logic actually works.
```shell
pip install llmock
llmock serve --error-rate 429=0.3
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="anything")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

# 70% of calls succeed. 30% raise RateLimitError.
# Your tenacity/backoff wrapper either handles it or it doesn't.
```

LLMock is built for failure-path work that tends to get skipped until it breaks in staging:
- Retry and backoff against realistic `429`, `503`, or any other `4xx`/`5xx` status
- Provider fallback chains where one upstream fails and the next one should take over
- CI suites that need deterministic HTTP responses instead of flaky live providers
- Batch pipelines that should exercise upload, polling, completion, and result download logic
- Framework integrations where you want the real SDK call path, not a monkeypatched client
You can mock `client.chat.completions.create` and return a fake object. That tests your business logic, and that's fine.
But it doesn't test the HTTP layer — retries, connection errors, status codes, Retry-After headers, provider-specific error payloads. If you use LangChain or LlamaIndex, the mock often gets bypassed entirely because the framework wraps the SDK call.
LLMock runs as a real server. Your SDK builds a real request, sends it over HTTP, and parses a real response. The error payloads match what OpenAI, Anthropic, or Gemini actually return. If your retry logic works against LLMock, it'll work against the real API.
- A local FastAPI server with OpenAI-compatible and provider-native routes
- A chaos layer for latency plus arbitrary `400`-`599` status injection
- A deterministic response generator for chat, embeddings, images, models, and batch payloads
- A lightweight way to test resilience locally, in demos, and in CI
- Not a proxy or traffic interceptor. You must point your client at LLMock explicitly.
- Not a streaming simulator. `stream=true` returns `501` so streaming callers fail fast instead of hanging.
- Not a model-quality emulator. The goal is transport realism and workflow realism, not semantic realism.
That boundary matters. It keeps the tool predictable and honest.
| Provider | LLMock base URL |
|---|---|
| OpenAI | http://127.0.0.1:8000/v1 |
| Anthropic | http://127.0.0.1:8000/anthropic |
| Mistral | http://127.0.0.1:8000/mistral/v1 |
| Cohere | http://127.0.0.1:8000/cohere/v2 |
| Google Gemini | http://127.0.0.1:8000/gemini/v1beta |
| Groq | http://127.0.0.1:8000/groq/openai/v1 |
| Together AI | http://127.0.0.1:8000/together/v1 |
| Perplexity | http://127.0.0.1:8000/perplexity/v1 |
| AI21 | http://127.0.0.1:8000/ai21/v1 |
| xAI (Grok) | http://127.0.0.1:8000/xai/v1 |
```shell
# Terminal 1
llmock serve --error-rate 429=0.3

# Terminal 2
pip install openai tenacity
python examples/retry_with_openai.py
```

You should see successful responses mixed with retry logging, all without calling a real provider.
```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "x-llmock-force-status: 503" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"test"}]}'
```

That header is useful when you want a test to hit a very specific branch immediately instead of waiting for probabilistic chaos.
```shell
llmock serve
```

Then use the OpenAI-style file and batch endpoints:

- `POST /v1/files`
- `POST /v1/batches`
- `GET /v1/batches/{id}`
- `GET /v1/files/{id}/content`
LLMock keeps the state in memory and moves batches through a realistic async lifecycle so your polling and result handling code gets exercised too.
| | Real API (gpt-4o) | LLMock |
|---|---|---|
| Cost per request | ~$0.001 | $0 |
| CI suite (3k calls/day) | ~$74/month | $0/month |
| Trigger a 429 on demand | No | `--error-rate 429=1.0` |
| Deterministic test output | No | Yes |
Under load: 5,000 requests at 200 concurrent connections with zero failures, which is more than enough for local testing and CI.
```shell
pipx install llmock   # recommended: keeps it isolated
pip install llmock    # also works
```

`llmock serve` starts on `127.0.0.1:8000` by default. Use `--host`, `--port`, env vars, or a config file to change that. Config precedence: CLI flags > env vars > config file > defaults.
```python
import anthropic

client = anthropic.Anthropic(
    base_url="http://127.0.0.1:8000/anthropic",
    api_key="mock-key",
)
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
```

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="mock-key",
    model="gpt-4o",
)
print(llm.invoke("Hello!").content)
```

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="http://127.0.0.1:8000/v1",
    api_key="mock-key",
    model="gpt-4o",
)
print(llm.complete("Hello!").text)
```

For complete scripts with retries, fallbacks, agents, and model comparisons, see `examples/README.md`.
Inject latency and HTTP failures to stress-test resilience logic:
```shell
llmock serve \
  --latency-ms 200 \
  --error-rate 429=0.25 \
  --error-rate 500=0.10 \
  --error-rate 503=0.10
```

Environment variables work too:

```shell
LLMOCK_LATENCY_MS=200 \
LLMOCK_ERROR_RATE_429=0.25 \
LLMOCK_ERROR_RATE_500=0.10 \
llmock serve
```

| Env var | CLI flag | Default | Description |
|---|---|---|---|
| `LLMOCK_HOST` | `--host` | `127.0.0.1` | Bind address |
| `LLMOCK_PORT` | `--port` | `8000` | Bind port |
| `LLMOCK_LATENCY_MS` | `--latency-ms` | `0` | Fixed delay before responses |
| `LLMOCK_ERROR_RATE_<STATUS>` | `--error-rate STATUS=RATE` | `0.0` | Probability for any 4xx or 5xx status |
| `LLMOCK_RESPONSE_STYLE` | `--response-style` | `varied` | Mock content style |
Any HTTP status from 400 to 599 can have its own probability. The combined probability must stay <= 1.0.
The `/health` endpoint always bypasses chaos so monitoring and smoke checks remain stable.
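That makes `/health` a safe readiness probe for CI jobs that start the server in the background. A stdlib-only sketch of a wait-for-ready helper (the function name and polling cadence are illustrative; host and port are the defaults):

```python
import time
import urllib.request

def wait_for_llmock(url: str = "http://127.0.0.1:8000/health", timeout: float = 30.0) -> bool:
    """Poll the health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused or timed out: server not up yet, keep polling.
            pass
        time.sleep(0.5)
    return False
```

Because `/health` never gets chaos injected, this check stays reliable even when you run with `--error-rate 503=1.0`.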
```yaml
server:
  host: 0.0.0.0
  port: 9001
chaos:
  latency_ms: 200
  error_rates:
    429: 0.25
    500: 0.10
    503: 0.10
responses:
  style: echo
```

```shell
llmock serve --response-style hello
```

| Style | Behavior |
|---|---|
| `static` | Always returns the same deterministic sentence |
| `hello` | Returns a short greeting from the requested model |
| `echo` | Echoes part of the incoming prompt |
| `varied` | Picks a deterministic variation based on model and prompt |
| Provider | Base path | Key endpoint |
|---|---|---|
| OpenAI | `/v1` | `/v1/chat/completions` |
| Anthropic | `/anthropic` | `/anthropic/v1/messages` |
| Mistral | `/mistral/v1` | `/mistral/v1/chat/completions` |
| Cohere | `/cohere/v2` | `/cohere/v2/chat` |
| Google Gemini | `/gemini/v1beta` | `/gemini/v1beta/models/{model}:generateContent` |
| Groq | `/groq/openai/v1` | `/groq/openai/v1/chat/completions` |
| Together AI | `/together/v1` | `/together/v1/chat/completions` |
| Perplexity | `/perplexity/v1` | `/perplexity/v1/chat/completions` |
| AI21 | `/ai21/v1` | `/ai21/v1/chat/completions` |
| xAI (Grok) | `/xai/v1` | `/xai/v1/chat/completions` |
All provider routes go through the same chaos middleware, so retry and failure behavior stays consistent while payload shape stays provider-specific.
LLMock works well anywhere you can swap the provider URL:
- LangChain -> `examples/langchain_retry.py`
- LlamaIndex -> `examples/llamaindex_pipeline.py`
- CrewAI -> `examples/crewai_resilient_agents.py`
- Raw OpenAI SDK + tenacity -> `examples/retry_with_openai.py`
```shell
pytest
```

The test suite covers provider endpoints, chaos injection, error payload shapes, CLI configuration precedence, and batch behavior.
MIT