Description
Self-hosted v4.4.3: docker-provider never processes runs (PENDING indefinitely)
Environment
- trigger.dev version: v4.4.3 (webapp, coordinator, docker-provider all same version)
- Deployment: Docker Swarm via Portainer CE
- OS: Debian 12 Bookworm
- Docker: 29.3.0
- PostgreSQL: 16
- Redis: 7
- ElectricSQL: latest
- ClickHouse: 25.8 (external, shared)
Description
Runs triggered via the API are accepted and enqueued in Redis, but the docker-provider never processes them. Runs remain in PENDING status indefinitely. Development mode works (via npx trigger.dev dev), but execution of deployed tasks in the Production environment does not.
What works
- Webapp: running, healthy, API responds correctly
- Coordinator: connects via WebSocket, receives DYNAMIC_CONFIG ✓
- Docker-provider: connects via WebSocket, receives SERVER_READY + PRE_PULL_DEPLOYMENT ✓
- ElectricSQL: running, replication active ✓
- Deployments: DEPLOYED status in Production (ic3etnvx, 20260326.2, 2 tasks) ✓
- Dev mode: health-check task runs COMPLETED_SUCCESSFULLY in Development ✓
- Registry: docker login works, credentials configured ✓
- API trigger: returns run ID successfully ✓
What doesn't work
- Runs in Production stay PENDING forever
- Docker-provider shows zero activity after SERVER_READY (no pull, spawn, create, task, or run logs)
- Zero ephemeral task containers are ever created
- TaskRunAttempt table: 0 rows (no attempts ever recorded)
Detailed investigation
1. Redis queues have the messages
engine:runqueue:workerQueue:cmn40rrgz0005qu1rihgeecsx-default → 3 messages (list type)
engine:runqueue:{org:...}:message:cmn7oqa7100011rqiy79ocv3w
engine:runqueue:{org:...}:message:cmn7nzcrp00001rqi88hoff19
engine:runqueue:{org:...}:message:cmn7nhm0v00091robjnlq74fg
Messages are correctly enqueued but never dequeued.
2. SharedQueueConsumer reports no messages
The webapp logs show:
{"reasonStats":{"no_message_dequeued":10},"actionStats":{},"outcomeStats":{"noop":10}}
The consumers iterate but find nothing to dequeue, despite messages existing in the workerQueue.
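To make the symptom easy to check repeatedly, here is a minimal Python sketch that parses a stats line like the one above and flags a stall when every iteration was a no-op while the queue still holds messages. The function names `consumer_is_idle` and `is_stalled` are hypothetical helpers, not part of trigger.dev, and the queue length is assumed to be obtained separately (for example from Redis).

```python
import json


def consumer_is_idle(stats_line: str) -> bool:
    """Return True when every recorded iteration was a no-op that dequeued nothing."""
    stats = json.loads(stats_line)
    reasons = stats.get("reasonStats", {})
    outcomes = stats.get("outcomeStats", {})
    total = sum(outcomes.values())
    return (
        total > 0
        and outcomes.get("noop", 0) == total
        and reasons.get("no_message_dequeued", 0) == total
        and not stats.get("actionStats")
    )


def is_stalled(stats_line: str, queued_messages: int) -> bool:
    """A stall: messages are waiting, yet the consumer only reports no-ops."""
    return queued_messages > 0 and consumer_is_idle(stats_line)


if __name__ == "__main__":
    line = '{"reasonStats":{"no_message_dequeued":10},"actionStats":{},"outcomeStats":{"noop":10}}'
    # 3 is the workerQueue length reported in step 1.
    print(is_stalled(line, queued_messages=3))
```

With the log line from this report and the 3 queued messages from step 1, the check reports a stall.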
3. WorkerInstanceGroup was manually created
The WorkerInstanceGroup table was empty (0 rows). We manually created:
INSERT INTO "WorkerInstanceGroup" (id, type, name, "masterQueue", hidden, "tokenId", "organizationId", "projectId", ...)
VALUES ('...', 'MANAGED', 'default', '<projectId>-default', false, '<tokenId>', '<orgId>', '<projectId>', ...);
UPDATE "Project" SET "defaultWorkerGroupId" = '<groupId>' WHERE id = '<projectId>';
After this fix, the API stopped returning "No worker group found" and started accepting runs. But runs still don't execute.
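Because the row was inserted by hand, one thing worth ruling out is a mismatch between the masterQueue value and the queue name that actually appears in Redis. The sketch below assumes that the suffix after `engine:runqueue:workerQueue:` is the same string as masterQueue. The key listing in step 1 and the INSERT above suggest this, but nothing here confirms it. `worker_queue_name` and `master_queue_matches` are hypothetical helper names.

```python
from typing import Optional

# Prefix taken from the Redis key listed in step 1.
WORKER_QUEUE_PREFIX = "engine:runqueue:workerQueue:"


def worker_queue_name(redis_key: str) -> Optional[str]:
    """Return the queue name embedded in a workerQueue key, or None for other keys."""
    if not redis_key.startswith(WORKER_QUEUE_PREFIX):
        return None
    return redis_key[len(WORKER_QUEUE_PREFIX):]


def master_queue_matches(redis_key: str, master_queue: str) -> bool:
    """Assumption: the key suffix should equal WorkerInstanceGroup.masterQueue."""
    return worker_queue_name(redis_key) == master_queue


if __name__ == "__main__":
    key = "engine:runqueue:workerQueue:cmn40rrgz0005qu1rihgeecsx-default"
    # Replace the second argument with the masterQueue value actually stored in the table.
    print(master_queue_matches(key, "cmn40rrgz0005qu1rihgeecsx-default"))
```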
4. Environment vars added to provider/coordinator
Initially, docker-provider and coordinator were missing DATABASE_URL, REDIS_HOST, REDIS_PORT, REDIS_PASSWORD. We added them (matching the webapp's values). No change in behavior.
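To confirm that the values really do match the webapp's, the two environments can be compared mechanically. This is a minimal sketch that assumes each service's environment has been dumped to `KEY=value` text. `parse_env` and `env_mismatches` are hypothetical helpers, and the key list is the four variables named above.

```python
from typing import Dict, Iterable

# The four variables that were added to the provider and coordinator.
SHARED_KEYS = ("DATABASE_URL", "REDIS_HOST", "REDIS_PORT", "REDIS_PASSWORD")


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blanks and # comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def env_mismatches(
    reference: Dict[str, str],
    candidate: Dict[str, str],
    keys: Iterable[str] = SHARED_KEYS,
) -> Dict[str, str]:
    """Report keys that are missing from candidate or differ from reference."""
    problems: Dict[str, str] = {}
    for key in keys:
        if key not in candidate:
            problems[key] = "missing"
        elif candidate[key] != reference.get(key):
            problems[key] = "differs"
    return problems
```

An empty result for both the provider and the coordinator would rule out a simple copy error in these variables.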
5. Docker-provider logs (complete from startup)
new zod socket → ws://trigger-webapp:3030/provider
new zod socket → ws://trigger-webapp:3030/shared-queue
Initializing task operations
server listening on port 8809
connect (socket-provider) ✓
connect (socket-shared-queue) ✓
Incoming event SERVER_READY ✓
No checkpoint support: Please enable docker experimental features.
Simulation mode enabled. Containers will be paused, not checkpointed.
After this: complete silence. No dequeue, no pull, no spawn, no task activity.
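The "complete silence" claim can be checked against a saved log with a small script. This sketch scans the lines that follow SERVER_READY for words starting with pull, spawn, create, task, run, or dequeue. The keyword list comes from the description above and is not an official list of provider log messages. `activity_after_ready` is a hypothetical helper.

```python
import re
from typing import Iterable, List

# Words that would indicate run-related activity, matched at a word start.
RUN_ACTIVITY = re.compile(r"\b(pull|spawn|create|task|run|dequeue)\w*", re.IGNORECASE)


def activity_after_ready(lines: Iterable[str], ready_marker: str = "SERVER_READY") -> List[str]:
    """Return log lines after the ready marker that mention run-related activity."""
    seen_ready = False
    hits: List[str] = []
    for line in lines:
        if not seen_ready:
            # Everything up to and including the marker line is startup noise.
            seen_ready = ready_marker in line
            continue
        if RUN_ACTIVITY.search(line):
            hits.append(line)
    return hits


if __name__ == "__main__":
    provider_log = [
        "Initializing task operations",
        "server listening on port 8809",
        "connect (socket-provider)",
        "Incoming event SERVER_READY",
        "No checkpoint support: Please enable docker experimental features.",
        "Simulation mode enabled. Containers will be paused, not checkpointed.",
    ]
    print(activity_after_ready(provider_log))
```

Note that the function also returns an empty list when the marker never appears, so it is only meaningful on logs that include the SERVER_READY line.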
6. Coordinator logs (complete from startup)
Docker mode
connecting → ws://trigger-webapp:3030/coordinator
server listening on port 9020
connect (socket-coordinator) ✓
Incoming event DYNAMIC_CONFIG ✓
Handling DYNAMIC_CONFIG (version v1, checkpointThresholdInMs 30000)
No checkpoint support: Please enable docker experimental features.
Simulation mode enabled.
After this: only healthcheck /health requests. Zero run-related activity.
Questions
- Is the WorkerInstanceGroup supposed to be created automatically? In our self-hosted setup, both WorkerInstanceGroup and WorkerGroupToken tables were empty after initial deployment. The Regions page shows "Default worker instance group not found" with no option to create one.
- What triggers the docker-provider to dequeue and process runs? It receives SERVER_READY but never seems to poll or receive run assignments.
- Is the SharedQueueConsumer in the webapp supposed to read from engine:runqueue:workerQueue:* and forward to the provider? It reports no_message_dequeued despite messages existing in the queue.
- Is there a missing env var or configuration step for self-hosted Production execution that isn't in the template? The official docker-compose.yml and .env.example don't mention anything about worker groups.
Compose structure
Using the official template structure adapted for Docker Swarm:
- webapp (ghcr.io/triggerdotdev/trigger.dev:v4.4.3)
- postgres (16)
- redis (7)
- electric (latest)
- docker-provider (ghcr.io/triggerdotdev/provider/docker:v4.4.3)
- coordinator (ghcr.io/triggerdotdev/coordinator:v4.4.3)
All on the same overlay network. Communication between services verified (HTTP + WebSocket).
Environment variables (provider)
PLATFORM_HOST=trigger-webapp
PLATFORM_WS_PORT=3030
SECURE_CONNECTION=false
PLATFORM_SECRET=<set>
COORDINATOR_HOST=trigger-coordinator
COORDINATOR_PORT=9020
REGISTRY_HOST=registry.junior.pro
REGISTRY_NAMESPACE=dev/utmlab/trigger
REGISTRY_USERNAME=<set>
REGISTRY_PASSWORD=<set>
DATABASE_URL=postgresql://trigger:<pass>@trigger-postgres:5432/trigger
REDIS_HOST=trigger-redis
REDIS_PORT=6379
REDIS_PASSWORD=<set>
NODE_ENV=production
V3_ENABLED=true
RUNTIME_PLATFORM=docker-compose