
perf: Job acquisition performance — upgrade get_job, fix queue_name default, add partial indexes #60

@devin-ai-integration

Description

Problem

The @pgpm/jobs and @pgpm/database-jobs packages (v0.15.5) have three compounding performance issues in the job acquisition path that, together, produce the worst-case scenario documented in upstream Graphile Worker issue #240. Under queue contention, job acquisition can be 20x–50x slower than it needs to be.

1. queue_name defaults to gen_random_uuid()::text instead of NULL

Both jobs and scheduled_jobs tables define:

queue_name text DEFAULT (public.gen_random_uuid())::text,

Upstream Graphile Worker made queue_name nullable with no default in v0.4.0 (migration 000003), describing it as a "significant performance enhancement" because it "changes database schema such that a job_queue record is only added/checked when necessary."

With the random UUID default, every job creates a unique queue — meaning:

  • The increase_job_queue_count trigger fires on every insert, creating a job_queues row per job
  • get_job must evaluate the expensive EXISTS ... FOR UPDATE SKIP LOCKED subquery on job_queues for every candidate row
  • The job_queues table grows unboundedly with unique entries
  • Queue bookkeeping runs at maximum cost for zero benefit

Note: The queue-count triggers already have WHEN (NEW.queue_name IS NOT NULL) guards — so the fix is just changing the default.
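For reference, the guard looks roughly like this (a sketch — the exact trigger definition in the packages may differ, but the `increase_job_queue_count` function name comes from the bullet above):

```sql
-- With the WHEN guard in place, a NULL queue_name (the proposed default)
-- skips the trigger entirely, so no job_queues row is created or counted.
CREATE TRIGGER _increase_job_queue_count
  AFTER INSERT ON app_jobs.jobs
  FOR EACH ROW
  WHEN (NEW.queue_name IS NOT NULL)
  EXECUTE FUNCTION app_jobs.increase_job_queue_count();
```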

2. get_job uses the old row-by-row queue checking pattern (Strategy 1)

The current get_job checks queue eligibility per candidate job row:

AND (jobs.queue_name IS NULL
  OR EXISTS (
    SELECT 1 FROM app_jobs.job_queues
    WHERE job_queues.queue_name = jobs.queue_name
      AND (job_queues.locked_at IS NULL
        OR job_queues.locked_at < (v_now - job_expiry))
    FOR UPDATE SKIP LOCKED
  ))

This is Strategy 1 (per-row EXISTS) in the upstream nomenclature. Upstream issue #240 documented this exact query going from 242ms to 13ms (18x) when replaced with Strategy 2 (pre-computed eligible queue set). The upstream benchmarks show:

| Strategy | Description | Jobs/sec |
| --- | --- | --- |
| 0 | Skip named queues entirely | 11,800 |
| 1 | Per-row EXISTS (current pgpm approach) | ~40 |
| 2 | Pre-lock queues upfront (recommended) | 843 |
| 3 | Exclude locked queues (unsafe) | 600 |

3. Legacy ordering index

The current index:

CREATE INDEX priority_run_at_id_idx ON app_jobs.jobs (priority, run_at, id);

Upstream v0.14 replaced this with partial covering indexes filtered on an is_available generated column:

CREATE INDEX jobs_main_index ON jobs USING btree (priority, run_at)
  INCLUDE (id, task_id, job_queue_id) WHERE (is_available = true);

The partial index excludes locked/exhausted jobs from the index entirely, making it smaller and faster to scan.
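One caveat worth noting: the planner only considers a partial index when the query's `WHERE` clause implies the index predicate, so the acquisition query must filter on `is_available = true` explicitly. A quick check (sketch, using the column and index names proposed in Phase 2 below):

```sql
-- Should be satisfiable by jobs_main_index because the WHERE clause
-- implies the index predicate (is_available = true).
EXPLAIN (COSTS OFF)
SELECT id
FROM app_jobs.jobs
WHERE is_available = true
ORDER BY priority, run_at
LIMIT 1;
```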


Solution

Phase 1: Quick Wins (High Impact, Low Risk)

1a. Fix queue_name default to NULL

In both @pgpm/jobs and @pgpm/database-jobs, change the jobs and scheduled_jobs table definitions:

-- Before:
queue_name text DEFAULT (public.gen_random_uuid())::text,

-- After:
queue_name text DEFAULT NULL,

Audit any downstream code that assumes queue_name is always non-null. The existing get_job, complete_job, fail_job, and triggers already handle NULL correctly.
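For existing deployments, the default can be dropped in place (a sketch; `scheduled_jobs` gets the same treatment):

```sql
-- Dropping the default is a catalog-only change: no table rewrite, and
-- existing rows keep their current queue_name values.
ALTER TABLE app_jobs.jobs ALTER COLUMN queue_name DROP DEFAULT;
ALTER TABLE app_jobs.scheduled_jobs ALTER COLUMN queue_name DROP DEFAULT;
```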

1b. Rewrite get_job with pre-computed queue set (Strategy 2)

Replace the row-by-row EXISTS with a single pre-computed subquery:

-- Key change in the WHERE clause:
AND (jobs.queue_name IS NULL
  OR jobs.queue_name IN (
    SELECT jq.queue_name
    FROM app_jobs.job_queues jq
    WHERE (jq.locked_at IS NULL
        OR jq.locked_at < (v_now - job_expiry))
    FOR UPDATE SKIP LOCKED
  ))

The subquery is evaluated once, and PostgreSQL can satisfy the IN with a hashed semi-join — eliminating the queue-lock check that previously ran for every candidate row.
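Putting it together, the rewritten function body could look roughly like this. This is a condensed sketch, not the final migration: the parameter names (`worker_id`, `task_identifiers`, `job_expiry`), the lock-bookkeeping columns, and the SELECT-then-UPDATE shape are assumptions based on the snippets above, and the multi-tenant `database_id` filtering is omitted.

```sql
CREATE OR REPLACE FUNCTION app_jobs.get_job(
  worker_id text,
  task_identifiers text[] DEFAULT NULL,
  job_expiry interval DEFAULT interval '4 hours'
) RETURNS app_jobs.jobs AS $$
DECLARE
  v_now timestamptz := now();
  v_row app_jobs.jobs;
BEGIN
  SELECT jobs.* INTO v_row
  FROM app_jobs.jobs
  WHERE jobs.locked_at IS NULL
    AND jobs.run_at <= v_now
    AND jobs.attempts < jobs.max_attempts
    AND (task_identifiers IS NULL
      OR jobs.task_identifier = ANY (task_identifiers))
    -- Strategy 2: the eligible-queue set is computed (and locked) once,
    -- instead of running an EXISTS per candidate row.
    AND (jobs.queue_name IS NULL
      OR jobs.queue_name IN (
        SELECT jq.queue_name
        FROM app_jobs.job_queues jq
        WHERE jq.locked_at IS NULL
           OR jq.locked_at < (v_now - job_expiry)
        FOR UPDATE SKIP LOCKED
      ))
  ORDER BY jobs.priority, jobs.run_at, jobs.id
  LIMIT 1
  FOR UPDATE SKIP LOCKED;

  IF v_row.id IS NULL THEN
    RETURN NULL;
  END IF;

  -- Claim the job, and its queue if it has one.
  UPDATE app_jobs.jobs
  SET locked_by = worker_id, locked_at = v_now, attempts = attempts + 1
  WHERE id = v_row.id
  RETURNING * INTO v_row;

  IF v_row.queue_name IS NOT NULL THEN
    UPDATE app_jobs.job_queues
    SET locked_by = worker_id, locked_at = v_now
    WHERE queue_name = v_row.queue_name;
  END IF;

  RETURN v_row;
END;
$$ LANGUAGE plpgsql;
```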

Phase 2: Index Improvements

2a. Add is_available generated column

ALTER TABLE app_jobs.jobs ADD COLUMN is_available boolean
  GENERATED ALWAYS AS ((locked_at IS NULL) AND (attempts < max_attempts)) STORED;

2b. Replace legacy index with partial covering indexes

CREATE INDEX CONCURRENTLY jobs_main_index
  ON app_jobs.jobs USING btree (priority, run_at)
  INCLUDE (id, queue_name) WHERE (is_available = true);

CREATE INDEX CONCURRENTLY jobs_no_queue_index
  ON app_jobs.jobs USING btree (priority, run_at)
  INCLUDE (id) WHERE (is_available = true AND queue_name IS NULL);

-- After benchmarking:
DROP INDEX priority_run_at_id_idx;
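Before the drop, it's worth comparing index sizes and confirming the new partial indexes are actually being scanned in production. A sketch of the check:

```sql
-- Partial indexes should be substantially smaller than the legacy index,
-- and idx_scan should climb for the new ones while the old one stays flat.
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size,
       idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'jobs';
```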

Phase 3: Modernization Features

  • Add revision integer DEFAULT 0 and flags jsonb columns
  • Add forbidden_flags parameter to get_job
  • Change notify trigger from FOR EACH ROW to FOR EACH STATEMENT
  • Add remove_job(job_key text) function
  • Add force_unlock_workers(worker_ids text[]) function
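As an illustration of the Phase 3 additions, `force_unlock_workers` could look roughly like this (a sketch; the lock-column names follow the snippets above):

```sql
-- Release all jobs and queues held by crashed or stale workers so they
-- become available immediately instead of waiting out job_expiry.
CREATE OR REPLACE FUNCTION app_jobs.force_unlock_workers(worker_ids text[])
RETURNS void AS $$
  UPDATE app_jobs.jobs
  SET locked_at = NULL, locked_by = NULL
  WHERE locked_by = ANY (worker_ids);

  UPDATE app_jobs.job_queues
  SET locked_at = NULL, locked_by = NULL
  WHERE locked_by = ANY (worker_ids);
$$ LANGUAGE sql;
```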

Phase 4: Optional — Full v0.14 Table Rebuild

  • tasks lookup table with integer FK (replaces text task_identifier on hot path)
  • job_queue_id integer FK (replaces text queue_name join)
  • smallint for priority/attempts/max_attempts

Phase 4 is optional — the performance gains are marginal compared to Phases 1–2, and the migration risk is higher (especially preserving the database_id multi-tenant design).


Expected Impact

| Change | Expected Improvement |
| --- | --- |
| Fix queue_name default | ~2–5x for add_job (eliminates queue bookkeeping) |
| Rewrite get_job (Strategy 2) | ~20x acquisition speedup under contention |
| Partial indexes with is_available | ~2–3x for get_job on large tables |
| Combined Phase 1 + 2 | Estimated 10–50x throughput improvement |
