Skip to content

Defer log_lines search_vector indexing off the insert path#4818

Open
stuartc wants to merge 3 commits into
mainfrom
defer-log-lines-search-vector
Open

Defer log_lines search_vector indexing off the insert path#4818
stuartc wants to merge 3 commits into
mainfrom
defer-log-lines-search-vector

Conversation

@stuartc
Copy link
Copy Markdown
Member

@stuartc stuartc commented May 30, 2026

Description

This changes how log_lines full-text search gets indexed — taking it off the
log-insert hot path to cut down the run:log channel timeouts we've been seeing
under heavy log volume.

The old AFTER INSERT trigger computed to_tsvector synchronously inside the
insert transaction (so it never actually deferred anything) and double-wrote
every row via a self-UPDATE. It also had no program_limit_exceeded guard, so
a single oversized message would abort the insert and roll back the whole batch
on the insert_all path.

Now inserts leave search_vector NULL and a background Oban worker
(Lightning.LogLines.SearchVectorWorker) backfills it out-of-band on a dedicated
search_indexing queue. There's a guarded safe_to_tsvector SQL function and a
partial index (WHERE search_vector IS NULL) so finding pending rows stays
cheap. Live log streaming is unaffected — it's push-based over PubSub and never
reads search_vector; only full-text log search now lags ingestion slightly,
typically under a minute.

Three migrations, ordered 1a→1b→1c: add safe_to_tsvector, add the partial index
(built per-partition CONCURRENTLY then attached to the parent), then drop the
trigger.

Closes #4425

Validation steps

  1. Run the migrations (mix ecto.migrate). Confirm the log_lines
    set_search_vector trigger is gone, safe_to_tsvector exists, and the
    partial index is VALID across the parent + all 100 partitions. The
    dataclips trigger should be untouched.
  2. Insert a log line — its search_vector is NULL and it isn't matched by log
    search yet.
  3. Run Lightning.LogLines.SearchVectorWorker — the row's search_vector is
    populated and now matches to_tsquery('english_nostop', …).
  4. Insert an oversized (>1MB) message — it drains to an empty vector rather than
    erroring, and doesn't get stuck retrying.
  5. mix test test/lightning/log_lines/search_vector_worker_test.exs test/lightning/runs_test.exs

Additional notes for the reviewer

  1. The worker drains newest-first and "snowballs" an immediate follow-up while
    there's backlog, falling back to a 1-minute cron heartbeat. Concurrency is 1
    on the dedicated queue (SKIP LOCKED makes bumping it safe later).
  2. Two subtle bits worth a close look: the snowball's Oban unique states are
    restricted to [:available, :scheduled] on purpose — the default includes
    :executing/:completed, which makes a running job dedup its own successor
    and kills the chain — and the index migration drops any INVALID leftover
    before re-creating, so a failed CONCURRENTLY build can be re-run.
  3. safe_to_tsvector is also meant as the template for the dataclip trigger fix
    (Dataclip insert times out building the search vector, which loses the run #4800), which is separate.

AI Usage

  • I have used Claude Code
  • I have used another model
  • I have not used AI

Pre-submission checklist

  • I have performed an AI review of my code
  • I have implemented and tested all related authorization policies.
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR

@github-project-automation github-project-automation Bot moved this to New Issues in Core May 30, 2026
stuartc added 3 commits May 30, 2026 20:31
The AFTER INSERT trigger on log_lines computed to_tsvector synchronously
inside the insert transaction (it never actually deferred anything) and
double-wrote every row via a self-UPDATE. It also lacked the
program_limit_exceeded guard dataclips got, so an oversized message
aborted the insert (rolling back the whole batch for insert_all).

Remove the trigger so inserts leave search_vector NULL, and backfill it
out-of-band via a new Oban worker (Lightning.LogLines.SearchVectorWorker)
on a dedicated search_indexing queue. Adds a guarded safe_to_tsvector SQL
function and a partial index (WHERE search_vector IS NULL) so draining
pending rows stays cheap. Covers both the single (run:log) and batch
(run:batch_logs) insert paths. Live log streaming is unaffected (push-based
via PubSub); only full-text log search lags slightly behind ingestion.
Follow-up hardening on the search_vector deferral, from a review pass:

- Fix the worker's snowball chain. Oban's default unique states include
  :executing and :completed, so a running snowball job matched itself when
  enqueueing its successor and the insert was silently deduped, breaking the
  chain after one hop. Restrict uniqueness to [:available, :scheduled].
- Make the pending-search index migration re-runnable: a failed CREATE INDEX
  CONCURRENTLY leaves an INVALID index that IF NOT EXISTS would skip and ATTACH
  would leave the parent invalid forever. Drop any invalid leftover first.
- Make safe_to_tsvector NULL-defensive and re-runnable (CREATE OR REPLACE,
  drop STRICT, COALESCE the doc) so it never returns NULL and leaves a row
  stuck in the pending set. Also the template for the dataclip fix.
- Add snowball uniqueness regression tests.
Trim hindsight/diff-narrating comments down to what's non-obvious, and
rewrite the SearchVectorWorker moduledoc to read as documentation of the
mechanism rather than a justification of the change.
@stuartc stuartc force-pushed the defer-log-lines-search-vector branch from f3f45af to cae40bc Compare May 30, 2026 18:32
@stuartc stuartc marked this pull request as ready for review May 30, 2026 18:32
@github-actions
Copy link
Copy Markdown

The single raw SQL call is parameterized with a module constant (@batch_size); migration interpolation uses part_num from a fixed 1..100 range. No user-controlled input flows into SQL.

Security Review ✅

  • S0 (project scoping): N/A — the new worker performs internal search_vector backfill on log_lines (no project-scoped reads/writes, no web entrypoint); the one Repo.query! (lib/lightning/log_lines/search_vector_worker.ex:82) is parameter-bound and the migration's only interpolation is a fixed 1..100 partition counter.
  • S1 (authorization): N/A — no new controller, LiveView event, or channel handler; the worker is only invoked by Oban cron/snowball self-enqueue.
  • S2 (audit trail): N/A — the only write updates a derived full-text index column on log_lines, not a config resource (workflows, credentials, settings, OAuth clients, etc.).

@taylordowns2000
Copy link
Copy Markdown
Member

@stuartc , fantastic find here. I noticed that the description seemed to reference inserting multiple log lines at once. Wanted to double check: batch log lines insert is turned off on app.openfn.org, right?

I think we tried it a few months back and when the crashes started occurring we switched it off. I believe Joe has been tracking a bug related to log line ordering on the bulk log line feature too.

(Just double checking that we're all on the same page here.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: New Issues

Development

Successfully merging this pull request may close these issues.

Run channel timeouts for run:log and step:complete events

2 participants