Skip to content

[streaming-input-resilience] High-severity resilience gaps in Filebeat streaming/httpjson error paths #49612

@github-actions

Description

@github-actions

Findings

1) httpjson can silently lose events after a transient publish failure

Priority: P0 (most user-impacting)

Location

  • x-pack/filebeat/input/httpjson/request.go:819-837
  • x-pack/filebeat/input/httpjson/request.go:51-78, 235-239

Evidence

  • In publisher.handleEvent, publish failure returns immediately:
    • if err := p.pub.Publish(event, p.trCtx.cursorMap()); err != nil { ...; return } (request.go:827-829)
  • Processing continues for later events in the same response, and successful later events call:
    • p.trCtx.updateLastEvent(msg) + p.trCtx.updateCursor() (request.go:835-836)
  • doRequest still returns nil at end (request.go:238), so interval completes as success.

What is wrong
A failure publishing one event does not fail the batch/interval. Later events can still advance cursor state, causing the failed earlier event to be skipped on next poll.

Why it matters
This is silent data loss under realistic transient output/pipeline backpressure failures.

Suggested fix direction
Treat publish failure as interval failure (propagate error), and prevent cursor advancement after first publish error in a response/batch.

Proposed failing test

  • x-pack/filebeat/input/httpjson/request_test.go::TestDoRequest_DoesNotAdvanceCursorAfterPublishError
  • Mock a response with multiple events, force first Publish to fail and second to succeed, assert cursor is not advanced past failed event and doRequest returns error/retry signal.

2) Streaming reconnect backoff accepts zero waits, enabling tight retry loops

Priority: P1

Location

  • x-pack/filebeat/input/streaming/config.go:177-183
  • x-pack/filebeat/input/streaming/websocket.go:456-457, 482-483
  • x-pack/filebeat/input/streaming/crowdstrike.go:184-199

Evidence

  • Validation checks ordering only (wait_min <= wait_max) but not positivity (config.go:181-182).
  • Reconnect loops use computed duration directly in time.Sleep(waitTime) / time.After(waitTime).

What is wrong
retry.wait_min: 0s and retry.wait_max: 0s are accepted; retries then execute with effectively no delay.

Why it matters
During endpoint outages this can create CPU/network/log storms and hammer remote services.

Suggested fix direction
Enforce wait_min > 0 and wait_max > 0 at validation; defensively clamp computed wait to a minimum positive duration before sleep.

Proposed failing test

  • x-pack/filebeat/input/streaming/config_test.go::TestRetryWaitMustBePositive
  • Assert config validation fails when wait_min <= 0 or wait_max <= 0.

3) WebSocket OAuth token-refresh error paths can hard-stop ingestion on transient failures

Priority: P1

Location

  • x-pack/filebeat/input/streaming/websocket.go:268-274
  • x-pack/filebeat/input/streaming/websocket.go:287-294

Evidence

  • On token refresh fetch error: function returns immediately (return err).
  • On reconnect failure after refresh: function returns immediately (return err).

What is wrong
A single transient OAuth/token endpoint or network failure during refresh permanently terminates FollowStream instead of retrying with backoff.

Why it matters
A brief auth or network blip can halt ingestion until external restart/orchestration recovers the input.

Suggested fix direction
Convert these refresh failures to retryable paths with bounded/infinite backoff; avoid immediate fatal return for transient refresh errors.

Proposed failing test

  • x-pack/filebeat/input/streaming/input_test.go::TestWebsocketTokenRefreshTransientFailureRetries
  • Simulate one temporary token fetch/reconnect failure then success; assert stream remains alive and resumes.

4) Streaming processor can rewind cursor to startup value after publish errors across reconnects

Priority: P1

Location

  • x-pack/filebeat/input/streaming/websocket.go:343
  • x-pack/filebeat/input/streaming/crowdstrike.go:357
  • x-pack/filebeat/input/streaming/input.go:183-184, 317

Evidence

  • Callers pass s.cursor into processor each message (websocket.go:343, crowdstrike.go:357).
  • Processor initializes fallback cursor from function arg (goodCursor := cursor, input.go:184) and writes it back on publication errors (state["cursor"] = goodCursor, input.go:317).
  • s.cursor is startup cursor and is not updated inside follow loops.

What is wrong
On publish failure/disconnect, state fallback may be reset to stale startup cursor instead of latest known-good runtime cursor.

Why it matters
Reconnect may replay already-processed data repeatedly (state corruption/duplication) in real disconnect + transient publish-failure scenarios.

Suggested fix direction
Do not pass a separate stale cursor arg; derive fallback from current state["cursor"] or maintain/update follower cursor as processing succeeds.

Proposed failing test

  • x-pack/filebeat/input/streaming/input_test.go::TestProcess_DoesNotRewindCursorAfterPublishFailureAcrossReconnect
  • Seed non-empty cursor progression, inject publish failure, assert returned state cursor never rewinds to initial startup cursor.

Inputs and error paths reviewed and found safe (for this high-severity sweep)

  • streaming response-shape handling for CrowdStrike stream: non-object JSON messages are explicitly skipped (x-pack/filebeat/input/streaming/crowdstrike.go:350-354).
  • httpjson/response processing: non-JSON/non-object payloads are degraded/skipped rather than published as normal events (x-pack/filebeat/input/httpjson/response.go:62-87).
  • No additional confirmed high-severity malformed-response publication issue found in x-pack/filebeat/input/cel during this pass.

Reproduction realism

All scenarios above are reproducible with realistic mock HTTP/WebSocket server behavior: temporary 5xx/auth failures, disconnects, and normal backpressure-like publish errors.


What is this? | From workflow: Sweeper: Streaming Input Error Path Resilience

Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not.

  • expires on Mar 30, 2026, 6:30 PM UTC

Metadata

Metadata

Assignees

No one assigned

    Labels

    needs_teamIndicates that the issue/PR needs a Team:* label

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions