Findings
1) `httpjson` can silently lose events after a transient publish failure
Priority: P0 (most user-impacting)
Location
x-pack/filebeat/input/httpjson/request.go:819-837
x-pack/filebeat/input/httpjson/request.go:51-78, 235-239
Evidence
- In `publisher.handleEvent`, a publish failure returns immediately: `if err := p.pub.Publish(event, p.trCtx.cursorMap()); err != nil { ...; return }` (request.go:827-829).
- Processing continues for later events in the same response, and successful later events call `p.trCtx.updateLastEvent(msg)` + `p.trCtx.updateCursor()` (request.go:835-836).
- `doRequest` still returns `nil` at the end (request.go:238), so the interval completes as success.
What is wrong
A failure publishing one event does not fail the batch/interval. Later events can still advance cursor state, causing the failed earlier event to be skipped on the next poll.
Why it matters
This is silent data loss under realistic transient output/pipeline backpressure failures.
Suggested fix direction
Treat a publish failure as an interval failure (propagate the error), and prevent cursor advancement after the first publish error in a response/batch.
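A minimal sketch of that direction, assuming a batch-style publish helper; `event`, `cursorState`, `pipeline`, and `publishAll` are illustrative stand-ins, not the real request.go types:

```go
package main

import "fmt"

// Hypothetical stand-ins for the httpjson publisher plumbing; the real
// publisher, transform context, and beat pipeline have richer shapes.
type event struct {
	fields map[string]any
	cursor map[string]any
}

type cursorState struct{ last map[string]any }

func (c *cursorState) update(cur map[string]any) { c.last = cur }

type pipeline interface {
	Publish(fields, cursor map[string]any) error
}

// publishAll advances cursor state only for events the pipeline actually
// accepted, and fails the whole batch on the first publish error so the
// caller can treat the interval as failed and retry it.
func publishAll(p pipeline, st *cursorState, events []event) error {
	for i, e := range events {
		if err := p.Publish(e.fields, e.cursor); err != nil {
			// No cursor update for this or any later event: the failed
			// event must be re-fetched on the next poll.
			return fmt.Errorf("publishing event %d failed: %w", i, err)
		}
		st.update(e.cursor) // advance only after a successful hand-off
	}
	return nil
}
```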
Proposed failing test
x-pack/filebeat/input/httpjson/request_test.go::TestDoRequest_DoesNotAdvanceCursorAfterPublishError
- Mock a response with multiple events, force the first `Publish` to fail and the second to succeed, and assert that the cursor is not advanced past the failed event and that `doRequest` returns an error/retry signal (see the sketch below).
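A sketch of that test, written against the illustrative `publishAll` helper above rather than the real `doRequest` plumbing:

```go
package main

import (
	"errors"
	"testing"
)

// failOnce fails the first Publish call and succeeds afterwards,
// simulating a transient pipeline/backpressure error.
type failOnce struct{ calls int }

func (f *failOnce) Publish(fields, cursor map[string]any) error {
	if f.calls++; f.calls == 1 {
		return errors.New("transient publish failure")
	}
	return nil
}

func TestDoesNotAdvanceCursorAfterPublishError(t *testing.T) {
	st := &cursorState{}
	events := []event{
		{cursor: map[string]any{"pos": 1}},
		{cursor: map[string]any{"pos": 2}},
	}
	if err := publishAll(&failOnce{}, st, events); err == nil {
		t.Fatal("expected an error so the interval is retried")
	}
	if st.last != nil {
		t.Fatalf("cursor advanced past the failed event: %v", st.last)
	}
}
```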
2) Streaming reconnect backoff accepts zero waits, enabling tight retry loops
Priority: P1
Location
x-pack/filebeat/input/streaming/config.go:177-183
x-pack/filebeat/input/streaming/websocket.go:456-457, 482-483
x-pack/filebeat/input/streaming/crowdstrike.go:184-199
Evidence
- Validation checks ordering only (`wait_min <= wait_max`) but not positivity (config.go:181-182).
- Reconnect loops use the computed duration directly in `time.Sleep(waitTime)` / `time.After(waitTime)`.
What is wrong
`retry.wait_min: 0s` and `retry.wait_max: 0s` are accepted; retries then execute with effectively no delay.
Why it matters
During endpoint outages this can create CPU/network/log storms and hammer remote services.
Suggested fix direction
Enforce `wait_min > 0` and `wait_max > 0` at validation time, and defensively clamp the computed wait to a minimum positive duration before sleeping.
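A minimal sketch of both layers of defense; `retryConfig`, its tags, and the error messages are illustrative, not the actual config.go types:

```go
package main

import (
	"errors"
	"time"
)

// retryConfig mirrors the retry.wait_min / retry.wait_max options.
type retryConfig struct {
	WaitMin time.Duration `config:"wait_min"`
	WaitMax time.Duration `config:"wait_max"`
}

// Validate rejects non-positive waits in addition to the existing
// ordering check.
func (r retryConfig) Validate() error {
	if r.WaitMin <= 0 || r.WaitMax <= 0 {
		return errors.New("retry: wait_min and wait_max must be positive")
	}
	if r.WaitMin > r.WaitMax {
		return errors.New("retry: wait_min must not be greater than wait_max")
	}
	return nil
}

// clampWait is a defensive floor applied just before sleeping so that a
// zero or negative computed backoff can never produce a tight loop.
func clampWait(d time.Duration) time.Duration {
	const floor = 100 * time.Millisecond // illustrative minimum
	if d < floor {
		return floor
	}
	return d
}
```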
Proposed failing test
x-pack/filebeat/input/streaming/config_test.go::TestRetryWaitMustBePositive
- Assert that config validation fails when `wait_min <= 0` or `wait_max <= 0` (see the test sketch below).
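A sketch of that test against the illustrative `retryConfig` above:

```go
package main

import (
	"testing"
	"time"
)

func TestRetryWaitMustBePositive(t *testing.T) {
	// Each of these configurations should be rejected by Validate.
	for _, c := range []retryConfig{
		{WaitMin: 0, WaitMax: time.Second},
		{WaitMin: time.Second, WaitMax: 0},
		{WaitMin: -time.Second, WaitMax: time.Second},
	} {
		if err := c.Validate(); err == nil {
			t.Errorf("expected a validation error for %+v", c)
		}
	}
}
```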
3) WebSocket OAuth token-refresh error paths can hard-stop ingestion on transient failures
Priority: P1
Location
x-pack/filebeat/input/streaming/websocket.go:268-274
x-pack/filebeat/input/streaming/websocket.go:287-294
Evidence
- On a token-refresh fetch error, the function returns immediately (`return err`).
- On reconnect failure after refresh, the function returns immediately (`return err`).
What is wrong
A single transient OAuth/token-endpoint or network failure during refresh permanently terminates `FollowStream` instead of retrying with backoff.
Why it matters
A brief auth or network blip can halt ingestion until external restart/orchestration recovers the input.
Suggested fix direction
Convert these refresh failures to retryable paths with bounded/infinite backoff; avoid immediate fatal return for transient refresh errors.
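A minimal sketch of the retry loop, assuming the refresh-and-reconnect step can be wrapped in a function; `retryTransient` and its backoff constants are illustrative:

```go
package main

import (
	"context"
	"time"
)

// retryTransient retries fn with capped exponential backoff until it
// succeeds or ctx is cancelled. In the input, fn would wrap the token
// refresh and reconnect step. Every error is treated as transient here;
// a real fix might still fail fast on permanent auth errors.
func retryTransient(ctx context.Context, fn func(context.Context) error) error {
	wait := time.Second         // illustrative initial backoff
	const maxWait = time.Minute // illustrative cap
	for {
		if err := fn(ctx); err == nil {
			return nil
		}
		// Back off and retry instead of tearing down the stream.
		select {
		case <-ctx.Done():
			return ctx.Err() // shutdown still wins
		case <-time.After(wait):
		}
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
}
```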
Proposed failing test
x-pack/filebeat/input/streaming/input_test.go::TestWebsocketTokenRefreshTransientFailureRetries
- Simulate one temporary token fetch/reconnect failure followed by success; assert the stream remains alive and resumes (see the test sketch below).
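A sketch of that test against the illustrative `retryTransient` above (it sleeps for one backoff interval, so a real test would want an injectable or fake clock):

```go
package main

import (
	"context"
	"errors"
	"testing"
)

func TestTransientRefreshFailureRetries(t *testing.T) {
	calls := 0
	refresh := func(context.Context) error {
		if calls++; calls == 1 {
			return errors.New("temporary token fetch failure")
		}
		return nil
	}
	if err := retryTransient(context.Background(), refresh); err != nil {
		t.Fatalf("stream should survive a transient refresh failure: %v", err)
	}
	if calls != 2 {
		t.Fatalf("expected exactly one retry, got %d calls", calls)
	}
}
```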
4) Streaming processor can rewind cursor to startup value after publish errors across reconnects
Priority: P1
Location
x-pack/filebeat/input/streaming/websocket.go:343
x-pack/filebeat/input/streaming/crowdstrike.go:357
x-pack/filebeat/input/streaming/input.go:183-184, 317
Evidence
- Callers pass `s.cursor` into the processor for each message (websocket.go:343, crowdstrike.go:357).
- The processor initializes its fallback cursor from the function argument (`goodCursor := cursor`, input.go:184) and writes it back on publication errors (`state["cursor"] = goodCursor`, input.go:317).
- `s.cursor` is the startup cursor and is not updated inside the follow loops.
What is wrong
On publish failure/disconnect, the state fallback may be reset to the stale startup cursor instead of the latest known-good runtime cursor.
Why it matters
Reconnect may replay already-processed data repeatedly (state corruption/duplication) in real disconnect + transient publish-failure scenarios.
Suggested fix direction
Do not pass a separate, stale cursor argument; derive the fallback from the current `state["cursor"]`, or maintain and update the follower's cursor as processing succeeds.
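A minimal sketch of the second option, promoting the cursor to the known-good fallback only after a successful publish; `state` and `process` here are illustrative, not the real input.go shapes:

```go
package main

type state map[string]any

// process publishes msg and only then promotes its cursor to the
// known-good fallback. On error, the previous value in state["cursor"]
// is left untouched, so a reconnect resumes from the latest successfully
// published position rather than from the startup cursor.
func process(st state, publish func(map[string]any) error, msg map[string]any, cur any) error {
	if err := publish(msg); err != nil {
		// Do not overwrite state["cursor"] with a value captured before
		// this follow loop began.
		return err
	}
	st["cursor"] = cur // advance only after a successful publish
	return nil
}
```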
Proposed failing test
x-pack/filebeat/input/streaming/input_test.go::TestProcess_DoesNotRewindCursorAfterPublishFailureAcrossReconnect
- Seed a non-empty cursor progression, inject a publish failure, and assert that the returned state cursor never rewinds to the initial startup cursor (see the test sketch below).
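A sketch of that test against the illustrative `process` above:

```go
package main

import (
	"errors"
	"testing"
)

func TestDoesNotRewindCursorAfterPublishFailure(t *testing.T) {
	// The state has already progressed past the startup cursor.
	st := state{"cursor": map[string]any{"offset": 10}}
	failing := func(map[string]any) error { return errors.New("publish failed") }
	startup := map[string]any{"offset": 0}

	_ = process(st, failing, map[string]any{"msg": "x"}, startup)

	if got := st["cursor"].(map[string]any)["offset"]; got != 10 {
		t.Fatalf("cursor rewound to %v", got)
	}
}
```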
Inputs and error paths reviewed and found safe (for this high-severity sweep)
- `streaming` response-shape handling for the CrowdStrike stream: non-object JSON messages are explicitly skipped (x-pack/filebeat/input/streaming/crowdstrike.go:350-354).
- `httpjson` response processing: non-JSON/non-object payloads are degraded/skipped rather than published as normal events (x-pack/filebeat/input/httpjson/response.go:62-87).
- No additional confirmed high-severity malformed-response publication issues were found in `x-pack/filebeat/input/cel` during this pass.
Reproduction realism
All scenarios above are reproducible with realistic mock HTTP/WebSocket server behavior: temporary 5xx/auth failures, disconnects, and ordinary backpressure-style publish errors.