feat(outbound): add load balancing and circuit breaking proto surface #556
unleashed wants to merge 6 commits into
The tonic-prost-build configure call was missing an explicit build_transport(true). While this is the default value, being explicit keeps the builder chain consistent with build_client and build_server, and prevents surprises if the default changes. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Replace the single-field oneof wrapper with direct optional fields, allowing consecutive-failure and success-rate policies to coexist. The consecutive_failures field retains field number 1, so the wire encoding is identical to the old oneof layout and existing proxies continue to work without changes. Add a SuccessRate nested message at field 2 with threshold, decay, and min_requests parameters. When absent, success-rate accrual is disabled. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Introduce a LoadBiasConfig message for 429-aware load balancing. When set on Http1, Http2, or Grpc protocol variants, the proxy injects artificial latency penalties on rate-limited endpoints so the P2C balancer prefers healthier alternatives. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Introduce a RetryAfterConfig message for Retry-After header handling. When set on Http1, Http2, or Grpc protocol variants, the proxy honors Retry-After headers from 429 responses and clamps durations to the configured maximum. Added as field 4 on each HTTP protocol variant. The single max_duration field caps the Retry-After value the proxy will honor, falling back to a built-in default when absent. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Add pool-level ejection protection to the P2C load balancer configuration. When set, this prevents circuit breakers from ejecting all endpoints in a load-balancing pool by enforcing a minimum number of ready endpoints. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Full regeneration from updated outbound.proto after the FailureAccrual restructure, the additions of LoadBiasConfig, RetryAfterConfig and EjectionConfig messages, and load_bias, retry_after and ejection fields. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
i've left two comments here, about the same general concern: i think we should avoid the potential for compatibility issues and instead deprecate the existing kind field, without removing it.
in the long run we should find a way to version the protocol, but in the short term we can rely on some small compatibility shims to inspect a FailureAccrual object and determine the proper course of action.
i imagine we can consider both fields existing an error, and perhaps have new control planes log a deprecation warning when an outbound proxy emits the old kind field.
// Configures failure accrual policies for circuit breaking.
// Setting a numeric policy field to zero disables that policy.
message FailureAccrual {
  message ConsecutiveFailures {
    // Maximum consecutive failures before the circuit trips.
    // Set to 0 to disable (an unset field has the same effect).
    uint32 max_failures = 1;
    // Must be set. Controls the ejection duration before probe requests
    // are allowed after any policy (not just CF) trips the circuit.
thanks for introducing extra documentation to fields like this. i really appreciate you taking the time to make improvements along the way.
cratelyn left a comment
✔️ this looks good to me, thanks for talking through this change!
Add the protobuf API surface for 429-aware load balancing and circuit breaking
in the outbound proxy stack. This is the first of a series of PRs across
linkerd2-proxy-api, linkerd2-proxy, and linkerd2 that introduce opt-in load
balancing features to the P2C balancer.
Most changes are additive and every new feature requires explicit opt-in via
annotations. Default proxy behavior is unchanged.
Main changes:
FailureAccrual restructure. Replaces the oneof kind wrapper with a flat
message to support multiple accrual policies simultaneously. A new
SuccessRate policy is added alongside the existing ConsecutiveFailures.
The shared ExponentialBackoff remains inside ConsecutiveFailures for wire
compatibility. Set max_failures = 0 to disable the consecutive-failure
policy while retaining the shared backoff.
LoadBiasConfig. Configures 429-aware load biasing (enabled flag, penalty
duration, penalty decay) on Http1, Http2, and Grpc protocol messages.
RetryAfterConfig. Configures handling of Retry-After headers (HTTP 429/503)
and grpc-retry-pushback-ms trailers (gRPC RESOURCE_EXHAUSTED), with a
configurable max duration cap.
EjectionConfig. Pool-level ejection protection that prevents circuit
breakers from ejecting all endpoints in a load-balancing pool by enforcing
a minimum number of ready endpoints.
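To make the intent of the three new config messages concrete, here is a rough Python sketch of the behavior they describe. It is not the proxy's implementation: the function names, the use of seconds as floats, and the half-life style of penalty decay are all assumptions made for illustration, and the actual proxy logic lives in linkerd2-proxy.

```python
from typing import Optional


def clamp_retry_after(requested_secs: float, max_duration_secs: float) -> float:
    """RetryAfterConfig idea: honor the server's hint, but never beyond the cap."""
    return min(max(requested_secs, 0.0), max_duration_secs)


def may_eject(ready_endpoints: int, min_ready: int) -> bool:
    """EjectionConfig idea: allow an ejection only if enough endpoints stay ready.

    `min_ready` is a hypothetical name for the configured minimum.
    """
    return ready_endpoints - 1 >= min_ready


def biased_load(
    observed_latency_secs: float,
    penalty_secs: float,
    secs_since_429: Optional[float],
    decay_secs: float,
) -> float:
    """LoadBiasConfig idea: add an artificial latency penalty after a 429.

    The penalty halves every `decay_secs` (an assumed decay shape).
    """
    if secs_since_429 is None:
        return observed_latency_secs  # endpoint was never rate limited
    remaining = penalty_secs * 0.5 ** (secs_since_429 / decay_secs)
    return observed_latency_secs + remaining


def p2c_pick(load_a: float, load_b: float) -> str:
    """Power-of-two-choices step: of two sampled endpoints, prefer the lower load."""
    return "a" if load_a <= load_b else "b"


# Endpoint "a" is faster but was rate limited 2 seconds ago; "b" is slower but healthy.
load_a = biased_load(0.010, penalty_secs=1.0, secs_since_429=2.0, decay_secs=10.0)
load_b = biased_load(0.050, penalty_secs=1.0, secs_since_429=None, decay_secs=10.0)
choice = p2c_pick(load_a, load_b)
```

In this sketch the penalized endpoint loses the comparison until its penalty has decayed, which is the effect the LoadBiasConfig description is aiming for.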
Backwards compatibility
New proto fields use fresh field numbers (no renumbering) and are all
optional, so old consumers simply ignore them. The one message we modify,
FailureAccrual, remains backwards compatible (see below).
Wire compatibility (FailureAccrual)
The oneof removal is wire-safe: protobuf encodes oneof fields identically to
regular fields on the wire. The consecutive_failures field retains tag 1 with
wire type 2 (LEN). You can easily verify this yourself:
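One way to check this by hand is to encode the message without any protobuf library, as in the Python sketch below. It assumes only what is stated above (consecutive_failures at field 1 with wire type LEN, and max_failures as a uint32 at field 1 inside it); the helper names and the sample value 7 are illustrative. Note that the encoder takes no "oneof" input at all: a record key is just the field number and wire type, so a field inside a oneof and a plain field with the same number produce the same bytes.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Encode a varint field: key has wire type 0 (VARINT)."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_len_field(field_number: int, payload: bytes) -> bytes:
    """Encode a length-delimited field: key has wire type 2 (LEN)."""
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


# ConsecutiveFailures { max_failures: 7 }
consecutive_failures = encode_varint_field(1, 7)

# FailureAccrual { consecutive_failures: { max_failures: 7 } }
failure_accrual = encode_len_field(1, consecutive_failures)

print(failure_accrual.hex())
```

The first byte is the key for field 1 with wire type 2, followed by the length of the nested message and the nested bytes, regardless of whether the schema declares the field inside a oneof.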
Backoff placement
The ExponentialBackoff backoff field lives inside ConsecutiveFailures rather
than on FailureAccrual directly. This preserves the original field layout for
wire compatibility (backoff was always at tag 2 inside ConsecutiveFailures).
The backoff is shared across all accrual policies: to use only the
success-rate policy, we set consecutive_failures with max_failures = 0 to
disable the consecutive-failure check while still setting the shared backoff
configuration.
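The success-rate-only configuration described above can be modelled with a few Python dataclasses. This is a sketch of the message shape, not generated code: the field types, the ExponentialBackoff members, the `_secs` suffixes, and the helper functions are assumptions for illustration, while the message and field names (consecutive_failures, max_failures, backoff, threshold, decay, min_requests) follow the PR text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExponentialBackoff:
    # Hypothetical members; the real message may differ.
    min_backoff_secs: float
    max_backoff_secs: float


@dataclass
class ConsecutiveFailures:
    max_failures: int = 0  # 0 disables the consecutive-failure policy
    backoff: Optional[ExponentialBackoff] = None  # shared by all policies


@dataclass
class SuccessRate:
    threshold: float    # assumed to be a ratio in [0, 1]
    decay_secs: float   # assumed unit
    min_requests: int


@dataclass
class FailureAccrual:
    consecutive_failures: Optional[ConsecutiveFailures] = None
    success_rate: Optional[SuccessRate] = None  # absent means disabled


def active_policies(fa: FailureAccrual) -> List[str]:
    """List which accrual policies would be in effect for this config."""
    active = []
    if fa.consecutive_failures and fa.consecutive_failures.max_failures > 0:
        active.append("consecutive_failures")
    if fa.success_rate is not None:
        active.append("success_rate")
    return active


def shared_backoff(fa: FailureAccrual) -> Optional[ExponentialBackoff]:
    """The backoff is read from ConsecutiveFailures even when that policy is off."""
    return fa.consecutive_failures.backoff if fa.consecutive_failures else None


# Success-rate only: max_failures = 0, but the shared backoff is still set.
success_rate_only = FailureAccrual(
    consecutive_failures=ConsecutiveFailures(
        max_failures=0,
        backoff=ExponentialBackoff(min_backoff_secs=1.0, max_backoff_secs=60.0),
    ),
    success_rate=SuccessRate(threshold=0.9, decay_secs=10.0, min_requests=20),
)
```

With this configuration only the success-rate policy is active, yet the backoff remains available to control ejection duration after the circuit trips.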
Cross-Repo Dependencies
linkerd2-proxy will consume the new proto types for client policy parsing, circuit breaking, load biasing, and ejection coordination (to-be-pushed). These will refer to either a git commit id in this repo with this PR or to a crate version after this lands.
linkerd2 control plane will consume this to populate the new fields based on annotations.