feat: add prometheus metrics for frontline routing and proxy#5268
feat: add prometheus metrics for frontline routing and proxy#5268
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughRenamed existing middleware metrics to use the "unkey" namespace and added Prometheus instrumentation across the proxy and router services, including new metric declarations and runtime increments for proxy forwarding, backend responses, hop counts, routing decisions, errors, and durations. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant Proxy as FrontlineProxy
participant Backend as UpstreamBackend
participant Metrics as Prometheus
Client->>Proxy: HTTP request
Proxy->>Proxy: start backendStart = now
Proxy->>Backend: forward request
alt backend success
Backend-->>Proxy: response (2xx/3xx/4xx/5xx)
Proxy->>Proxy: ModifyResponse observes duration
Proxy->>Metrics: proxyBackendDuration.Observe(destination, duration)
Proxy->>Metrics: proxyBackendResponseTotal.Inc(destination, source, status_class)
Proxy->>Metrics: proxyForwardTotal.Inc(destination, "none" or categorized)
Proxy-->>Client: proxied response
else backend error
Backend-->>Proxy: network/error
Proxy->>Proxy: ErrorHandler observes duration
Proxy->>Metrics: proxyBackendDuration.Observe(destination, duration)
Proxy->>Metrics: proxyBackendResponseTotal.Inc(destination, source, status_class)
Proxy->>Metrics: proxyForwardTotal.Inc(destination, categorized_error)
Proxy-->>Client: fault error
end
Proxy->>Metrics: proxyHopsTotal.Observe(hops) (if header present)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
3366b70 to
6270600
Compare
66f3e36 to
9adedb9
Compare
6270600 to
155ce87
Compare
9adedb9 to
cc3a71d
Compare
155ce87 to
61ced6a
Compare
61ced6a to
61a8527
Compare
61a8527 to
e2de48e
Compare
cc3a71d to
75cfe4e
Compare
e2de48e to
520e4c6
Compare
75cfe4e to
01dac21
Compare
01dac21 to
8ce64a8
Compare
520e4c6 to
6ddfc5a
Compare
8ce64a8 to
9204546
Compare
6ddfc5a to
71aaa96
Compare
b5e9192 to
ae5da5b
Compare
979c6ae to
f478540
Compare
f478540 to
1226d14
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
svc/frontline/services/router/lookup.go (1)
67-74:⚠️ Potential issue | 🟠 MajorDouble-counting:
instance_load_failedincremented here and inservice.go:47.When
getInstancesfails, this line increments the counter, then the caller inservice.goalso incrementsroutingErrorsTotal.WithLabelValues("instance_load_failed"). Remove one of the increments.Proposed fix: remove duplicate in lookup.go
if err != nil && !mysql.IsNotFound(err) { - routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() return nil, fault.Wrap(err, fault.Code(codes.Frontline.Internal.ConfigLoadFailed.URN()), fault.Internal("error loading instances"), fault.Public("Failed to load instance configuration"), ) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@svc/frontline/services/router/lookup.go` around lines 67 - 74, The duplicate metric increment occurs in the error path inside lookup.go (the block that returns a wrapped fault when getInstances fails) where routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() is called; remove that increment so the counter is only incremented by the caller in service.go (around service.go:47) and keep the existing fault.Wrap(...) return as-is to preserve error semantics.
🧹 Nitpick comments (1)
svc/frontline/middleware/observability.go (1)
24-27: Series rename needs rollout plan.These lines rename existing series. Any dashboard/alert/rule on
frontline_*goes dark on deploy. Ship consumer updates with this PR, or dual-publish one release.Also applies to: 34-38, 45-48
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@svc/frontline/middleware/observability.go` around lines 24 - 27, The change renames existing Prometheus series (Namespace/Subsystem/Name) which will break dashboards/alerts; instead emit both old and new series during rollout: keep the new metric (e.g., the collector using Name: "requests_total") and also register a second collector with the original namespace/subsystem/name (the previous frontline_* naming) so both series are published concurrently; apply the same dual-publish pattern to the other renamed metrics referenced around the other diffs (lines 34-38 and 45-48) by creating and registering duplicate collectors (e.g., requestsTotal and requestsTotalLegacy) or otherwise publishing the legacy metric until consumers are migrated, then remove the legacy collectors in a follow-up release.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@svc/frontline/services/proxy/forward.go`:
- Around line 147-149: The metric proxyBackendDuration is measured around
proxy.ServeHTTP which blocks until the response body is fully copied to the
client, so it incorrectly includes client streaming time; remove the observe
call around ServeHTTP and instead record backend latency inside the reverse
proxy callbacks by creating an observeBackend() helper and calling it at the
start of both the ReverseProxy ModifyResponse and ErrorHandler handlers
(referencing proxy.ServeHTTP, proxyBackendDuration, ModifyResponse,
ErrorHandler, and observeBackend) so the duration is observed when the backend
response or error is received rather than after client copy completes.
In `@svc/frontline/services/proxy/service.go`:
- Around line 144-160: The proxy currently calls proxyHopsTotal.Observe using
HeaderFrontlineHops in the session before makeRegionDirector increments the hop
count, causing off-by-one/missing metrics; update the code so the metric is
recorded after the hop increment instead—either move the Observe call out of the
session handler and into the director code path where makeRegionDirector (or the
director method that performs the hop increment) increments the hop header, or
have the director emit proxyHopsTotal.Observe with the new incremented value;
reference HeaderFrontlineHops, proxyHopsTotal.Observe, makeRegionDirector, and
the director hop-increment logic when making the change.
In `@svc/frontline/services/router/service.go`:
- Around line 39-43: The routingErrorsTotal metric is being incremented with
label "config_not_found" for all errors returned by lookupByHostname, but
lookupByHostname can return different fault codes from findRoute
(ConfigLoadFailed, ConfigNotFound) and sentinel loading (ConfigLoadFailed);
update the error handling so the metric label matches the actual failure: either
(A) have lookupByHostname return a typed error or fault code and switch on that
in the caller to call
routingErrorsTotal.WithLabelValues("<correct_code>").Inc(), or (B) move the
metric increment into the lower-level functions (findRoute and sentinel loader)
where the concrete error type is known and increment with the appropriate label
there; reference lookupByHostname, findRoute and routingErrorsTotal when making
the change.
- Around line 45-49: The metric
routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() is being
double-counted: it's incremented both at the call site in service.go (around
s.getInstances(ctx, route.DeploymentID)) and inside the callee getInstances
(lookup.go). Remove the duplicate increment from the callee: delete the
routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() inside the
getInstances implementation in lookup.go so the single increment remains at the
s.getInstances call site (or alternatively remove the call-site increment and
keep the one in getInstances if you prefer), ensuring only one location
increments that metric.
---
Outside diff comments:
In `@svc/frontline/services/router/lookup.go`:
- Around line 67-74: The duplicate metric increment occurs in the error path
inside lookup.go (the block that returns a wrapped fault when getInstances
fails) where routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() is
called; remove that increment so the counter is only incremented by the caller
in service.go (around service.go:47) and keep the existing fault.Wrap(...)
return as-is to preserve error semantics.
---
Nitpick comments:
In `@svc/frontline/middleware/observability.go`:
- Around line 24-27: The change renames existing Prometheus series
(Namespace/Subsystem/Name) which will break dashboards/alerts; instead emit both
old and new series during rollout: keep the new metric (e.g., the collector
using Name: "requests_total") and also register a second collector with the
original namespace/subsystem/name (the previous frontline_* naming) so both
series are published concurrently; apply the same dual-publish pattern to the
other renamed metrics referenced around the other diffs (lines 34-38 and 45-48)
by creating and registering duplicate collectors (e.g., requestsTotal and
requestsTotalLegacy) or otherwise publishing the legacy metric until consumers
are migrated, then remove the legacy collectors in a follow-up release.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 7e0b0192-dbfd-48c4-a461-2e7ee6ae100a
📒 Files selected for processing (10)
svc/frontline/middleware/observability.gosvc/frontline/services/proxy/BUILD.bazelsvc/frontline/services/proxy/forward.gosvc/frontline/services/proxy/metrics.gosvc/frontline/services/proxy/service.gosvc/frontline/services/router/BUILD.bazelsvc/frontline/services/router/lookup.gosvc/frontline/services/router/metrics.gosvc/frontline/services/router/routing.gosvc/frontline/services/router/service.go
Adds metrics to detect routing and proxy issues: Router: - decisions_total: local vs remote routing split by target region - errors_total: no_running_instances, no_sentinels_for_instances, config_not_found - duration_seconds: route decision latency (cache + DB) - running_instances: distribution of available instances per decision Proxy: - forward_total: success/failure by target and error type (timeout, conn_refused, etc.) - backend_duration_seconds: backend response latency by target - backend_response_total: HTTP status classes (2xx/4xx/5xx) from backends - hops: cross-region hop count distribution
1226d14 to
a759368
Compare
a759368 to
ca3eab6
Compare
Merge activity
|

What does this PR do?
Some initial frontline metrics that we can tweak change remove over time.
Fixes # (issue)
Type of change
How should this be tested?
/metricsendpointChecklist
Required
pnpm buildpnpm fmtmake fmton/godirectoryconsole.logsgit pull origin mainAppreciated