feat: add prometheus metrics for frontline routing and proxy by Flo4604 · Pull Request #5268 · unkeyed/unkey

Flo4604 · 2026-03-09T19:03:13Z

What does this PR do?

Some initial frontline metrics that we can tweak change remove over time.

Fixes # (issue)

Type of change

Bug fix (non-breaking change which fixes an issue)
Chore (refactoring code, technical debt, workflow improvements)
Enhancement (small improvements)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How should this be tested?

Deploy to a test environment and verify metrics are being exported at /metrics endpoint
Generate traffic through the proxy and confirm metrics increment correctly
Test various error conditions (backend timeouts, connection failures) to verify error categorization
Verify routing metrics show correct local vs remote decisions
Check that hop count metrics are recorded for cross-region requests

Checklist

Required

Appreciated

If a UI change was made: Added a screen recording or screenshots to this PR
Updated the Unkey Docs if changes were necessary

vercel · 2026-03-09T19:03:16Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
dashboard	Ready	Preview, Comment	Mar 24, 2026 0:24am

Flo4604 · 2026-03-09T19:03:27Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

coderabbitai · 2026-03-09T19:03:37Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b3d5c856-fd45-4236-807b-457defb5e5df

📥 Commits

Reviewing files that changed from the base of the PR and between a759368 and ca3eab6.

📒 Files selected for processing (5)

svc/frontline/services/proxy/forward.go
svc/frontline/services/proxy/metrics.go
svc/frontline/services/router/lookup.go
svc/frontline/services/router/metrics.go
svc/frontline/services/router/service.go

📝 Walkthrough

Walkthrough

Renamed existing middleware metrics to use the "unkey" namespace and added Prometheus instrumentation across the proxy and router services, including new metric declarations and runtime increments for proxy forwarding, backend responses, hop counts, routing decisions, errors, and durations.

Changes

Cohort / File(s)	Summary
Middleware metrics rename `svc/frontline/middleware/observability.go`	Renamed three Prometheus metrics to use Namespace:`"unkey"`, Subsystem:`"frontline"` with new Names (`requests_total`, `request_duration_seconds`, `active_requests`). Help text, buckets, labels, and update sites unchanged.
Proxy: build & metrics `svc/frontline/services/proxy/BUILD.bazel`, `svc/frontline/services/proxy/metrics.go`	Added `metrics.go` and Prometheus promauto deps. Declares `proxyForwardTotal` (destination,error), `proxyBackendDuration` (destination), `proxyHopsTotal` (hop buckets), and `proxyBackendResponseTotal` (destination,source,status_class).
Proxy: instrumentation & logic `svc/frontline/services/proxy/forward.go`, `svc/frontline/services/proxy/service.go`	Replaced `logTarget` with `destination`. Instrumented ReverseProxy path: record backend start/duration in ModifyResponse/ErrorHandler; increment `proxyBackendResponseTotal` with `source` (from `X-Unkey-Error-Source` defaulting to `upstream`/`sentinel`), increment `proxyForwardTotal` using categorized error types (via new `categorizeProxyErrorType`) and `statusClass`; always observe parsed hop counts via `proxyHopsTotal` and reject only when hops >= max.
Router: build & metrics `svc/frontline/services/router/BUILD.bazel`, `svc/frontline/services/router/metrics.go`	Added `metrics.go` and Prometheus deps. Declares `routingDecisionsTotal` (decision,target_region), `routingErrorsTotal` (reason), and `routingDuration` (histogram).
Router: instrumentation `svc/frontline/services/router/lookup.go`, `svc/frontline/services/router/routing.go`, `svc/frontline/services/router/service.go`	Added metric increments on error paths: `routingErrorsTotal` in `lookup` and `selectSentinel` (multiple distinct reasons). `Route()` now measures duration, records routing decisions (`local_sentinel` or `remote_region`) and increments error metric on instance-load failure.

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Proxy as FrontlineProxy
  participant Backend as UpstreamBackend
  participant Metrics as Prometheus

  Client->>Proxy: HTTP request
  Proxy->>Proxy: start backendStart = now
  Proxy->>Backend: forward request
  alt backend success
    Backend-->>Proxy: response (2xx/3xx/4xx/5xx)
    Proxy->>Proxy: ModifyResponse observes duration
    Proxy->>Metrics: proxyBackendDuration.Observe(destination, duration)
    Proxy->>Metrics: proxyBackendResponseTotal.Inc(destination, source, status_class)
    Proxy->>Metrics: proxyForwardTotal.Inc(destination, "none" or categorized)
    Proxy-->>Client: proxied response
  else backend error
    Backend-->>Proxy: network/error
    Proxy->>Proxy: ErrorHandler observes duration
    Proxy->>Metrics: proxyBackendDuration.Observe(destination, duration)
    Proxy->>Metrics: proxyBackendResponseTotal.Inc(destination, source, status_class)
    Proxy->>Metrics: proxyForwardTotal.Inc(destination, categorized_error)
    Proxy-->>Client: fault error
  end
  Proxy->>Metrics: proxyHopsTotal.Observe(hops) (if header present)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The description includes the what/why summary, type selection, and detailed test instructions, but the 'Fixes' issue reference is not completed and none of the required checklist items are marked as done.	Complete the 'Fixes # (issue)' reference with an actual issue number, or remove the placeholder if no issue exists. Consider marking relevant checklist items (e.g., formatting, code review) once completed.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding Prometheus metrics for frontline routing and proxy components.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch frontline-prom-metrics

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

svc/frontline/services/router/lookup.go (1)
67-74: ⚠️ Potential issue | 🟠 Major

Double-counting: instance_load_failed incremented here and in service.go:47.

When getInstances fails, this line increments the counter, then the caller in service.go also increments routingErrorsTotal.WithLabelValues("instance_load_failed"). Remove one of the increments.
Proposed fix: remove duplicate in lookup.go
 	if err != nil && !mysql.IsNotFound(err) {
-		routingErrorsTotal.WithLabelValues("instance_load_failed").Inc()
 		return nil, fault.Wrap(err,
 			fault.Code(codes.Frontline.Internal.ConfigLoadFailed.URN()),
 			fault.Internal("error loading instances"),
 			fault.Public("Failed to load instance configuration"),
 		)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@svc/frontline/services/router/lookup.go` around lines 67 - 74, The duplicate
metric increment occurs in the error path inside lookup.go (the block that
returns a wrapped fault when getInstances fails) where
routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() is called;
remove that increment so the counter is only incremented by the caller in
service.go (around service.go:47) and keep the existing fault.Wrap(...) return
as-is to preserve error semantics.

🧹 Nitpick comments (1)

svc/frontline/middleware/observability.go (1)

24-27: Series rename needs rollout plan.

These lines rename existing series. Any dashboard/alert/rule on frontline_* goes dark on deploy. Ship consumer updates with this PR, or dual-publish one release.

Also applies to: 34-38, 45-48

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@svc/frontline/middleware/observability.go` around lines 24 - 27, The change
renames existing Prometheus series (Namespace/Subsystem/Name) which will break
dashboards/alerts; instead emit both old and new series during rollout: keep the
new metric (e.g., the collector using Name: "requests_total") and also register
a second collector with the original namespace/subsystem/name (the previous
frontline_* naming) so both series are published concurrently; apply the same
dual-publish pattern to the other renamed metrics referenced around the other
diffs (lines 34-38 and 45-48) by creating and registering duplicate collectors
(e.g., requestsTotal and requestsTotalLegacy) or otherwise publishing the legacy
metric until consumers are migrated, then remove the legacy collectors in a
follow-up release.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@svc/frontline/services/proxy/forward.go`:
- Around line 147-149: The metric proxyBackendDuration is measured around
proxy.ServeHTTP which blocks until the response body is fully copied to the
client, so it incorrectly includes client streaming time; remove the observe
call around ServeHTTP and instead record backend latency inside the reverse
proxy callbacks by creating an observeBackend() helper and calling it at the
start of both the ReverseProxy ModifyResponse and ErrorHandler handlers
(referencing proxy.ServeHTTP, proxyBackendDuration, ModifyResponse,
ErrorHandler, and observeBackend) so the duration is observed when the backend
response or error is received rather than after client copy completes.

In `@svc/frontline/services/proxy/service.go`:
- Around line 144-160: The proxy currently calls proxyHopsTotal.Observe using
HeaderFrontlineHops in the session before makeRegionDirector increments the hop
count, causing off-by-one/missing metrics; update the code so the metric is
recorded after the hop increment instead—either move the Observe call out of the
session handler and into the director code path where makeRegionDirector (or the
director method that performs the hop increment) increments the hop header, or
have the director emit proxyHopsTotal.Observe with the new incremented value;
reference HeaderFrontlineHops, proxyHopsTotal.Observe, makeRegionDirector, and
the director hop-increment logic when making the change.

In `@svc/frontline/services/router/service.go`:
- Around line 39-43: The routingErrorsTotal metric is being incremented with
label "config_not_found" for all errors returned by lookupByHostname, but
lookupByHostname can return different fault codes from findRoute
(ConfigLoadFailed, ConfigNotFound) and sentinel loading (ConfigLoadFailed);
update the error handling so the metric label matches the actual failure: either
(A) have lookupByHostname return a typed error or fault code and switch on that
in the caller to call
routingErrorsTotal.WithLabelValues("<correct_code>").Inc(), or (B) move the
metric increment into the lower-level functions (findRoute and sentinel loader)
where the concrete error type is known and increment with the appropriate label
there; reference lookupByHostname, findRoute and routingErrorsTotal when making
the change.
- Around line 45-49: The metric
routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() is being
double-counted: it's incremented both at the call site in service.go (around
s.getInstances(ctx, route.DeploymentID)) and inside the callee getInstances
(lookup.go). Remove the duplicate increment from the callee: delete the
routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() inside the
getInstances implementation in lookup.go so the single increment remains at the
s.getInstances call site (or alternatively remove the call-site increment and
keep the one in getInstances if you prefer), ensuring only one location
increments that metric.

---

Outside diff comments:
In `@svc/frontline/services/router/lookup.go`:
- Around line 67-74: The duplicate metric increment occurs in the error path
inside lookup.go (the block that returns a wrapped fault when getInstances
fails) where routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() is
called; remove that increment so the counter is only incremented by the caller
in service.go (around service.go:47) and keep the existing fault.Wrap(...)
return as-is to preserve error semantics.

---

Nitpick comments:
In `@svc/frontline/middleware/observability.go`:
- Around line 24-27: The change renames existing Prometheus series
(Namespace/Subsystem/Name) which will break dashboards/alerts; instead emit both
old and new series during rollout: keep the new metric (e.g., the collector
using Name: "requests_total") and also register a second collector with the
original namespace/subsystem/name (the previous frontline_* naming) so both
series are published concurrently; apply the same dual-publish pattern to the
other renamed metrics referenced around the other diffs (lines 34-38 and 45-48)
by creating and registering duplicate collectors (e.g., requestsTotal and
requestsTotalLegacy) or otherwise publishing the legacy metric until consumers
are migrated, then remove the legacy collectors in a follow-up release.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7e0b0192-dbfd-48c4-a461-2e7ee6ae100a

📥 Commits

Reviewing files that changed from the base of the PR and between b55f66d and 1226d14.

📒 Files selected for processing (10)

svc/frontline/middleware/observability.go
svc/frontline/services/proxy/BUILD.bazel
svc/frontline/services/proxy/forward.go
svc/frontline/services/proxy/metrics.go
svc/frontline/services/proxy/service.go
svc/frontline/services/router/BUILD.bazel
svc/frontline/services/router/lookup.go
svc/frontline/services/router/metrics.go
svc/frontline/services/router/routing.go
svc/frontline/services/router/service.go

Adds metrics to detect routing and proxy issues: Router: - decisions_total: local vs remote routing split by target region - errors_total: no_running_instances, no_sentinels_for_instances, config_not_found - duration_seconds: route decision latency (cache + DB) - running_instances: distribution of available instances per decision Proxy: - forward_total: success/failure by target and error type (timeout, conn_refused, etc.) - backend_duration_seconds: backend response latency by target - backend_response_total: HTTP status classes (2xx/4xx/5xx) from backends - hops: cross-region hop count distribution

ogzhanolguncu

rabbit comments might worth checking

Flo4604 · 2026-03-25T17:43:41Z

Merge activity

Mar 25, 5:43 PM UTC: A user started a stack merge that includes this pull request via Graphite.

Flo4604 mentioned this pull request Mar 9, 2026

fix: instance-aware frontline routing to prevent 503s in regions without instances #5267

Merged

19 tasks

vercel Bot deployed to Preview March 9, 2026 19:05 View deployment

Flo4604 force-pushed the frontline-prom-metrics branch from 3366b70 to 6270600 Compare March 9, 2026 19:41

vercel Bot deployed to Preview March 9, 2026 19:44 View deployment

Flo4604 force-pushed the instance-aware-routing branch from 66f3e36 to 9adedb9 Compare March 10, 2026 08:34

Flo4604 force-pushed the frontline-prom-metrics branch from 6270600 to 155ce87 Compare March 10, 2026 08:34

vercel Bot deployed to Preview March 10, 2026 08:36 View deployment

Flo4604 force-pushed the instance-aware-routing branch from 9adedb9 to cc3a71d Compare March 10, 2026 17:58

Flo4604 force-pushed the frontline-prom-metrics branch from 155ce87 to 61ced6a Compare March 10, 2026 17:58

vercel Bot deployed to Preview March 10, 2026 18:01 View deployment

Flo4604 force-pushed the frontline-prom-metrics branch from 61ced6a to 61a8527 Compare March 11, 2026 12:18

vercel Bot deployed to Preview March 11, 2026 12:21 View deployment

Flo4604 mentioned this pull request Mar 11, 2026

feat: add sentinel prom metrics #5283

Merged

19 tasks

Flo4604 force-pushed the frontline-prom-metrics branch from 61a8527 to e2de48e Compare March 12, 2026 09:53

Flo4604 force-pushed the instance-aware-routing branch from cc3a71d to 75cfe4e Compare March 12, 2026 09:53

vercel Bot deployed to Preview March 12, 2026 09:56 View deployment

Flo4604 force-pushed the frontline-prom-metrics branch from e2de48e to 520e4c6 Compare March 12, 2026 17:14

Flo4604 force-pushed the instance-aware-routing branch from 75cfe4e to 01dac21 Compare March 12, 2026 17:14

vercel Bot deployed to Preview March 12, 2026 17:17 View deployment

chronark reviewed Mar 12, 2026

View reviewed changes

Comment thread svc/frontline/services/proxy/metrics.go Outdated

chronark approved these changes Mar 12, 2026

View reviewed changes

Flo4604 force-pushed the instance-aware-routing branch from 01dac21 to 8ce64a8 Compare March 13, 2026 14:34

Flo4604 force-pushed the frontline-prom-metrics branch from 520e4c6 to 6ddfc5a Compare March 13, 2026 14:34

vercel Bot deployed to Preview March 13, 2026 14:37 View deployment

Flo4604 force-pushed the instance-aware-routing branch from 8ce64a8 to 9204546 Compare March 16, 2026 11:31

Flo4604 force-pushed the frontline-prom-metrics branch from 6ddfc5a to 71aaa96 Compare March 16, 2026 11:31

vercel Bot deployed to Preview March 16, 2026 11:33 View deployment

vercel Bot deployed to Preview March 23, 2026 12:45 View deployment

Flo4604 force-pushed the frontline-prom-metrics branch from b5e9192 to ae5da5b Compare March 23, 2026 12:59

vercel Bot deployed to Preview March 23, 2026 13:00 View deployment

chronark reviewed Mar 23, 2026

View reviewed changes

Comment thread svc/frontline/services/proxy/metrics.go Outdated

chronark reviewed Mar 23, 2026

View reviewed changes

Comment thread svc/frontline/services/proxy/metrics.go

chronark reviewed Mar 23, 2026

View reviewed changes

Comment thread svc/frontline/services/proxy/metrics.go

chronark approved these changes Mar 23, 2026

View reviewed changes

vercel Bot deployed to Preview March 23, 2026 13:43 View deployment

Flo4604 force-pushed the frontline-prom-metrics branch from 979c6ae to f478540 Compare March 23, 2026 13:44

vercel Bot deployed to Preview March 23, 2026 13:46 View deployment

Base automatically changed from instance-aware-routing to main March 23, 2026 16:51

Flo4604 force-pushed the frontline-prom-metrics branch from f478540 to 1226d14 Compare March 24, 2026 09:38

vercel Bot deployed to Preview March 24, 2026 09:40 View deployment

coderabbitai Bot reviewed Mar 24, 2026

View reviewed changes

Comment thread svc/frontline/services/proxy/forward.go Outdated

Comment thread svc/frontline/services/proxy/service.go

Comment thread svc/frontline/services/router/service.go

Comment thread svc/frontline/services/router/service.go

Flo4604 added 2 commits March 24, 2026 12:41

rename to destination

9c213f0

ogzhanolguncu approved these changes Mar 24, 2026

View reviewed changes

Flo4604 force-pushed the frontline-prom-metrics branch from 1226d14 to a759368 Compare March 24, 2026 12:08

vercel Bot deployed to Preview March 24, 2026 12:09 View deployment

Flo4604 enabled auto-merge March 24, 2026 12:18

Flo4604 disabled auto-merge March 24, 2026 12:21

adjust prom things

ca3eab6

Flo4604 force-pushed the frontline-prom-metrics branch from a759368 to ca3eab6 Compare March 24, 2026 12:22

Flo4604 enabled auto-merge March 24, 2026 12:23

vercel Bot deployed to Preview March 24, 2026 12:24 View deployment

Flo4604 added this pull request to the merge queue Mar 24, 2026

Merged via the queue into main with commit f276378 Mar 24, 2026
9 of 10 checks passed

Flo4604 deleted the frontline-prom-metrics branch March 24, 2026 12:31

Conversation

Flo4604 commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Type of change

How should this be tested?

Checklist

Required

Appreciated

Uh oh!

vercel Bot commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Flo4604 commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ogzhanolguncu left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Flo4604 commented Mar 25, 2026

Merge activity

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Flo4604 commented Mar 9, 2026 •

edited

Loading

vercel Bot commented Mar 9, 2026 •

edited

Loading

Flo4604 commented Mar 9, 2026 •

edited

Loading

coderabbitai Bot commented Mar 9, 2026 •

edited

Loading

ogzhanolguncu left a comment •

edited

Loading