Skip to content

feat: add prometheus metrics for frontline routing and proxy#5268

Merged
Flo4604 merged 3 commits intomainfrom
frontline-prom-metrics
Mar 24, 2026
Merged

feat: add prometheus metrics for frontline routing and proxy#5268
Flo4604 merged 3 commits intomainfrom
frontline-prom-metrics

Conversation

@Flo4604
Copy link
Copy Markdown
Member

@Flo4604 Flo4604 commented Mar 9, 2026

What does this PR do?

Some initial frontline metrics that we can tweak change remove over time.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Chore (refactoring code, technical debt, workflow improvements)
  • Enhancement (small improvements)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How should this be tested?

  • Deploy to a test environment and verify metrics are being exported at /metrics endpoint
  • Generate traffic through the proxy and confirm metrics increment correctly
  • Test various error conditions (backend timeouts, connection failures) to verify error categorization
  • Verify routing metrics show correct local vs remote decisions
  • Check that hop count metrics are recorded for cross-region requests

Checklist

Required

  • Filled out the "How to test" section in this PR
  • Read Contributing Guide
  • Self-reviewed my own code
  • Commented on my code in hard-to-understand areas
  • Ran pnpm build
  • Ran pnpm fmt
  • Ran make fmt on /go directory
  • Checked for warnings, there are none
  • Removed all console.logs
  • Merged the latest changes from main onto my branch with git pull origin main
  • My changes don't cause any responsiveness issues

Appreciated

  • If a UI change was made: Added a screen recording or screenshots to this PR
  • Updated the Unkey Docs if changes were necessary

@vercel
Copy link
Copy Markdown

vercel Bot commented Mar 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
dashboard Ready Ready Preview, Comment Mar 24, 2026 0:24am

Request Review

Copy link
Copy Markdown
Member Author

Flo4604 commented Mar 9, 2026

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 9, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b3d5c856-fd45-4236-807b-457defb5e5df

📥 Commits

Reviewing files that changed from the base of the PR and between a759368 and ca3eab6.

📒 Files selected for processing (5)
  • svc/frontline/services/proxy/forward.go
  • svc/frontline/services/proxy/metrics.go
  • svc/frontline/services/router/lookup.go
  • svc/frontline/services/router/metrics.go
  • svc/frontline/services/router/service.go

📝 Walkthrough

Walkthrough

Renamed existing middleware metrics to use the "unkey" namespace and added Prometheus instrumentation across the proxy and router services, including new metric declarations and runtime increments for proxy forwarding, backend responses, hop counts, routing decisions, errors, and durations.

Changes

Cohort / File(s) Summary
Middleware metrics rename
svc/frontline/middleware/observability.go
Renamed three Prometheus metrics to use Namespace:"unkey", Subsystem:"frontline" with new Names (requests_total, request_duration_seconds, active_requests). Help text, buckets, labels, and update sites unchanged.
Proxy: build & metrics
svc/frontline/services/proxy/BUILD.bazel, svc/frontline/services/proxy/metrics.go
Added metrics.go and Prometheus promauto deps. Declares proxyForwardTotal (destination,error), proxyBackendDuration (destination), proxyHopsTotal (hop buckets), and proxyBackendResponseTotal (destination,source,status_class).
Proxy: instrumentation & logic
svc/frontline/services/proxy/forward.go, svc/frontline/services/proxy/service.go
Replaced logTarget with destination. Instrumented ReverseProxy path: record backend start/duration in ModifyResponse/ErrorHandler; increment proxyBackendResponseTotal with source (from X-Unkey-Error-Source defaulting to upstream/sentinel), increment proxyForwardTotal using categorized error types (via new categorizeProxyErrorType) and statusClass; always observe parsed hop counts via proxyHopsTotal and reject only when hops >= max.
Router: build & metrics
svc/frontline/services/router/BUILD.bazel, svc/frontline/services/router/metrics.go
Added metrics.go and Prometheus deps. Declares routingDecisionsTotal (decision,target_region), routingErrorsTotal (reason), and routingDuration (histogram).
Router: instrumentation
svc/frontline/services/router/lookup.go, svc/frontline/services/router/routing.go, svc/frontline/services/router/service.go
Added metric increments on error paths: routingErrorsTotal in lookup and selectSentinel (multiple distinct reasons). Route() now measures duration, records routing decisions (local_sentinel or remote_region) and increments error metric on instance-load failure.

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Proxy as FrontlineProxy
  participant Backend as UpstreamBackend
  participant Metrics as Prometheus

  Client->>Proxy: HTTP request
  Proxy->>Proxy: start backendStart = now
  Proxy->>Backend: forward request
  alt backend success
    Backend-->>Proxy: response (2xx/3xx/4xx/5xx)
    Proxy->>Proxy: ModifyResponse observes duration
    Proxy->>Metrics: proxyBackendDuration.Observe(destination, duration)
    Proxy->>Metrics: proxyBackendResponseTotal.Inc(destination, source, status_class)
    Proxy->>Metrics: proxyForwardTotal.Inc(destination, "none" or categorized)
    Proxy-->>Client: proxied response
  else backend error
    Backend-->>Proxy: network/error
    Proxy->>Proxy: ErrorHandler observes duration
    Proxy->>Metrics: proxyBackendDuration.Observe(destination, duration)
    Proxy->>Metrics: proxyBackendResponseTotal.Inc(destination, source, status_class)
    Proxy->>Metrics: proxyForwardTotal.Inc(destination, categorized_error)
    Proxy-->>Client: fault error
  end
  Proxy->>Metrics: proxyHopsTotal.Observe(hops) (if header present)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The description includes the what/why summary, type selection, and detailed test instructions, but the 'Fixes' issue reference is not completed and none of the required checklist items are marked as done. Complete the 'Fixes # (issue)' reference with an actual issue number, or remove the placeholder if no issue exists. Consider marking relevant checklist items (e.g., formatting, code review) once completed.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding Prometheus metrics for frontline routing and proxy components.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch frontline-prom-metrics

Comment @coderabbitai help to get the list of available commands and usage tips.

@Flo4604 Flo4604 force-pushed the frontline-prom-metrics branch from 3366b70 to 6270600 Compare March 9, 2026 19:41
@Flo4604 Flo4604 force-pushed the instance-aware-routing branch from 66f3e36 to 9adedb9 Compare March 10, 2026 08:34
@Flo4604 Flo4604 force-pushed the frontline-prom-metrics branch from 6270600 to 155ce87 Compare March 10, 2026 08:34
@Flo4604 Flo4604 force-pushed the instance-aware-routing branch from 9adedb9 to cc3a71d Compare March 10, 2026 17:58
@Flo4604 Flo4604 force-pushed the frontline-prom-metrics branch from 155ce87 to 61ced6a Compare March 10, 2026 17:58
@Flo4604 Flo4604 force-pushed the frontline-prom-metrics branch from 61ced6a to 61a8527 Compare March 11, 2026 12:18
@Flo4604 Flo4604 mentioned this pull request Mar 11, 2026
19 tasks
@Flo4604 Flo4604 force-pushed the frontline-prom-metrics branch from 61a8527 to e2de48e Compare March 12, 2026 09:53
@Flo4604 Flo4604 force-pushed the instance-aware-routing branch from cc3a71d to 75cfe4e Compare March 12, 2026 09:53
@Flo4604 Flo4604 force-pushed the frontline-prom-metrics branch from e2de48e to 520e4c6 Compare March 12, 2026 17:14
@Flo4604 Flo4604 force-pushed the instance-aware-routing branch from 75cfe4e to 01dac21 Compare March 12, 2026 17:14
Comment thread svc/frontline/services/proxy/metrics.go Outdated
Comment thread svc/frontline/services/proxy/metrics.go Outdated
Comment thread svc/frontline/services/proxy/metrics.go
Comment thread svc/frontline/services/proxy/metrics.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
svc/frontline/services/router/lookup.go (1)

67-74: ⚠️ Potential issue | 🟠 Major

Double-counting: instance_load_failed incremented here and in service.go:47.

When getInstances fails, this line increments the counter, then the caller in service.go also increments routingErrorsTotal.WithLabelValues("instance_load_failed"). Remove one of the increments.

Proposed fix: remove duplicate in lookup.go
 	if err != nil && !mysql.IsNotFound(err) {
-		routingErrorsTotal.WithLabelValues("instance_load_failed").Inc()
 		return nil, fault.Wrap(err,
 			fault.Code(codes.Frontline.Internal.ConfigLoadFailed.URN()),
 			fault.Internal("error loading instances"),
 			fault.Public("Failed to load instance configuration"),
 		)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@svc/frontline/services/router/lookup.go` around lines 67 - 74, The duplicate
metric increment occurs in the error path inside lookup.go (the block that
returns a wrapped fault when getInstances fails) where
routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() is called;
remove that increment so the counter is only incremented by the caller in
service.go (around service.go:47) and keep the existing fault.Wrap(...) return
as-is to preserve error semantics.
🧹 Nitpick comments (1)
svc/frontline/middleware/observability.go (1)

24-27: Series rename needs rollout plan.

These lines rename existing series. Any dashboard/alert/rule on frontline_* goes dark on deploy. Ship consumer updates with this PR, or dual-publish one release.

Also applies to: 34-38, 45-48

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@svc/frontline/middleware/observability.go` around lines 24 - 27, The change
renames existing Prometheus series (Namespace/Subsystem/Name) which will break
dashboards/alerts; instead emit both old and new series during rollout: keep the
new metric (e.g., the collector using Name: "requests_total") and also register
a second collector with the original namespace/subsystem/name (the previous
frontline_* naming) so both series are published concurrently; apply the same
dual-publish pattern to the other renamed metrics referenced around the other
diffs (lines 34-38 and 45-48) by creating and registering duplicate collectors
(e.g., requestsTotal and requestsTotalLegacy) or otherwise publishing the legacy
metric until consumers are migrated, then remove the legacy collectors in a
follow-up release.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@svc/frontline/services/proxy/forward.go`:
- Around line 147-149: The metric proxyBackendDuration is measured around
proxy.ServeHTTP which blocks until the response body is fully copied to the
client, so it incorrectly includes client streaming time; remove the observe
call around ServeHTTP and instead record backend latency inside the reverse
proxy callbacks by creating an observeBackend() helper and calling it at the
start of both the ReverseProxy ModifyResponse and ErrorHandler handlers
(referencing proxy.ServeHTTP, proxyBackendDuration, ModifyResponse,
ErrorHandler, and observeBackend) so the duration is observed when the backend
response or error is received rather than after client copy completes.

In `@svc/frontline/services/proxy/service.go`:
- Around line 144-160: The proxy currently calls proxyHopsTotal.Observe using
HeaderFrontlineHops in the session before makeRegionDirector increments the hop
count, causing off-by-one/missing metrics; update the code so the metric is
recorded after the hop increment instead—either move the Observe call out of the
session handler and into the director code path where makeRegionDirector (or the
director method that performs the hop increment) increments the hop header, or
have the director emit proxyHopsTotal.Observe with the new incremented value;
reference HeaderFrontlineHops, proxyHopsTotal.Observe, makeRegionDirector, and
the director hop-increment logic when making the change.

In `@svc/frontline/services/router/service.go`:
- Around line 39-43: The routingErrorsTotal metric is being incremented with
label "config_not_found" for all errors returned by lookupByHostname, but
lookupByHostname can return different fault codes from findRoute
(ConfigLoadFailed, ConfigNotFound) and sentinel loading (ConfigLoadFailed);
update the error handling so the metric label matches the actual failure: either
(A) have lookupByHostname return a typed error or fault code and switch on that
in the caller to call
routingErrorsTotal.WithLabelValues("<correct_code>").Inc(), or (B) move the
metric increment into the lower-level functions (findRoute and sentinel loader)
where the concrete error type is known and increment with the appropriate label
there; reference lookupByHostname, findRoute and routingErrorsTotal when making
the change.
- Around line 45-49: The metric
routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() is being
double-counted: it's incremented both at the call site in service.go (around
s.getInstances(ctx, route.DeploymentID)) and inside the callee getInstances
(lookup.go). Remove the duplicate increment from the callee: delete the
routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() inside the
getInstances implementation in lookup.go so the single increment remains at the
s.getInstances call site (or alternatively remove the call-site increment and
keep the one in getInstances if you prefer), ensuring only one location
increments that metric.

---

Outside diff comments:
In `@svc/frontline/services/router/lookup.go`:
- Around line 67-74: The duplicate metric increment occurs in the error path
inside lookup.go (the block that returns a wrapped fault when getInstances
fails) where routingErrorsTotal.WithLabelValues("instance_load_failed").Inc() is
called; remove that increment so the counter is only incremented by the caller
in service.go (around service.go:47) and keep the existing fault.Wrap(...)
return as-is to preserve error semantics.

---

Nitpick comments:
In `@svc/frontline/middleware/observability.go`:
- Around line 24-27: The change renames existing Prometheus series
(Namespace/Subsystem/Name) which will break dashboards/alerts; instead emit both
old and new series during rollout: keep the new metric (e.g., the collector
using Name: "requests_total") and also register a second collector with the
original namespace/subsystem/name (the previous frontline_* naming) so both
series are published concurrently; apply the same dual-publish pattern to the
other renamed metrics referenced around the other diffs (lines 34-38 and 45-48)
by creating and registering duplicate collectors (e.g., requestsTotal and
requestsTotalLegacy) or otherwise publishing the legacy metric until consumers
are migrated, then remove the legacy collectors in a follow-up release.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7e0b0192-dbfd-48c4-a461-2e7ee6ae100a

📥 Commits

Reviewing files that changed from the base of the PR and between b55f66d and 1226d14.

📒 Files selected for processing (10)
  • svc/frontline/middleware/observability.go
  • svc/frontline/services/proxy/BUILD.bazel
  • svc/frontline/services/proxy/forward.go
  • svc/frontline/services/proxy/metrics.go
  • svc/frontline/services/proxy/service.go
  • svc/frontline/services/router/BUILD.bazel
  • svc/frontline/services/router/lookup.go
  • svc/frontline/services/router/metrics.go
  • svc/frontline/services/router/routing.go
  • svc/frontline/services/router/service.go

Comment thread svc/frontline/services/proxy/forward.go Outdated
Comment thread svc/frontline/services/proxy/service.go
Comment thread svc/frontline/services/router/service.go
Comment thread svc/frontline/services/router/service.go
Flo4604 added 2 commits March 24, 2026 12:41
Adds metrics to detect routing and proxy issues:

Router:
- decisions_total: local vs remote routing split by target region
- errors_total: no_running_instances, no_sentinels_for_instances, config_not_found
- duration_seconds: route decision latency (cache + DB)
- running_instances: distribution of available instances per decision

Proxy:
- forward_total: success/failure by target and error type (timeout, conn_refused, etc.)
- backend_duration_seconds: backend response latency by target
- backend_response_total: HTTP status classes (2xx/4xx/5xx) from backends
- hops: cross-region hop count distribution
Copy link
Copy Markdown
Contributor

@ogzhanolguncu ogzhanolguncu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rabbit comments might worth checking

@Flo4604 Flo4604 force-pushed the frontline-prom-metrics branch from a759368 to ca3eab6 Compare March 24, 2026 12:22
@Flo4604 Flo4604 enabled auto-merge March 24, 2026 12:23
@Flo4604 Flo4604 added this pull request to the merge queue Mar 24, 2026
Merged via the queue into main with commit f276378 Mar 24, 2026
9 of 10 checks passed
@Flo4604 Flo4604 deleted the frontline-prom-metrics branch March 24, 2026 12:31
Copy link
Copy Markdown
Member Author

Flo4604 commented Mar 25, 2026

Merge activity

  • Mar 25, 5:43 PM UTC: A user started a stack merge that includes this pull request via Graphite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants