📝 Walkthrough

This pull request adds comprehensive Prometheus metrics instrumentation across the Sentinel service, including engine policy evaluation tracking, middleware error categorization, proxy upstream response metrics, and router service instance selection tracking. Additionally, it introduces a new shared timer package (pkg/prometheus/timer) and caches parsed sentinel policies in the router service.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor ProxyHandler as Proxy Handler
    participant RouterService as Router Service
    participant PolicyCache as Policy Cache
    participant Engine as Engine
    participant DB as Database
    ProxyHandler->>RouterService: GetPolicies(ctx, deployment)
    RouterService->>PolicyCache: SWR lookup (deploymentID)
    alt Cache Hit
        PolicyCache-->>RouterService: []*sentinelv1.Policy
    else Cache Miss
        RouterService->>DB: Get deployment details
        DB-->>RouterService: deployment config
        RouterService->>Engine: ParseMiddleware(SentinelConfig)
        Engine-->>RouterService: []*sentinelv1.Policy (or error)
        RouterService->>PolicyCache: Store policies with TTL
        PolicyCache-->>RouterService: cached
    end
    RouterService-->>ProxyHandler: []*sentinelv1.Policy
    ProxyHandler->>ProxyHandler: Evaluate policies using cached result
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
⚠️ Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3): Command failed
…ract shared timer

- Add sentinel routing metrics: instance selection outcomes + duration histograms
- Add sentinel engine metrics: policy evaluation counts by type and result
- Add sentinel proxy error + upstream response metrics
- Cache parsed sentinel policies in router service to avoid proto unmarshalling on every request
- Extract shared timer package (pkg/prometheus/timer) to avoid copy-pasting duration helpers
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
svc/sentinel/routes/proxy/handler.go (1)
157-168: ⚠️ Potential issue | 🟡 Minor

Emit upstream metrics even when tracking missing.

`Handle` already tolerates missing tracking context. Keeping `upstreamResponseTotal` and duration under `if tracking != nil` means those requests vanish from `/metrics`. Move the counter outside this block; if duration should survive too, keep a local upstream start timestamp.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@svc/sentinel/routes/proxy/handler.go` around lines 157 - 168, The metrics emission is currently inside the tracking nil check so requests without tracking are omitted from metrics; move the upstream metric logic out of the "if tracking != nil" block in the ModifyResponse function so upstreamResponseTotal.WithLabelValues(upstreamStatusClass(resp.StatusCode)).Inc() always runs, and for upstreamDuration use a local start timestamp (e.g. capture a start := tracking.InstanceStart if tracking != nil else timeZero or time.Now() at request start) so you can safely compute and call upstreamDuration.WithLabelValues(...).Observe(...) without dereferencing tracking; update references to tracking.InstanceEnd/ResponseStatus/ResponseHeaders to remain guarded by the existing tracking != nil check.
🧹 Nitpick comments (1)
svc/sentinel/middleware/observability.go (1)
21-49: Metric schema break; migrate queries in the same deploy.

These renames, plus dropping `environment_id`, will break existing PromQL/rules on these series. Roll dashboard/alert/recording-rule updates out with the code, or keep a short compat window.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@svc/sentinel/middleware/observability.go` around lines 21 - 49, The metric label changes in sentinelRequestsTotal, sentinelRequestDuration and sentinelActiveRequests drop/rename the existing environment_id label which will break PromQL, dashboards, alerts and recording rules; either restore the previous label set (re-add "environment_id" to the label lists for sentinelRequestsTotal, sentinelRequestDuration and sentinelActiveRequests) or emit compatibility metrics alongside the new schema (duplicate the metrics with the old label set) and coordinate deploying dashboard/alert/recording-rule updates in the same release; update the metric definitions in observability.go (symbols: sentinelRequestsTotal, sentinelRequestDuration, sentinelActiveRequests) accordingly so queries continue to work during rollout.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 6197a52e-8b06-4759-a28b-f23265f5aa48
📒 Files selected for processing (14)

- svc/sentinel/engine/BUILD.bazel
- svc/sentinel/engine/engine.go
- svc/sentinel/engine/metrics.go
- svc/sentinel/middleware/BUILD.bazel
- svc/sentinel/middleware/error_handling.go
- svc/sentinel/middleware/metrics.go
- svc/sentinel/middleware/observability.go
- svc/sentinel/routes/proxy/BUILD.bazel
- svc/sentinel/routes/proxy/handler.go
- svc/sentinel/routes/proxy/metrics.go
- svc/sentinel/services/router/BUILD.bazel
- svc/sentinel/services/router/interface.go
- svc/sentinel/services/router/metrics.go
- svc/sentinel/services/router/service.go
```go
func categorizeProxyErrorTypeForMetrics(err error) string {
	if errors.Is(err, context.Canceled) {
		return "client_canceled"
	}
	if errors.Is(err, context.DeadlineExceeded) || os.IsTimeout(err) {
		return "timeout"
	}

	var netErr *net.OpError
	if errors.As(err, &netErr) {
		if netErr.Timeout() {
			return "timeout"
		}
		if errors.Is(netErr.Err, syscall.ECONNREFUSED) {
			return "conn_refused"
		}
		if errors.Is(netErr.Err, syscall.ECONNRESET) {
			return "conn_reset"
		}
	}

	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return "dns_failure"
	}

	return "other"
}
```
`EHOSTUNREACH` falls into `other`.

`categorizeProxyError` treats host-unreachable as `ServiceUnavailable`, but this helper has no matching branch, so the new metric loses that signal. Better: share one typed classifier for URN/message + metric label.
As per coding guidelines, "Make illegal states unrepresentable by modeling domain with ADTs/discriminated unions and parsing inputs at boundaries into typed structures".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@svc/sentinel/middleware/metrics.go` around lines 28 - 55,
categorizeProxyErrorTypeForMetrics currently maps EHOSTUNREACH to "other",
losing the ServiceUnavailable signal; update it to detect host-unreachable
errors (check syscall.EHOSTUNREACH and inspect net.OpError.Err for EHOSTUNREACH)
and return the same metric label used by the higher-level classifier (e.g.,
"service_unavailable" or the label used by categorizeProxyError) so both metrics
and response classification stay consistent, and consider refactoring by
extracting a shared typed classifier used by both categorizeProxyError and
categorizeProxyErrorTypeForMetrics to avoid duplicated logic.
```go
func upstreamStatusClass(code int) string {
	switch {
	case code >= 500:
		return "5xx"
	case code >= 400:
		return "4xx"
	case code >= 300:
		return "3xx"
	default:
		return "2xx"
	}
}
```
upstreamStatusClass returns "2xx" for 1xx and invalid status codes.
Codes below 200 (including 1xx informational, 0, or negative values) fall through to "2xx". Consider explicit handling:
Proposed fix
```diff
 func upstreamStatusClass(code int) string {
 	switch {
 	case code >= 500:
 		return "5xx"
 	case code >= 400:
 		return "4xx"
 	case code >= 300:
 		return "3xx"
-	default:
+	case code >= 200:
 		return "2xx"
+	default:
+		return "other"
 	}
 }
```

🤖 Prompt for AI Agents
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@svc/sentinel/routes/proxy/metrics.go` around lines 31 - 42, The
upstreamStatusClass function currently treats any code <300 that isn't
3xx/4xx/5xx as "2xx", which mislabels 1xx and invalid codes; update
upstreamStatusClass to explicitly handle 100-199 as "1xx", 200-299 as "2xx" and
return a clear fallback (e.g., "unknown" or "invalid") for codes <=0 or outside
1xx-5xx, and adjust the switch/conditions accordingly in the upstreamStatusClass
function so callers receive correct status buckets.
```go
		sentinelInstanceSelectionTotal.WithLabelValues("error").Inc()
		return db.Deployment{}, fault.Wrap(err,
			fault.Code(codes.Sentinel.Internal.InternalServerError.URN()),
			fault.Internal("failed to get deployment"),
		)
	}

	if hit == cache.Null || db.IsNotFound(err) {
		sentinelInstanceSelectionTotal.WithLabelValues("deployment_not_found").Inc()
```
`deployment_not_found` isn't an instance-selection outcome.

`GetDeployment` now increments `sentinelInstanceSelectionTotal` for lookup failures. That muddies the metric: dashboards can't tell whether routing failed before selection or during selection. Use a separate lookup counter, or add a stage label.
Also applies to: 262-262
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@svc/sentinel/services/router/service.go` around lines 239 - 247, The code
incorrectly uses sentinelInstanceSelectionTotal for deployment lookup failures;
create a separate metric (e.g., sentinelDeploymentLookupTotal) or add a stage
label to sentinelInstanceSelectionTotal and use that for lookup outcomes, then
replace the increments around GetDeployment failure branches (the blocks that
call sentinelInstanceSelectionTotal.WithLabelValues("error") and
WithLabelValues("deployment_not_found") near the cache.Null / db.IsNotFound
checks and the similar occurrence at the other location) so lookup failures
increment the new lookup metric (or use the "lookup" stage label) while
preserving sentinelInstanceSelectionTotal for true instance-selection outcomes
only.

What does this PR do?
Adds Prometheus metrics instrumentation to the Sentinel service to improve observability and monitoring capabilities.
New metrics added:
- `sentinel_engine_evaluations_total` - Counts policy evaluations by type (keyauth) and result (success/denied/error/skipped)
- `sentinel_proxy_errors_total` - Tracks proxy errors categorized by type (timeout, connection refused, DNS failure, etc.)
- `sentinel_upstream_response_total` - Counts upstream responses by HTTP status class (2xx, 3xx, 4xx, 5xx)
- `sentinel_upstream_duration_seconds` - Measures backend response latency excluding Sentinel overhead
- `sentinel_routing_instance_selection_total` - Tracks instance selection outcomes
- `sentinel_routing_duration_seconds` - Measures duration of routing operations

Key changes:
Type of change
How should this be tested?
- `/metrics` endpoint exposes new metrics

Checklist
Required
- `pnpm build`
- `pnpm fmt`
- `make fmt` on `/go` directory
- `console.logs`
- `git pull origin main`

Appreciated