feat: cancel deployment #5274

Open
chronark wants to merge 1 commit into main from cancel-deployment

Conversation

@chronark
Collaborator

@chronark chronark commented Mar 10, 2026

Stores the Restate invocation ID of a deployment so that it can be cancelled/killed later.

I did not review this yet
@vercel

vercel bot commented Mar 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project   | Deployment | Actions          | Updated (UTC)
dashboard | Ready      | Preview, Comment | Mar 10, 2026 7:09am


@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough


This pull request introduces a cancel deployment feature across the entire application stack. The changes add database support for tracking invocation IDs and a cancelled deployment status, extend the control service protobuf with a new CancelDeployment RPC, implement the corresponding backend handler, integrate Restate admin client for invocation termination, and update the frontend to display and trigger cancellation UI.

Changes

Cohort / File(s) Summary
Database Schema & Models
pkg/db/schema.sql, pkg/db/models_generated.go, web/internal/db/src/schema/deployments.ts
Added invocation_id varchar column to deployments table and introduced DeploymentsStatusCancelled enum value alongside existing deployment statuses.
Generated SQL Queries
pkg/db/deployment_find_by_id.sql_generated.go, pkg/db/deployment_find_by_k8s_name.sql_generated.go, pkg/db/deployment_list_by_environment_id_and_status.sql_generated.go, pkg/db/deployment_topology_list_by_versions.sql_generated.go, pkg/db/deployment_topology_list_desired.sql_generated.go
Updated SELECT statements and row scanning to include invocation_id column in all deployment query results; added InvocationID field population across multiple generated query handlers.
Deployment Update Generated Code
pkg/db/queries/deployment_update_invocation_id.sql, pkg/db/deployment_update_invocation_id.sql_generated.go, pkg/db/querier_generated.go, pkg/db/BUILD.bazel
New SQL query and generated Go code to update deployments' invocation_id and updated_at fields; added UpdateDeploymentInvocationID method to Querier interface and generated implementation.
Restate Admin Client
pkg/restate/admin/client.go
Added KillInvocation method to terminate Restate invocations; refactored invocation termination logic into shared terminateInvocation helper supporting both cancel and kill modes; broadened success conditions to accept any 2xx status.
Control Service Protocol
svc/ctrl/proto/ctrl/v1/deployment.proto
Added DEPLOYMENT_STATUS_CANCELLED enum value, introduced CancelDeploymentRequest and CancelDeploymentResponse messages, and added CancelDeployment RPC to DeployService.
Deployment Service Core
svc/ctrl/services/deployment/service.go, svc/ctrl/services/deployment/cancel_deployment.go, svc/ctrl/services/deployment/create_deployment.go, svc/ctrl/services/deployment/get_deployment.go
Added RestateAdmin client field to Service config and initialization; implemented new CancelDeployment handler with status validation and invocation termination; updated deployment creation to persist invocation IDs to database; extended status conversion to handle cancelled state.
Deployment Build Configuration
svc/ctrl/services/deployment/BUILD.bazel, svc/ctrl/api/run.go
Added cancel_deployment.go source file and //pkg/restate/admin dependency; wired RestateAdmin client into deployment service configuration.
Worker Workflows
svc/ctrl/worker/deploy/deploy_handler.go, svc/ctrl/worker/githubwebhook/handle_push.go
Inserted two 30-second delays in Deploy workflow; modified webhook handler to capture and persist deployment invocation IDs from Restate workflow responses.
Control API
svc/api/routes/v2_deploy_get_deployment/handler.go, svc/api/openapi/..., svc/api/internal/testutil/mock_deployment_client.go
Added handling for DeploymentsStatusCancelled status in response mapping; added CANCELLED to OpenAPI deployment status enums; extended mock deployment client with CancelDeployment method and tracking fields.
Logging Utilities
svc/frontline/services/router/BUILD.bazel, svc/frontline/services/router/service.go
Added //pkg/logger dependency and implemented error logging with environment context in sentinel lookup failures.
Frontend Cancel Dialog Component
web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/actions/cancel-dialog.tsx
New React component rendering a confirmation dialog for deployment cancellation; integrates TRPC mutation, handles success/error states, and manages loading feedback via toast notifications.
Frontend Action Menu & Filters
web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/actions/deployment-list-table-action.popover.constants.tsx, web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/filters.schema.ts
Added cancel action to deployment menu with conditional enablement based on deployment status; expanded DEPLOYMENT_STATUSES and GROUPED_DEPLOYMENT_STATUSES to include cancelled; updated filter color mapping and status expansion logic.
Frontend Status Display
web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/deployment-status-badge.tsx, web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/components/deployment-status-badge.tsx
Added cancelled status badge configuration with Ban icon and gray styling to deployment status badge components; updated status type unions to include cancelled.
Frontend Deployment List & Details
web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/deployments-list.tsx, web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/domain_list.tsx, web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/utils/get-row-class.ts
Refactored terminal state logic to treat cancelled and failed as equivalent for resource display suppression and placeholder rendering; unified styling for both terminal statuses.
Frontend Deployment Progress
web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/[deploymentId]/(deployment-progress)/deployment-progress.tsx
Introduced isCancelled flag to distinguish cancelled from failed deployments; updated step descriptions and final messaging to reflect cancellation context; refined error banner styling and text based on actual failure versus cancellation.
Frontend TRPC Mutation & Router
web/apps/dashboard/lib/trpc/routers/deploy/deployment/cancel.ts, web/apps/dashboard/lib/trpc/routers/index.ts
New cancel mutation procedure validating deployment access, checking cancellable statuses, invoking control service gRPC, applying rate limiting, and returning status; exposed cancel endpoint in deployment router.
Frontend Schema
web/apps/dashboard/lib/collections/deploy/deployments.ts
Extended Deployment schema to include cancelled status value in the status enum.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check (⚠️ Warning): The PR description is minimal and lacks critical required sections from the template, including issue reference, type of change selection, testing instructions, and checklist items, though it does convey the core intent.
    Resolution: Add the missing template sections: reference an issue, select the type of change, describe test procedures, and complete the required checklist items before merging.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 60.00%, below the required threshold of 80.00%.
    Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title 'feat: cancel deployment' directly addresses the main feature introduced throughout the changeset: the ability to cancel deployments by storing invocation IDs and implementing cancel logic.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Command failed




@chronark
Collaborator Author

cries in merge conflicts

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/filters.schema.ts (1)

56-69: ⚠️ Potential issue | 🟡 Minor

ready still falls through to the default color.

Line 57 checks "completed", but this file exposes "ready" as the grouped terminal status. Ready filters will still render with bg-info-9 after this change.

💡 Minimal fix
-      if (value === "completed") {
+      if (value === "ready") {
         return "bg-success-9";
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/filters.schema.ts
around lines 56 - 69, getColorClass currently only checks for "completed" and
leaves the grouped terminal status "ready" falling through to the default;
update getColorClass to treat "ready" as a terminal/completed status (e.g.,
return the same "bg-success-9" as "completed") by adding a check for value ===
"ready" or combining it with the existing "completed" condition so ready filters
render with the correct green background.
web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/[deploymentId]/(deployment-progress)/deployment-progress.tsx (1)

21-53: ⚠️ Potential issue | 🟡 Minor

Cancelled deployments continue polling indefinitely.

Lines 28 and 38 refetch both queries every second unconditionally, but line 46 stops the local timer when isCancelled is true. Since cancelled deployments remain on this page without automatic redirect, the client will keep hitting both endpoints until the user navigates away.

Minimal fix
  const steps = trpc.deploy.deployment.steps.useQuery(
    {
      deploymentId: deployment.id,
    },
    {
-      refetchInterval: 1_000,
+      refetchInterval: isCancelled ? false : 1_000,
    },
  );
@@
  const buildSteps = trpc.deploy.deployment.buildSteps.useQuery(
    {
      deploymentId: deployment.id,
      includeStepLogs: true,
    },
    {
-      refetchInterval: 1_000,
+      refetchInterval: isCancelled ? false : 1_000,
    },
  );

This pattern is already used in the codebase (e.g., use-fetch-timeseries.ts).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/[deploymentId]/(deployment-progress)/deployment-progress.tsx
around lines 21 - 53, The steps and buildSteps queries are refetching every
second unconditionally causing continued polling for cancelled deployments;
update the trpc.deploy.deployment.steps.useQuery and
trpc.deploy.deployment.buildSteps.useQuery calls to disable polling when
isCancelled (or equivalently add enabled: !isCancelled or set refetchInterval:
undefined when isCancelled) so they stop refetching for cancelled deployments,
and remove any reliance on the local now timer in the queries (leave the
useEffect for now display untouched) — target the steps and buildSteps query
option objects to conditionally set refetchInterval or enabled based on the
isCancelled boolean.
🧹 Nitpick comments (3)
web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/actions/cancel-dialog.tsx (1)

35-42: Don’t swallow mutation failures in handleCancel.

mutateAsync already routes errors through onError; the extra .catch only logs and suppresses explicit failure handling at the call site.

💡 Proposed fix
   const handleCancel = async () => {
-    await cancelDeployment
-      .mutateAsync({
-        deploymentId: selectedDeployment.id,
-      })
-      .catch((error) => {
-        console.error("Cancel error:", error);
-      });
+    await cancelDeployment.mutateAsync({
+      deploymentId: selectedDeployment.id,
+    });
   };
As per coding guidelines: "Use `fault` library for error handling".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/actions/cancel-dialog.tsx
around lines 35 - 42, The current handleCancel swallows errors by attaching a
.catch to cancelDeployment.mutateAsync; remove that .catch so errors can
propagate to the mutation's onError handler, and instead route failures through
the project's fault error-handling mechanism (e.g., wrap or capture the promise
with the fault library as per guidelines) so failures are reported but not
suppressed; update handleCancel to call cancelDeployment.mutateAsync({
deploymentId: selectedDeployment.id }) without the .catch and integrate fault
(using the project's fault API) to record/handle any thrown error.
web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/actions/deployment-list-table-action.popover.constants.tsx (1)

40-48: Type the cancellable-status list as deployment statuses.

This currently relies on a plain string array; tightening it to Deployment["status"] avoids silent typos and keeps illegal states harder to represent.

💡 Proposed fix
-    const cancellableStatuses = [
+    const cancellableStatuses: ReadonlySet<Deployment["status"]> = new Set([
       "pending",
       "starting",
       "building",
       "deploying",
       "network",
       "finalizing",
-    ];
-    const canCancel = cancellableStatuses.includes(selectedDeployment.status);
+    ]);
+    const canCancel = cancellableStatuses.has(selectedDeployment.status);
As per coding guidelines: "Make illegal states unrepresentable by modeling domain with discriminated unions and parsing inputs at boundaries into typed structures".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/actions/deployment-list-table-action.popover.constants.tsx
around lines 40 - 48, The cancellableStatuses array is untyped and can silently
accept typos; change it to a typed array of the deployment status union (e.g.,
declare cancellableStatuses as Deployment["status"][] and import the Deployment
type) so the compiler enforces valid statuses, then keep the existing canCancel
check (const canCancel =
cancellableStatuses.includes(selectedDeployment.status)) to work with the typed
statuses; ensure the literal values exactly match members of
Deployment["status"] or adjust to use a const assertion if needed.
svc/ctrl/services/deployment/service.go (1)

33-45: Reject nil RestateAdmin in New.

CancelDeployment already treats a missing admin client as misconfiguration, but New still constructs Service in that state. Fail fast here so the service cannot start partially configured.

As per coding guidelines, "Make illegal states unrepresentable by modeling domain with ADTs/discriminated unions and parsing inputs at boundaries into typed structures".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@svc/ctrl/services/deployment/service.go` around lines 33 - 45, New currently
allows constructing a Service with a nil RestateAdmin, which defers a
configuration error to runtime; update New to validate cfg.RestateAdmin and fail
fast: check cfg.RestateAdmin for nil at the start of New and return a clear
fatal error (panic or log.Fatalf) with context naming RestateAdmin (or change
New's signature to return (*Service, error) and return an error) so the Service
cannot be constructed in a partially configured state; reference the New
function, Service struct and cfg.RestateAdmin when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/restate/admin/client.go`:
- Around line 105-125: The terminateInvocation function should reject empty
invocationID before constructing the URL and interpreting a 404 as success: add
a guard at the top of Client.terminateInvocation that returns a clear error
(e.g., fmt.Errorf or errors.New) when invocationID == "" so you never call c.do
or map a 404 to nil for a missing ID; keep the rest of the logic (URL formation,
response handling) unchanged and reference the same function name
terminateInvocation to locate where to add the check.

In `@svc/ctrl/services/deployment/create_deployment.go`:
- Around line 300-307: The code treats persisting the invocation ID
(db.Query.UpdateDeploymentInvocationID call using parameters ID=deploymentID and
InvocationID=sql.NullString{String: invocation.Id()}) as best-effort; instead
make persistence mandatory: after creating the workflow/invocation, if
UpdateDeploymentInvocationID returns an error, either retry the DB update a few
times with backoff or cancel/terminate the started invocation and return an
error to the caller so the deployment can be retried/rolled back. Concretely,
wrap the db.Query.UpdateDeploymentInvocationID call in a retry loop (with small
sleep/backoff) and only proceed/return success once it succeeds; if retries
exhaust, call the invocation cancel/terminate API (using invocation.Id()) and
return an error (no longer just logger.Warn) so the caller knows persistence
failed.

In `@svc/ctrl/worker/deploy/deploy_handler.go`:
- Around line 186-187: Replace the blocking time.Sleep call with a cancellable,
durable timer by using restate.Sleep(ctx, 30*time.Second) (or remove the sleep
if it was only for debugging); locate the blocking call in deploy_handler.go
(e.g., inside the deploy handler function where time.Sleep(30 * time.Second) is
used) and swap it to restate.Sleep(ctx, 30*time.Second) so the operation
respects the passed context and Restate retry/cancellation semantics.

In `@svc/ctrl/worker/githubwebhook/handle_push.go`:
- Around line 185-197: The persistence of invocationID (invocationID :=
inv.GetInvocationId()) must not be downgraded to a warn; if
UpdateDeploymentInvocationID (db.Query.UpdateDeploymentInvocationID) fails the
handler should surface/handle the error so the workflow retains a cancel handle.
Replace the logger.Warn path: make the call retryable or return the error from
the surrounding function (i.e., propagate the error instead of swallowing it).
Use a retry/backoff loop or the existing restate.RunVoid wrapper with retry
semantics (or wrap UpdateDeploymentInvocationID in a retry helper) and if
retries ultimately fail return an error (or abort the request) rather than
logging and continuing; keep references to invocationID, restate.RunVoid and
UpdateDeploymentInvocationID to locate the change.

In web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/actions/cancel-dialog.tsx:
- Around line 18-24: The onSuccess handler currently calls utils.invalidate()
and fires collection.deployments.utils.refetch() without awaiting it, causing
racey UI; update the handler to scope invalidation/refetch to the deployments
query (use collection.deployments.utils.invalidate() or
collection.deployments.utils.refetch()) and await the promise(s) (await
collection.deployments.utils.refetch() and await any async invalidate call)
inside a try/catch before closing the dialog or showing success to ensure the
deployments list is refreshed when the UI updates.

In `@web/apps/dashboard/lib/trpc/routers/deploy/deployment/cancel.ts`:
- Around line 75-85: The current catch on ctrl.cancelDeployment({ deploymentId:
input.deploymentId }) swallows control-plane errors and converts everything to a
generic INTERNAL_SERVER_ERROR with message: err; instead, detect and rethrow the
original error when it already represents a TRPCError (or has a code-like
property) so NOT_FOUND/FAILED_PRECONDITION are preserved, and when wrapping is
necessary create a TRPCError using a safe string message (e.g., err.message or
String(err)) rather than the raw err object; update the catch around
ctrl.cancelDeployment to check for err instanceof TRPCError (or err.code) and
throw err directly, otherwise throw new TRPCError({ code:
"INTERNAL_SERVER_ERROR", message: err.message ?? String(err) }).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a6fee186-d49e-47a4-ad8d-31a20d780ffd

📥 Commits

Reviewing files that changed from the base of the PR and between 25c27f5 and 04548e9.

⛔ Files ignored due to path filters (4)
  • gen/proto/ctrl/v1/ctrlv1connect/deployment.connect.go is excluded by !**/gen/**
  • gen/proto/ctrl/v1/deployment.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/rpc/ctrl/deployment_generated.go is excluded by !**/gen/**
  • web/apps/dashboard/gen/proto/ctrl/v1/deployment_pb.ts is excluded by !**/gen/**
📒 Files selected for processing (41)
  • pkg/db/BUILD.bazel
  • pkg/db/deployment_find_by_id.sql_generated.go
  • pkg/db/deployment_find_by_k8s_name.sql_generated.go
  • pkg/db/deployment_list_by_environment_id_and_status.sql_generated.go
  • pkg/db/deployment_topology_list_by_versions.sql_generated.go
  • pkg/db/deployment_topology_list_desired.sql_generated.go
  • pkg/db/deployment_update_invocation_id.sql_generated.go
  • pkg/db/models_generated.go
  • pkg/db/querier_generated.go
  • pkg/db/queries/deployment_update_invocation_id.sql
  • pkg/db/schema.sql
  • pkg/restate/admin/client.go
  • svc/api/internal/testutil/mock_deployment_client.go
  • svc/api/openapi/gen.go
  • svc/api/openapi/openapi-generated.yaml
  • svc/api/openapi/spec/paths/v2/deploy/getDeployment/V2DeployGetDeploymentResponseData.yaml
  • svc/api/routes/v2_deploy_get_deployment/handler.go
  • svc/ctrl/api/run.go
  • svc/ctrl/proto/ctrl/v1/deployment.proto
  • svc/ctrl/services/deployment/BUILD.bazel
  • svc/ctrl/services/deployment/cancel_deployment.go
  • svc/ctrl/services/deployment/create_deployment.go
  • svc/ctrl/services/deployment/get_deployment.go
  • svc/ctrl/services/deployment/service.go
  • svc/ctrl/worker/deploy/deploy_handler.go
  • svc/ctrl/worker/githubwebhook/handle_push.go
  • svc/frontline/services/router/BUILD.bazel
  • svc/frontline/services/router/service.go
  • web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/[deploymentId]/(deployment-progress)/deployment-progress.tsx
  • web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/actions/cancel-dialog.tsx
  • web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/actions/deployment-list-table-action.popover.constants.tsx
  • web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/deployment-status-badge.tsx
  • web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/domain_list.tsx
  • web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/deployments-list.tsx
  • web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/utils/get-row-class.ts
  • web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/filters.schema.ts
  • web/apps/dashboard/app/(app)/[workspaceSlug]/projects/[projectId]/components/deployment-status-badge.tsx
  • web/apps/dashboard/lib/collections/deploy/deployments.ts
  • web/apps/dashboard/lib/trpc/routers/deploy/deployment/cancel.ts
  • web/apps/dashboard/lib/trpc/routers/index.ts
  • web/internal/db/src/schema/deployments.ts

Comment on lines +51 to +83

	// Cancel the Restate invocation if we have an invocation ID
	if dep.InvocationID.Valid && dep.InvocationID.String != "" {
		if s.restateAdmin == nil {
			return nil, connect.NewError(connect.CodeInternal, fmt.Errorf("restate admin client is not configured"))
		}

		if killErr := s.restateAdmin.KillInvocation(ctx, dep.InvocationID.String); killErr != nil {
			logger.Warn("failed to kill deployment invocation",
				"deployment_id", deploymentID,
				"invocation_id", dep.InvocationID.String,
				"error", killErr,
			)
			// Continue with status update even if kill fails
		} else {
			logger.Info("killed deployment invocation",
				"deployment_id", deploymentID,
				"invocation_id", dep.InvocationID.String,
			)
		}
	} else {
		logger.Warn("no invocation ID stored for deployment, updating status only",
			"deployment_id", deploymentID,
		)
	}

	// Mark the deployment as cancelled
	err = db.Query.UpdateDeploymentStatus(ctx, s.db.RW(), db.UpdateDeploymentStatusParams{
		ID:        deploymentID,
		Status:    db.DeploymentsStatusCancelled,
		UpdatedAt: sql.NullInt64{Valid: true, Int64: time.Now().UnixMilli()},
	})
	if err != nil {
		return nil, connect.NewError(connect.CodeInternal, fmt.Errorf("failed to update deployment status: %w", err))

⚠️ Potential issue | 🔴 Critical

Only write cancelled after a real, current cancel.

Both the killErr / missing-invocation path and the stale-read race still fall through to UpdateDeploymentStatus. That can mark a deployment cancelled even though its workflow kept running or already reached a terminal state. Make the status write conditional on successful termination and the deployment still being in a cancellable status.

Comment on lines +300 to +307

	// Store invocation ID for later cancellation
	if updateErr := db.Query.UpdateDeploymentInvocationID(ctx, s.db.RW(), db.UpdateDeploymentInvocationIDParams{
		ID:           deploymentID,
		InvocationID: sql.NullString{Valid: true, String: invocation.Id()},
		UpdatedAt:    sql.NullInt64{Valid: true, Int64: time.Now().UnixMilli()},
	}); updateErr != nil {
		logger.Warn("failed to store invocation ID", "deployment_id", deploymentID, "error", updateErr)
	}

⚠️ Potential issue | 🟠 Major

Don't treat invocation-id persistence as best-effort.

If Lines 301-307 fail, the workflow is already running but this deployment can no longer be cancelled/killed later—the core capability added in this PR. This needs a retry or compensating path, not just a warning.


Comment on lines +186 to +187

```go
time.Sleep(30 * time.Second)
```
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Restate Go SDK durable workflows, what is the recommended way to delay execution? Is using time.Sleep inside a workflow/object handler supported, and what are the replay/cancellation implications?

💡 Result:

Recommended way to delay execution (Go SDK)

Use Restate’s durable timers, not Go’s time.Sleep:

  • restate.Sleep(ctx, d) to pause a handler durably. The timer is tracked by Restate so it survives crashes/restarts and resumes with the remaining time. During the sleep Restate can suspend the handler (freeing resources), and on retry it replays via the journal rather than redoing completed context actions. [1][2]
  • Prefer delayed messages (schedule an invocation in the future) when the handler shouldn’t stay “in-flight” during the delay—Restate explicitly recommends delayed messages over “sleep + send” for long delays / object concurrency / versioning reasons. [1]

Is time.Sleep inside a workflow/object handler supported?

It’s not the recommended/supported durable-wait mechanism in Restate handlers. time.Sleep is just a local process block: it doesn’t create a durable timer entry in Restate’s journal, doesn’t let Restate suspend the handler, and interacts poorly with replay/failures. Restate’s durability/replay is built around context actions (like sleep/timers, service calls, run blocks) being recorded and replayed from the journal. [2]

Replay + cancellation implications

  • Replay/failures: With restate.Sleep, Restate can recover correctly (e.g., if it slept 8h of a 12h timer, it will only sleep ~4h more). With time.Sleep, a crash/retry can cause the handler to sleep again from scratch, because that wait wasn’t durably recorded. [1][2]
  • Cancellation: When an invocation is canceled, Restate surfaces the cancellation as a terminal error the next time the handler awaits a Restate context Promise/Future/action (including sleep/timers). So restate.Sleep / restate.After(...).Done() will return a cancellation error promptly at the await point. A plain time.Sleep will generally not be interrupted by Restate cancellation; the handler will remain blocked until the sleep ends and only then observe anything (if it checks). [3][4]

Bottom line: use restate.Sleep/restate.After (or delayed messages) for delays; avoid time.Sleep inside workflow/object handlers. [1][3]


🏁 Script executed:

```shell
# Find the deploy_handler.go file
fd deploy_handler.go
```

Repository: unkeyed/unkey


🏁 Script executed:

```shell
# Check the file to confirm lines 186-187 and surrounding context
cat -n svc/ctrl/worker/deploy/deploy_handler.go | sed -n '180,195p'
```


🏁 Script executed:

```shell
# Look for the Deploy method signature and Workflow type to confirm it's a Restate handler
rg -A 5 "func.*Deploy" svc/ctrl/worker/deploy/deploy_handler.go | head -30
```


Replace time.Sleep with restate.Sleep (or remove if debug code).

Line 186 blocks the handler with a hard 30s delay, which breaks Restate durability (crashes retry from scratch), prevents cancellation, and reduces deployment responsiveness. Replace with restate.Sleep(ctx, 30*time.Second) for a durable timer, or remove entirely if this is temporary debug code.

✂️ Minimal fix (durable)

```diff
-	time.Sleep(30 * time.Second)
+	restate.Sleep(ctx, 30*time.Second)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@svc/ctrl/worker/deploy/deploy_handler.go` around lines 186 - 187, Replace the
blocking time.Sleep call with a cancellable, durable timer by using
restate.Sleep(ctx, 30*time.Second) (or remove the sleep if it was only for
debugging); locate the blocking call in deploy_handler.go (e.g., inside the
deploy handler function where time.Sleep(30 * time.Second) is used) and swap it
to restate.Sleep(ctx, 30*time.Second) so the operation respects the passed
context and Restate retry/cancellation semantics.

Comment on lines +185 to +197

```go
// Store invocation ID for later cancellation
invocationID := inv.GetInvocationId()
if invocationID != "" {
	if storeErr := restate.RunVoid(ctx, func(runCtx restate.RunContext) error {
		return db.Query.UpdateDeploymentInvocationID(runCtx, s.db.RW(), db.UpdateDeploymentInvocationIDParams{
			ID:           deploymentID,
			InvocationID: sql.NullString{Valid: true, String: invocationID},
			UpdatedAt:    sql.NullInt64{Valid: true, Int64: time.Now().UnixMilli()},
		})
	}, restate.WithName("store invocation id")); storeErr != nil {
		logger.Warn("failed to store invocation ID", "deployment_id", deploymentID, "error", storeErr)
	}
}
```
⚠️ Potential issue | 🟠 Major

Don't downgrade invocation-ID persistence to a warning.

This write is the only durable link between the deployment row and the Restate workflow. If it fails, the workflow still runs but later cancellation has no kill handle and can only mutate DB state. Please make this retryable/compensated instead of returning success here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@svc/ctrl/worker/githubwebhook/handle_push.go` around lines 185 - 197, The
persistence of invocationID (invocationID := inv.GetInvocationId()) must not be
downgraded to a warn; if UpdateDeploymentInvocationID
(db.Query.UpdateDeploymentInvocationID) fails the handler should surface/handle
the error so the workflow retains a cancel handle. Replace the logger.Warn path:
make the call retryable or return the error from the surrounding function (i.e.,
propagate the error instead of swallowing it). Use a retry/backoff loop or the
existing restate.RunVoid wrapper with retry semantics (or wrap
UpdateDeploymentInvocationID in a retry helper) and if retries ultimately fail
return an error (or abort the request) rather than logging and continuing; keep
references to invocationID, restate.RunVoid and UpdateDeploymentInvocationID to
locate the change.

Comment on lines +18 to +24

```tsx
onSuccess: () => {
  utils.invalidate();
  try {
    collection.deployments.utils.refetch();
  } catch (error) {
    console.error("Refetch error:", error);
  }
```
⚠️ Potential issue | 🟠 Major

Scope and await the post-cancel cache refresh to avoid a racy/stale UI.

Line 19 invalidates the entire TRPC cache, and the refetch on Line 21 is not awaited before closing the dialog and toasting. This can cause unnecessary network churn and show success feedback before the deployment list is actually fresh.

💡 Proposed fix

```diff
-  const cancelDeployment = trpc.deploy.deployment.cancel.useMutation({
-    onSuccess: () => {
-      utils.invalidate();
-      try {
-        collection.deployments.utils.refetch();
-      } catch (error) {
-        console.error("Refetch error:", error);
-      }
-      onClose();
-      toast.success("Deployment cancelled");
-    },
+  const cancelDeployment = trpc.deploy.deployment.cancel.useMutation({
+    onSuccess: async () => {
+      await Promise.all([
+        utils.deploy.deployment.list.invalidate({ projectId: selectedDeployment.projectId }),
+        collection.deployments.utils.refetch(),
+      ]);
+      onClose();
+      toast.success("Deployment cancelled");
+    },
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@web/apps/dashboard/app/`(app)/[workspaceSlug]/projects/[projectId]/(overview)/deployments/components/table/components/actions/cancel-dialog.tsx
around lines 18 - 24, The onSuccess handler currently calls utils.invalidate()
and fires collection.deployments.utils.refetch() without awaiting it, causing
racey UI; update the handler to scope invalidation/refetch to the deployments
query (use collection.deployments.utils.invalidate() or
collection.deployments.utils.refetch()) and await the promise(s) (await
collection.deployments.utils.refetch() and await any async invalidate call)
inside a try/catch before closing the dialog or showing success to ensure the
deployments list is refreshed when the UI updates.

Comment on lines +75 to +85
await ctrl
.cancelDeployment({
deploymentId: input.deploymentId,
})
.catch((err) => {
console.error(err);
throw new TRPCError({
code: "INTERNAL_SERVER_ERROR",
message: err,
});
});
⚠️ Potential issue | 🟠 Major

Keep ctrl cancellation failures intact.

This catch rewrites NOT_FOUND / FAILED_PRECONDITION responses from the control plane into INTERNAL_SERVER_ERROR, and message: err can surface an unusable message. Common cancel races will read like 500s in the UI. Translate the ctrl failure instead of overwriting it here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/apps/dashboard/lib/trpc/routers/deploy/deployment/cancel.ts` around lines
75 - 85, The current catch on ctrl.cancelDeployment({ deploymentId:
input.deploymentId }) swallows control-plane errors and converts everything to a
generic INTERNAL_SERVER_ERROR with message: err; instead, detect and rethrow the
original error when it already represents a TRPCError (or has a code-like
property) so NOT_FOUND/FAILED_PRECONDITION are preserved, and when wrapping is
necessary create a TRPCError using a safe string message (e.g., err.message or
String(err)) rather than the raw err object; update the catch around
ctrl.cancelDeployment to check for err instanceof TRPCError (or err.code) and
throw err directly, otherwise throw new TRPCError({ code:
"INTERNAL_SERVER_ERROR", message: err.message ?? String(err) }).
