devops-execution-engine

Professional DevOps expertise for your Clawdbot instance

Version: 1.0.0
License: Apache 2.0
Author: Clawdbot Community
Platform: Clawdbot


Overview

The DevOps Execution Engine is a skill package that extends Clawdbot with professional-grade DevOps capabilities. It provides a structured Plan → Approve → Execute workflow for safe infrastructure operations, together with a library of 11 production-ready DevOps skills.

It is designed specifically for Clawdbot, which can execute commands under human oversight rather than only suggest them.

Features

✅ Safe Execution

  • Plan → Approve → Execute workflow
  • Risk classification (LOW/MEDIUM/HIGH/CRITICAL)
  • Pre-flight validation
  • Rollback procedures
  • Audit logging

📚 Comprehensive Skills Library

  • Kubernetes: Debug, deploy, manage
  • AWS: Operations, cost optimization
  • Infrastructure: Terraform, ArgoCD, GitOps
  • Containers: Docker operations
  • Observability: Log analysis, system health
  • Incident Response: Structured playbooks

🔒 Safety Guarantees

  • No auto-execution without approval (LOW-risk actions can be auto-approved only if explicitly configured)
  • Every action logged
  • Rollback plans included
  • Risk assessment for all operations

Installation

# Clone the repository
git clone https://github.com/clawdbot/devops-execution-engine

# Install as Clawdbot skill
clawdbot skills:install ./devops-execution-engine

# Verify installation
clawdbot skills:list | grep devops-execution-engine

Quick Start

1. Diagnose an Issue

You: Check the cluster health

Clawd: [uses system-health and k8s-debug skills]
- Nodes: 3/3 Ready
- Pods: 2 in CrashLoopBackOff
- Disk: worker-1 at 85% (warning)

2. Generate Execution Plan

You: Fix the crashloop pods

Clawd: 📋 EXECUTION PLAN: plan-20260126-003

Title: Fix CrashLoopBackOff in api-service
Risk: MEDIUM
Time: ~5min

Steps:
1. Increase memory limit 256Mi → 512Mi
2. Wait for rollout (5min timeout)
3. Verify all pods running

Rollback: kubectl rollout undo deployment/api-service

Approve? (yes/no/explain)

3. Approve & Execute

You: approve

Clawd: ✅ Executing plan-20260126-003...

Step 1/3: Patching deployment... ✓
Step 2/3: Waiting for rollout... ✓ (2m 34s)
Step 3/3: Verifying pods... ✓

✅ Plan completed successfully
All pods running, no restarts
Logged to: memory/actions-log.jsonl
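
Because the audit trail is a `.jsonl` file, each action can be stored as one JSON object per line. The following is a minimal sketch of that idea, not the skill's actual logger: the helper names and the record fields (`timestamp`, `plan_id`, `step`, `status`) are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_audit_record(log_path, plan_id, step, status):
    """Append one JSON object per line (JSON Lines) to the audit log.

    The field names are hypothetical; the real log schema may differ.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "plan_id": plan_id,
        "step": step,
        "status": status,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create memory/ if needed
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def read_audit_records(log_path):
    """Read the log back as a list of dicts, skipping blank lines."""
    with Path(log_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending rather than rewriting keeps earlier entries intact, which is the property an audit trail needs.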

Included Skills

Kubernetes

  • k8s-debug - Troubleshoot pods, deployments, nodes
  • k8s-deploy - Safe deployment workflows with rollback
  • argocd-gitops - GitOps workflows with ArgoCD

Cloud

  • aws-ops - AWS resource management and queries
  • cost-optimization - Cloud cost analysis and recommendations

Infrastructure

  • terraform-workflow - IaC workflows and best practices
  • docker-ops - Container operations and debugging

Operations

  • incident-response - Structured incident response playbooks
  • log-analysis - Cross-platform log analysis patterns
  • system-health - Quick health checks (disk, memory, CPU)
  • git-workflow - Git workflows and DevOps practices

Usage Examples

Kubernetes Debugging

"Debug the pods in production namespace"
"Why is api-service crashing?"
"Check resource usage across the cluster"

Incident Response

"We have a SEV1 - API is down"
"Run incident response for high error rates"
"Check recent deployments"

Cost Optimization

"Analyze AWS costs this month"
"Find underutilized resources"
"Suggest cost optimizations"

Deployments

"Deploy api-service v2.1.0 to production"
"Rollback the last deployment"
"Check ArgoCD sync status"

Execution Plan Format

Plans are generated as YAML in memory/execution-plans/:

plan:
  id: plan-20260126-001
  title: "Fix CrashLoopBackOff in api-service"
  risk: MEDIUM
  estimated_time: 5min
  
  rollback:
    method: "Rollback deployment to previous revision"
    commands: ["kubectl rollout undo deployment/api-service"]
  
steps:
  - action: kubectl_patch
    command: "kubectl patch deployment api-service..."
    risk: MEDIUM
    reversible: true
    
  - action: wait_for_rollout
    timeout: 5m
    success_criteria: "all pods running"
    
approval:
  required: true
  status: pending
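
A plan in this shape can be checked before it is presented for approval. The sketch below works on the plan after it has been parsed into a plain dict (for example by a YAML parser); `validate_plan` and the specific checks are illustrative assumptions, not the engine's actual validation rules.

```python
RISK_LEVELS = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


def validate_plan(doc):
    """Return a list of problems found in a parsed plan document.

    The checks are hypothetical examples of pre-flight validation.
    """
    problems = []
    plan = doc.get("plan") or {}
    for key in ("id", "title", "risk"):
        if not plan.get(key):
            problems.append(f"plan.{key} is missing")
    if plan.get("risk") and plan["risk"] not in RISK_LEVELS:
        problems.append(f"plan.risk must be one of {RISK_LEVELS}")
    rollback = plan.get("rollback") or {}
    if not rollback.get("commands"):
        problems.append("plan.rollback.commands is empty")
    steps = doc.get("steps") or []
    if not steps:
        problems.append("steps is empty")
    for i, step in enumerate(steps):
        if "action" not in step:
            problems.append(f"steps[{i}].action is missing")
    return problems


# The example plan above, as it would look after parsing.
EXAMPLE_PLAN = {
    "plan": {
        "id": "plan-20260126-001",
        "title": "Fix CrashLoopBackOff in api-service",
        "risk": "MEDIUM",
        "estimated_time": "5min",
        "rollback": {
            "method": "Rollback deployment to previous revision",
            "commands": ["kubectl rollout undo deployment/api-service"],
        },
    },
    "steps": [
        {"action": "kubectl_patch",
         "command": "kubectl patch deployment api-service...",
         "risk": "MEDIUM", "reversible": True},
        {"action": "wait_for_rollout", "timeout": "5m",
         "success_criteria": "all pods running"},
    ],
    "approval": {"required": True, "status": "pending"},
}
```

An empty result from `validate_plan(EXAMPLE_PLAN)` means none of these checks failed; it says nothing about whether the commands themselves are safe.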

Safety Model

Risk Levels

🟢 LOW

  • Read-only operations
  • No impact on running services
  • Auto-executable only when auto_approve_low_risk is enabled in the configuration

🟡 MEDIUM

  • Resource changes (memory, CPU limits)
  • Scaling operations
  • Non-production deployments
  • Requires approval

🔴 HIGH

  • Production deployments
  • Service restarts
  • Configuration changes
  • Requires approval + impact analysis

⛔ CRITICAL

  • Data operations
  • Security/RBAC changes
  • Namespace/resource deletion
  • Blocked by default; requires an explicit override (allow_critical in the configuration)
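
The four levels above amount to a small decision table. The sketch below expresses it as a function; the `Decision` enum and `decide` are hypothetical names, the two flags mirror the `auto_approve_low_risk` and `allow_critical` settings from the Configuration section, and the handling of an overridden CRITICAL action (treated like HIGH) is an assumption.

```python
from enum import Enum


class Decision(Enum):
    AUTO_EXECUTE = "auto_execute"
    NEEDS_APPROVAL = "needs_approval"
    NEEDS_APPROVAL_AND_IMPACT_ANALYSIS = "needs_approval_and_impact_analysis"
    BLOCKED = "blocked"


def decide(risk, auto_approve_low_risk=False, allow_critical=False):
    """Map a risk level to the gate an action must pass before it runs."""
    risk = risk.upper()
    if risk == "LOW":
        # Read-only work may skip approval, but only when configured to.
        if auto_approve_low_risk:
            return Decision.AUTO_EXECUTE
        return Decision.NEEDS_APPROVAL
    if risk == "MEDIUM":
        return Decision.NEEDS_APPROVAL
    if risk == "HIGH":
        return Decision.NEEDS_APPROVAL_AND_IMPACT_ANALYSIS
    if risk == "CRITICAL":
        # Assumption: an override lifts the block but not the review.
        if allow_critical:
            return Decision.NEEDS_APPROVAL_AND_IMPACT_ANALYSIS
        return Decision.BLOCKED
    raise ValueError(f"unknown risk level: {risk!r}")
```

Raising on an unknown level, rather than defaulting to LOW, keeps a typo in a plan from silently weakening the gate.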

Approval Process

  1. Generate - Clawd creates execution plan
  2. Present - Shows summary with risk assessment
  3. Review - You examine the plan
  4. Approve - Explicit "yes", "approve", or "execute"
  5. Execute - Clawd runs steps sequentially
  6. Verify - Post-execution validation
  7. Log - Record to audit trail
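
The approve, execute, verify, and log steps can be sketched as a single function. This is an illustration under stated assumptions, not the engine's implementation: `run_plan` and its parameters are hypothetical, command execution is injected as a callable so the sketch stays side-effect free, and rolling back automatically on a failed step or failed verification is an assumed policy.

```python
APPROVAL_WORDS = {"yes", "approve", "execute"}  # replies listed in step 4


def is_approved(reply):
    """Only an explicit approval word counts; anything else is a refusal."""
    return reply.strip().lower() in APPROVAL_WORDS


def _roll_back(rollback_commands, run, log):
    # Run every rollback command and record the outcome of each one.
    for command in rollback_commands:
        ok = run(command)
        log({"event": "rollback", "command": command, "ok": ok})
    return "rolled_back"


def run_plan(steps, rollback_commands, reply, run, verify, log):
    """Run an approved plan step by step.

    run(command) -> bool executes one command, verify() -> bool is the
    post-execution check, and log(event) records to the audit trail.
    All three are supplied by the caller (hypothetical interfaces).
    """
    if not is_approved(reply):
        log({"event": "not_approved"})
        return "not_approved"
    for index, command in enumerate(steps, start=1):
        ok = run(command)
        log({"event": "step", "index": index, "command": command, "ok": ok})
        if not ok:
            # Stop at the first failure; later steps never run.
            return _roll_back(rollback_commands, run, log)
    if not verify():
        return _roll_back(rollback_commands, run, log)
    log({"event": "completed"})
    return "completed"
```

Passing `run` in as a parameter makes a dry run straightforward: supply a function that prints the command and returns `True` instead of executing it.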

Configuration

Create ~/.clawdbot/skills/devops-execution-engine/config.yaml:

# Execution engine config
execution:
  auto_approve_low_risk: false    # Auto-approve LOW risk actions
  pause_between_steps: true       # Pause after each step
  timeout_default: 300            # Default timeout (seconds)
  
# Logging
audit:
  log_path: "memory/actions-log.jsonl"
  log_level: "info"
  
# Safety
safety:
  require_approval: true
  allow_critical: false           # Block CRITICAL actions
  dry_run_by_default: false
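
One way to consume such a file is to overlay it on a set of defaults and reject unknown keys, so a misspelled safety option fails loudly instead of being ignored. The sketch below assumes the values shown above are the defaults and that the file has already been parsed into a dict; `merge_config` is a hypothetical helper, not part of Clawdbot.

```python
import copy

# Assumption: the values shown in the sample config.yaml are the defaults.
DEFAULTS = {
    "execution": {
        "auto_approve_low_risk": False,
        "pause_between_steps": True,
        "timeout_default": 300,
    },
    "audit": {
        "log_path": "memory/actions-log.jsonl",
        "log_level": "info",
    },
    "safety": {
        "require_approval": True,
        "allow_critical": False,
        "dry_run_by_default": False,
    },
}


def merge_config(user_config):
    """Overlay user settings on the defaults, rejecting unknown keys."""
    merged = copy.deepcopy(DEFAULTS)  # never mutate the shared defaults
    for section, values in (user_config or {}).items():
        if section not in merged:
            raise KeyError(f"unknown config section: {section}")
        for key, value in values.items():
            if key not in merged[section]:
                raise KeyError(f"unknown config key: {section}.{key}")
            merged[section][key] = value
    return merged
```

For example, `merge_config({"execution": {"timeout_default": 600}})` changes only the timeout and leaves every safety setting at its default.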

Documentation


Contributing

Contributions are welcome. See CONTRIBUTING.md.

Adding Custom Skills

  1. Create skill directory in skills/
  2. Add SKILL.md with documentation
  3. Include example execution plans
  4. Submit PR

Reporting Issues

  • GitHub Issues: Bug reports and feature requests
  • Discussions: Questions and general discussion

License

Apache 2.0 - See LICENSE


Support


Built with ❤️ by the Clawdbot community