Search docs

Find a documentation page

Runbook: incident response

First-line operator response when sign-probe alerts fire or a lease cannot be activated.

Common symptoms

  • Sign-probe uptime drops below the lease SLO.
  • Traders report UNAVAILABLE (14) or INTERNAL (13) from the signing gRPC endpoint.
  • The marketplace dashboard flags a validator as degraded.

Emergency stop

If a validator is misbehaving and must be removed from rotation immediately:

validator-cli emergency-stop --reason "<short description>"

The CLI tears down the listening port on the host-proxy and emits an audit-log entry. The validator is delisted from the browse page within seconds. Existing leases enter a "frozen" state, and payments already collected are refunded automatically.

Triage checklist

  1. Capture the last 5 minutes of host-proxy logs from CloudWatch.
  2. Capture the enclave attestation document (validator-cli audit --since 5m).
  3. Confirm the underlying EC2 host is healthy (aws ec2 describe-instance-status).
  4. Confirm KMS key access has not been revoked (aws kms describe-key --key-id <...>).

TODO: pending content

  • Escalation contacts and PagerDuty service routing.
  • Concrete log-query snippets for the most common failure classes.
  • Post-incident review template.