Runbook: incident response
First-line operator response when sign-probe alerts fire or a lease cannot be activated.
Common symptoms
- Sign-probe uptime drops below the lease SLO.
- Traders report
UNAVAILABLE(14) orINTERNAL(13) from the signing gRPC endpoint. - The marketplace dashboard flags a validator as
degraded.
Emergency stop
If a validator is misbehaving and must be removed from rotation immediately:
validator-cli emergency-stop --reason "<short description>"
The CLI tears down the listening port on the host-proxy and emits an audit-log entry. The validator is delisted from the browse page within seconds. Existing leases enter a "frozen" state, and payments already collected are refunded automatically.
Triage checklist
- Capture the last 5 minutes of host-proxy logs from CloudWatch.
- Capture the enclave attestation document
(
validator-cli audit --since 5m). - Confirm the underlying EC2 host is healthy (
aws ec2 describe-instance-status). - Confirm KMS key access has not been revoked
(
aws kms describe-key --key-id <...>).
TODO: pending content
- Escalation contacts and PagerDuty service routing.
- Concrete log-query snippets for the most common failure classes.
- Post-incident review template.