
Health Checks

The Exchange implements three complementary health check mechanisms:

| Layer | Protocol | Purpose | Consumer |
|---|---|---|---|
| gRPC Health | grpc.health.v1.Health/Check | Per-service health for gRPC-native probes | Kubernetes gRPC probes, Broker |
| HTTP Liveness | GET /healthz | Process alive, no external I/O | Kubernetes livenessProbe |
| HTTP Readiness | GET /readyz | All subsystems ready to serve traffic | Kubernetes readinessProbe, load balancers |

The gRPC Health layer is the primary probe mechanism for the Broker. HTTP endpoints exist for Kubernetes and non-gRPC consumers. All three layers derive status from the same internal subsystem health matrix.

Each subsystem is classified as relevant to liveness, readiness, or both. Background goroutines probe the subsystems and cache each result; health endpoints read only the cached status and perform no synchronous I/O.

| Subsystem | Liveness? | Readiness? | Rationale |
|---|---|---|---|
| Process alive | Yes | Yes | Base signal |
| Catalog loaded | No | Yes | Cannot serve offers without a catalog |
| Catalog age within threshold | No | Yes | Stale catalog means ingestion is stuck |
| WAL writable | No | Yes | Cannot record transactions |
| WAL utilization below 90% | No | Yes | Prevents transaction loss under pressure |
| Billing adapter reachable | No | Yes | Cannot authorize purchases |
| Signing keys loaded | No | Yes | Cannot sign offers or generate signed URLs |
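
Each readiness subsystem is probed in the background at its own interval:

| Subsystem | Interval | Mechanism |
|---|---|---|
| Catalog loaded | 5s | Check in-memory catalog pointer |
| Catalog age | 5s | Compare catalog timestamp to 2x the poll interval |
| WAL writable | 10s | No-op write attempt |
| WAL utilization | 10s | Size / max capacity |
| Billing adapter | 15s | TCP connect |
| Signing keys | 30s | Check key ring for valid keys |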
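
The probe-and-cache pattern described above could look roughly like the following. This is a minimal sketch, not the Exchange's actual code: the `Cache`, `Status`, `Probe`, and `Ready` names are hypothetical, and the check function stands in for a real subsystem probe.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Status is the cached result of one subsystem probe (hypothetical type).
type Status struct {
	Healthy bool
	Message string
}

// Cache holds the latest status per subsystem. Health endpoints only read it.
type Cache struct {
	mu       sync.RWMutex
	statuses map[string]Status
}

func NewCache() *Cache {
	return &Cache{statuses: make(map[string]Status)}
}

// Set records the latest probe result for a subsystem.
func (c *Cache) Set(name string, s Status) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.statuses[name] = s
}

// Get returns the cached status; ok is false if the subsystem was never probed.
func (c *Cache) Get(name string) (s Status, ok bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok = c.statuses[name]
	return s, ok
}

// Ready reports whether every named subsystem has a cached healthy status.
// A subsystem that has not been probed yet counts as unhealthy.
func (c *Cache) Ready(names ...string) bool {
	for _, n := range names {
		if s, ok := c.Get(n); !ok || !s.Healthy {
			return false
		}
	}
	return true
}

// Probe runs check immediately, then once per interval until ctx is cancelled.
func (c *Cache) Probe(ctx context.Context, name string, interval time.Duration, check func() Status) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		c.Set(name, check())
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cache := NewCache()
	// Stand-in check; a real one would inspect the in-memory catalog pointer.
	go cache.Probe(ctx, "catalog_loaded", 5*time.Second, func() Status {
		return Status{Healthy: true, Message: "2847 entries"}
	})

	time.Sleep(50 * time.Millisecond) // give the first probe time to run
	fmt.Println("ready:", cache.Ready("catalog_loaded"))
}
```

Treating a never-probed subsystem as unhealthy is a design choice in this sketch: it keeps a freshly started process out of rotation until every probe has reported at least once.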

GET /healthz returns 200 OK with the plain-text body ok if the process is alive. It performs no external I/O, runs no subsystem checks, and returns no JSON body.

Why no external I/O: Liveness probes that call external services can cause cascading failures. If a dependency slows down, the liveness probe times out, Kubernetes restarts the pod, which increases load on remaining pods, causing more restarts. The /healthz endpoint avoids this by checking only that the process can handle HTTP requests.
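
A liveness handler with these properties can be very small. The sketch below assumes a plain `net/http` server; the `healthz` and `probe` names are illustrative, and `probe` exists only to exercise the handler in-process.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// healthz answers liveness probes. It consults no subsystem and performs
// no I/O beyond writing the response itself.
func healthz(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; charset=utf-8")
	w.WriteHeader(http.StatusOK)
	io.WriteString(w, "ok")
}

// probe calls a handler in-process and returns its status code and body.
func probe(h http.HandlerFunc, path string) (int, string) {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe(healthz, "/healthz")
	fmt.Println(code, body) // prints "200 ok"
}
```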

GET /readyz returns 200 OK with per-subsystem detail when all readiness subsystems are healthy, and 503 Service Unavailable with the same detail when any subsystem is unhealthy. A healthy response looks like this:

{
  "ready": true,
  "subsystems": [
    {"name": "catalog_loaded", "healthy": true, "message": "2847 entries"},
    {"name": "catalog_age", "healthy": true, "message": "age: 2m14s (max: 10m)"},
    {"name": "wal_writable", "healthy": true, "message": "write latency: 0.4ms"},
    {"name": "wal_utilization", "healthy": true, "message": "42% (limit: 90%)"},
    {"name": "billing_adapter", "healthy": true, "message": "reachable: 1.2ms"},
    {"name": "signing_keys", "healthy": true, "message": "2 active keys"}
  ],
  "checked_at": "2025-01-15T10:30:01Z"
}
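
The readiness response can be produced from a snapshot of cached subsystem statuses. The following is a sketch under stated assumptions: the Go type names and the `snapshot` function are hypothetical, and `checked_at` is filled with the response time here, whereas the real endpoint may report the time of the last probe.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Subsystem and Readiness mirror the JSON fields of the /readyz body.
type Subsystem struct {
	Name    string `json:"name"`
	Healthy bool   `json:"healthy"`
	Message string `json:"message"`
}

type Readiness struct {
	Ready      bool        `json:"ready"`
	Subsystems []Subsystem `json:"subsystems"`
	CheckedAt  string      `json:"checked_at"`
}

// readyz builds a handler over snapshot, which must return cached statuses
// only (no synchronous I/O). Any unhealthy subsystem turns the response
// into a 503.
func readyz(snapshot func() []Subsystem) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		subs := snapshot()
		resp := Readiness{
			Ready:      true,
			Subsystems: subs,
			// Assumption: response time, not last-probe time.
			CheckedAt: time.Now().UTC().Format(time.RFC3339),
		}
		for _, s := range subs {
			if !s.Healthy {
				resp.Ready = false
				break
			}
		}
		code := http.StatusOK
		if !resp.Ready {
			code = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(resp)
	}
}

// statusFor serves one in-process /readyz request and returns the status code.
func statusFor(subs []Subsystem) int {
	rec := httptest.NewRecorder()
	h := readyz(func() []Subsystem { return subs })
	h(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	healthy := []Subsystem{{Name: "wal_writable", Healthy: true, Message: "write latency: 0.4ms"}}
	degraded := append(healthy, Subsystem{Name: "billing_adapter", Healthy: false, Message: "unreachable"})
	fmt.Println(statusFor(healthy), statusFor(degraded)) // prints "200 503"
}
```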

In Kubernetes, the three probes are wired into the container spec as follows:

containers:
  - name: exchange
    ports:
      - containerPort: 8080
        name: grpc
    startupProbe:
      httpGet:
        path: /readyz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30  # 30 x 10s = 5 minutes for first catalog build
    readinessProbe:
      grpc:
        port: 8080
        service: "ramp.v1.ExchangeService"
      periodSeconds: 10
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 20
      failureThreshold: 3  # 3 x 20s = 60s before restart

| Probe | Type | Rationale |
|---|---|---|
| startupProbe | HTTP /readyz | First catalog build can take minutes. HTTP gives subsystem detail for debugging slow starts |
| readinessProbe | gRPC ramp.v1.ExchangeService | Once started, use gRPC-native probing. Removes pod from Service endpoints when unhealthy |
| livenessProbe | HTTP /healthz | Lightweight, with no external I/O. Detects deadlocked processes |

The Broker’s Exchange Registry uses grpc.health.v1.Health/Check instead of naive HEAD probes. This enables structured routing decisions:

| Scenario | HEAD Probe | gRPC Health | Broker Action |
|---|---|---|---|
| Catalog stale but billing works | 200 (looks healthy) | NOT_SERVING | Skip for discovery |
| Billing down but catalog fresh | 200 (looks healthy) | NOT_SERVING | Can serve read-only, not transactions |
| Process alive, nothing ready | 200 (looks healthy) | NOT_SERVING | Skip entirely |
| Process dead | Connection refused | Connection refused | Mark unhealthy |

With HEAD probes, the first three scenarios all return 200 and are indistinguishable from a healthy Exchange. With gRPC health, the Broker knows the Exchange is degraded and routes accordingly.
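
One way the Broker could turn health results into the actions in the table is sketched below. This is an illustration only: it assumes the Exchange exposes separate per-service health entries for catalog and billing, which the text does not confirm, and the `Report`, `Action`, and `decide` names are hypothetical. The local `ServingStatus` type stands in for the SERVING and NOT_SERVING values of grpc.health.v1 so the example needs no gRPC dependency.

```go
package main

import "fmt"

// ServingStatus stands in for the grpc.health.v1 serving status values.
type ServingStatus int

const (
	NotServing ServingStatus = iota
	Serving
)

// Action is the routing decision the Broker takes for one Exchange.
type Action string

const (
	RouteNormally    Action = "route normally"
	SkipForDiscovery Action = "skip for discovery"
	ReadOnly         Action = "read-only, no transactions"
	SkipEntirely     Action = "skip entirely"
	MarkUnhealthy    Action = "mark unhealthy"
)

// Report is what the Broker learned from one round of health checks.
// Catalog and Billing are hypothetical per-service results.
type Report struct {
	Reachable bool // false when the connection was refused
	Catalog   ServingStatus
	Billing   ServingStatus
}

// decide maps a health report to a Broker action, following the table rows.
func decide(r Report) Action {
	switch {
	case !r.Reachable:
		return MarkUnhealthy
	case r.Catalog == Serving && r.Billing == Serving:
		return RouteNormally
	case r.Catalog == NotServing && r.Billing == Serving:
		return SkipForDiscovery
	case r.Catalog == Serving && r.Billing == NotServing:
		return ReadOnly
	default:
		return SkipEntirely
	}
}

func main() {
	stale := Report{Reachable: true, Catalog: NotServing, Billing: Serving}
	fmt.Println(decide(stale)) // prints "skip for discovery"
}
```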