GTM Engineering
January 18, 2026
Revenue Reliability Engineering (RRE)

id: ART-0019
title: "Scale Engineering: Revenue Reliability Engineering (RRE)"
version: v2.1
last_updated: 2025-10-25
owner: Revenue Reliability Office (RevOps Lead × SRE Lead)
tags: [rre, reliability, slo, error-budgets, incidents, chaos, gtm]
service_catalog_ref: "CAT-0019-RRE-v1"
slo_set: ["P(Response<5m)≥0.95", "P(Route<2m)≥0.99", "P(EnrichAge<24h)≥0.98", "P(WriteOK)≥0.999", "Msgs/min≥200&Error%≤0.5%", "P(Invite<2m)≥0.98", "P(AttributionOK)≥0.995", "P(ForecastAge<4h)≥0.98"]
incident_policy_ref: "INC-0019-Policy-v1"
chaos_suite_ref: "CHAOS-0019-v1"
RRE applies Site Reliability practices to revenue services—treating lead intake, enrichment, routing, sequencing, meeting creation, CRM writes, attribution, and forecasting as production systems with contracts, SLOs, error budgets, synthetic checks, chaos tests, and incident playbooks. This article installs RRE across the GTM stack and links every service to dashboards, runbooks, and change control. Adjacent artifacts: ops execution in [[ART-0016]] and metric/decision contracts in [[ART-0020]].
Service Catalog
Each service is a first-class product with owners, dependencies, critical paths, SLOs, error budgets, dashboards, and a runbook.
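A catalog entry can be sketched as a small typed record. This is an illustration only, not the catalog's actual schema: the class `ServiceEntry`, its field names, and the example values (owner handles, dependency names, runbook ID) are hypothetical. Only the SLO string and the `DB-ROUTING-LAT/AVAIL` dashboard name come from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEntry:
    """Hypothetical shape of one service catalog record (field names are assumptions)."""
    name: str
    owners: tuple[str, ...]
    dependencies: tuple[str, ...]
    critical_path: bool
    slos: tuple[str, ...]
    error_budget_window_days: int
    dashboards: tuple[str, ...]
    runbook: str

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that are empty."""
        required = ("owners", "slos", "dashboards", "runbook")
        return [f for f in required if not getattr(self, f)]


routing = ServiceEntry(
    name="routing",
    owners=("revops-lead", "sre-lead"),   # illustrative owner handles
    dependencies=("enrichment", "crm"),   # illustrative dependency names
    critical_path=True,
    slos=("P(Route<2m)≥0.99",),           # taken from this article's slo_set
    error_budget_window_days=30,
    dashboards=("DB-ROUTING-LAT/AVAIL",),
    runbook="RB-ROUTING",                 # hypothetical runbook ID
)

print(routing.missing_fields())
```

A check like `missing_fields()` could run in CI so that a service cannot be registered without an owner, an SLO, a dashboard, and a runbook.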
SLOs and Error Budgets
Budget Policy
Error-budget windows reset every 30 days; the throughput SLO uses a 7-day window.
Freeze-on-burn: when 100% of the budget is consumed, halt risky changes and revert to the last stable configuration; open an incident if one is not already active.
Ownership and thresholds live in [[ART-0020]] decision/metric contracts.
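The budget arithmetic behind freeze-on-burn can be sketched as follows. This is a minimal illustration under the usual SRE definition, where the budget is the share of events allowed to fail (1 minus the SLO target); the function names and the event counts are hypothetical, and the real thresholds live in the contracts referenced above.

```python
import math


def budget_consumed(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget consumed in the current window.

    The budget is the share of events allowed to fail: 1 - slo_target.
    A result of 1.0 means the budget is exactly exhausted.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic in the window, so nothing burned
    bad_fraction = (total - good) / total
    return bad_fraction / (1.0 - slo_target)


def should_freeze(consumed: float) -> bool:
    """Freeze-on-burn: halt risky changes once 100% of the budget is used.

    isclose() guards against float rounding right at the 100% boundary.
    """
    return consumed >= 1.0 or math.isclose(consumed, 1.0)


# Illustrative window: 10,000 routing events, 50 failures, target 0.99.
consumed = budget_consumed(good=9950, total=10000, slo_target=0.99)
print(f"budget consumed: {consumed:.0%}, freeze: {should_freeze(consumed)}")
```

With a 0.99 target the budget allows 1% of events to fail, so 50 failures in 10,000 events burn about half of it and do not trigger a freeze; 100 failures would.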
Monitoring and Synthetic Tests
Probes & Contract Tests
Dashboard Map (required charts)
DB-INTAKE-RTT: p50/p90/p95 response, volume heatmap, error codes.
DB-ENRICH-FRESH: freshness distribution by field/vendor, failover rate.
DB-ROUTING-LAT/AVAIL: p50/p95 latency, availability, unassigned rate, fairness index.
DB-SEQUENCER-THRPT: msgs/min, send errors by cause, policy refusals.
DB-MEET-RTT: invite latency, no-show predictors (informational).
DB-CRM-WRITE-SUCCESS: success rate by object, retry/dlq counts, MTTR.
DB-ATTR-VALID: contract test pass %, model drift.
DB-FCST-FRESH: last update age, pipeline deltas.
Sample Probes (fenced)
-- Routing availability (last 24h); returns NULL when the window has no events
SELECT 1 - SUM(CASE WHEN success = false THEN 1 ELSE 0 END)::float / NULLIF(COUNT(*), 0) AS availability
FROM routing_events
WHERE ts >= now() - interval '24 hours';
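Availability only covers whether routing succeeded; the latency SLO `P(Route<2m)≥0.99` from the SLO set needs its own check. A minimal sketch, assuming routing latencies for the window are available as a list of seconds (the function name and the sample data are hypothetical):

```python
def slo_attainment(latencies_s: list[float], threshold_s: float) -> float:
    """Share of events whose latency is strictly under the threshold."""
    if not latencies_s:
        raise ValueError("no events in window")
    within = sum(1 for x in latencies_s if x < threshold_s)
    return within / len(latencies_s)


# Illustrative data: 197 fast routes and 3 slow ones.
sample = [30.0] * 197 + [400.0] * 3
attained = slo_attainment(sample, threshold_s=120.0)
meets_slo = attained >= 0.99
print(f"P(Route<2m) = {attained:.3f}, meets SLO: {meets_slo}")
```

In this sample 3 of 200 routes exceed two minutes, so attainment is 98.5% and the 99% target is missed even if every route eventually succeeded.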
CRM Write synthetic probe
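import time
from uuid import uuid4

# `crm` and `record_metric` are assumed to be provided by the surrounding probe harness.
def probe_crm_write():
    # Tagged test lead with a unique address so probe records are easy to filter out.
    payload = {"test": True, "object": "lead", "email": f"probe+{uuid4()}@example.test"}
    t0 = time.monotonic()  # monotonic clock: unaffected by wall-clock adjustments
    ok = crm.write(payload)
    elapsed_ms = (time.monotonic() - t0) * 1000
    record_metric("crm_write_ok", int(ok))
    record_metric("crm_write_ms", elapsed_ms)
    assert ok, "CRM write probe failed"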
Decision OS Contract Test (enforced)
-- Metric integrity: required fields and enums
SELECT COUNT(*) AS violations
FROM leads l
LEFT JOIN territories t ON l.territory = t.code
WHERE l.email IS NULL
   OR l.email !~* '^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$'
   OR t.code IS NULL;
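The same contract can be exercised outside the database, for example in a unit test over fixture data. The sketch below is an assumption-laden mirror of the SQL check, not the enforced test itself: the function name, the dictionary keys, and the sample leads are hypothetical, and it deliberately counts a missing email as a violation.

```python
import re

# Email pattern in the spirit of the SQL check; IGNORECASE mirrors the !~* operator.
EMAIL_RE = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}", re.IGNORECASE)


def count_violations(leads: list[dict], territory_codes: set[str]) -> int:
    """Count leads with a missing or malformed email, or an unknown territory."""
    violations = 0
    for lead in leads:
        email = lead.get("email") or ""
        bad_email = EMAIL_RE.fullmatch(email) is None
        bad_territory = lead.get("territory") not in territory_codes
        if bad_email or bad_territory:
            violations += 1
    return violations


# Illustrative fixture: one valid lead and three violating ones.
fixture = [
    {"email": "a@b.co", "territory": "NA"},
    {"email": "not-an-email", "territory": "NA"},
    {"email": "c@d.io", "territory": "XX"},
    {"territory": "NA"},
]
print(count_violations(fixture, {"NA", "EMEA"}))
```

A lead with both a bad email and an unknown territory is counted once, matching the row-level `COUNT(*)` semantics of the SQL version.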
Incident Management
Severity Ladder
Paging & Comms Templates
[INCIDENT][SEV1] Routing latency breach. Start: 10:12 ET. Scope: NA inbound. SLO: P(Route<2m) current 0.92 (<0.99). Actions: reverted to ROUTING_V1, draining queues. Next update: +15m. Owner: @oncall-sre.
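Rendering updates from structured fields keeps every page in the same layout. A minimal sketch that reproduces the template above; the function name and parameter names are hypothetical, not part of the incident policy:

```python
def format_incident(sev: int, title: str, start: str, scope: str,
                    slo: str, current: float, target: float,
                    actions: str, next_update_min: int, owner: str) -> str:
    """Render a one-line incident update in the layout of the template above."""
    comparison = "<" if current < target else ">="
    return (
        f"[INCIDENT][SEV{sev}] {title}. Start: {start}. Scope: {scope}. "
        f"SLO: {slo} current {current:.2f} ({comparison}{target:.2f}). "
        f"Actions: {actions}. Next update: +{next_update_min}m. Owner: {owner}."
    )


message = format_incident(
    sev=1,
    title="Routing latency breach",
    start="10:12 ET",
    scope="NA inbound",
    slo="P(Route<2m)",
    current=0.92,
    target=0.99,
    actions="reverted to ROUTING_V1, draining queues",
    next_update_min=15,
    owner="@oncall-sre",
)
print(message)
```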
Blameless Postmortem Template
PM-0019-YYYYMMDD-
Summary
Timeline (UTC)
Impact (quantify: leads, $, SLO minutes)
Root Causes (systemic, local)
What Worked / What Didn’t
Action Items
- A1: — Owner — Due — Verification date
- A2: — Owner — Due — Verification date
Learnings Linked to [[ART-0016]] and [[ART-0020]]
Status Page & Customer Comms Checklist
Draft customer-facing note (scope, impact, workaround, next update time).
Update every 30 minutes until resolved; final “Resolved” with cause + prevention.
Resilience and Chaos
Dependency Map (high level)
Vendors: Enrichment A/B, Email API, Calendar API, Ads APIs.