Quick Definition
RED metrics are three operational metrics — Rate, Errors, and Duration — used to monitor service health and performance. Analogy: like a car dashboard showing speed, engine warning lights, and trip time. Formal: a minimal SRE observability pattern that maps request throughput, failure rate, and latency to SLIs for distributed cloud services.
What is RED metrics?
RED metrics is an observability pattern focused on three core signals for request-driven services: Rate, Errors, and Duration. It is a pragmatic approach favored by SREs and cloud-native teams to quickly detect and triage service degradations without drowning in irrelevant metrics.
What it is / what it is NOT
- What it is: A focused SLI set for request-oriented systems that helps prioritize alerts and debugging effort.
- What it is NOT: A complete observability solution; it does not replace business metrics, detailed instrumentation, or security telemetry.
Key properties and constraints
- Minimal: three signals to reduce noisy alerts.
- Request-centric: best for synchronous request/response services.
- Aggregation-sensitive: requires careful labeling and cardinality control.
- Fast feedback loop: supports alerting and on-call actions.
Where it fits in modern cloud/SRE workflows
- SLO definition: RED metrics map to SLIs that feed SLOs and error budgets.
- Incident response: first triage layer to detect whether problems are throughput, failure, or latency related.
- CI/CD and release validation: used in canaries, rollouts, and automated rollbacks.
- Automation/AI ops: feeds anomaly detection models and automated remediation runbooks.
A text-only “diagram description” readers can visualize
- Clients send requests to edge load balancer; requests routed to service instances; exporter instruments requests to produce Rate, Error flag, and Duration; metrics aggregated by metrics pipeline; alerting/AI-runbooks subscribe and trigger alerts or automation; dashboards for ops and execs summarize three curves with drill-in to logs, traces, and resource telemetry.
RED metrics in one sentence
RED is a minimal set of request-centric SLIs — Rate, Errors, and Duration — used to detect and triage service degradations and feed SLO-driven operations.
RED metrics vs related terms
| ID | Term | How it differs from RED metrics | Common confusion |
|---|---|---|---|
| T1 | SLIs | SLIs are any service indicators while RED is a focused SLI pattern | People think SLIs must be RED only |
| T2 | SLOs | SLOs are objectives applied to SLIs; RED supplies candidate SLIs | Confusing SLO policy with metric collection |
| T3 | SLAs | SLAs are legal contracts; RED aids monitoring not contractual terms | Assuming RED equals SLA coverage |
| T4 | Four Golden Signals | Golden Signals cover latency, traffic, errors, and saturation; RED omits saturation | Assuming RED is equivalent to the Golden Signals |
| T5 | APM traces | Traces show request paths; RED are aggregated numeric signals | Belief that traces replace RED |
| T6 | Business metrics | Business metrics measure outcomes; RED measures system health | Equating RED drops with revenue drops |
| T7 | Saturation metrics | Saturation is resource utilization; RED focuses on requests | Overusing CPU as primary RED signal |
| T8 | Error budgets | Error budgets consume SLO violations; RED supplies the error SLI | Assuming errors alone capture budget burn |
| T9 | Heartbeat metrics | Heartbeats check liveness; RED captures request behavior | Treating heartbeat up as healthy without RED checks |
| T10 | Chaos experiments | Chaos tests resilience; RED measures results during experiments | Believing chaos replaces continuous RED monitoring |
Why does RED metrics matter?
Business impact (revenue, trust, risk)
- Revenue protection: quick detection of increased error rates or latency prevents transactional loss.
- Customer trust: stable response times preserve user experience and retention.
- Risk reduction: clear SLIs enable contractual compliance and reduce legal exposure.
Engineering impact (incident reduction, velocity)
- Faster triage: triage using three signals focuses root cause search.
- Reduced alert fatigue: targeted alerts reduce noisy paging and improve signal-to-noise.
- Faster deployments: canary and automated rollback rely on RED-based SLOs to safely increase velocity.
SRE framing (SLIs/SLOs/error budgets/toil/on-call)
- SLIs: RED metrics provide canonical SLI candidates for request services.
- SLOs: set latency and error targets with rolling windows to control error budgets.
- Error budgets: drive release policies, automated rollbacks, and prioritization.
- Toil reduction: automations can be triggered when RED patterns are recognized; SREs can focus on long-term reliability work.
- On-call: clear RED alerts map to specific runbooks instead of vague paging.
3–5 realistic “what breaks in production” examples
- Sudden spike in Duration due to a third-party API slowdown causing backlog and eventual timeouts.
- Increased Errors caused by a recent deployment with an incorrect feature flag logic path.
- Drop in Rate because of traffic routing misconfiguration at the ingress or CDN.
- Intermittent errors due to resource exhaustion on a subset of nodes because of a memory leak.
- Latency tail increases because of noisy neighbors in a multi-tenant serverless platform.
Where is RED metrics used?
| ID | Layer/Area | How RED metrics appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge / CDN | Rate = requests per second at edge | Request count, edge latency, TLS errors | Metrics pipeline, CDN dashboards |
| L2 | Network / Load balancer | Rate and Errors for connectivity | LB metrics, connection errors, 5xx | Cloud LB metrics, Prometheus |
| L3 | Service / Application | Core RED at service level | Request count, response codes, latency histograms | Tracing, Prometheus, APM |
| L4 | Data / DB layer | Duration and Errors for DB calls | Query latency, error counts, saturation | DB monitoring, tracing |
| L5 | Kubernetes | RED for pods and services in cluster | Pod request counts, pod-level latency | Prometheus, kube-state-metrics |
| L6 | Serverless / FaaS | Invocations as Rate, failures as Errors | Invocation count, cold-start latency | Managed monitoring, logs |
| L7 | CI/CD / Canary | RED as canary health signals | Canary metrics, deployment timestamps | CD system metrics, monitoring hooks |
| L8 | Incident Response | RED drives incident priorities | Aggregated RED trends, correlated logs | Incident platforms, alerting systems |
| L9 | Security / DoS detection | Unusual Rate spikes flagged as security alarms | High request rates, error patterns | WAF, security monitoring |
When should you use RED metrics?
When it’s necessary
- For request-oriented services handling user or API traffic.
- When you need fast triage signals for on-call teams.
- To feed SLOs and automate release gating.
When it’s optional
- For non-request-driven systems like batch jobs, sensor pipelines, or streaming jobs where other metrics apply.
- When business metrics are the primary focus and service metrics are lower priority.
When NOT to use / overuse it
- Not appropriate as the sole observability for background jobs, uninstrumented pipelines, or purely event-sourced backends.
- Avoid using RED for internal library functions or excessively high-cardinality dimensions.
Decision checklist
- If service is synchronous and request-driven AND supports SLIs -> implement RED.
- If service is asynchronous batch-oriented -> consider job-oriented SLIs instead.
- If you need contract-level guarantees -> pair RED with business SLIs and SLAs.
Maturity ladder
- Beginner: instrument request count, status codes, and mean latency for top-level endpoints.
- Intermediate: add latency histograms, per-endpoint SLIs, and basic SLOs with alerts.
- Advanced: per-user or per-tenant SLOs, adaptive alerting with burn-rate, AI-assisted anomaly detection, auto-rollbacks, and security correlation.
How does RED metrics work?
Components and workflow
- Instrumentation: apps emit request metrics (a request counter, an error counter, and a latency histogram) with stable labels; see the sketch after this list.
- Metrics pipeline: scrapers/exporters collect metrics, funnel into aggregation and storage.
- Aggregation: compute SLI windows (e.g., 5m, 1h, 28d) and percentiles.
- Alerting/evaluation: compare against SLOs and run error budget policies.
- Triage: dashboards and traces provide drill-down; automation may trigger rollbacks.
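A minimal instrumentation sketch of the workflow above, using the Python prometheus_client library. The metric names, label set (service, endpoint, method, code), and bucket boundaries are illustrative assumptions, not a prescribed schema.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names and labels; keep labels low-cardinality (no user IDs, no raw URLs).
REQUESTS = Counter(
    "http_requests_total", "Total requests", ["service", "endpoint", "method", "code"]
)
ERRORS = Counter(
    "http_request_errors_total", "Failed requests", ["service", "endpoint", "method", "code"]
)
DURATION = Histogram(
    "http_request_duration_seconds", "Request duration",
    ["service", "endpoint", "method"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),  # tune to your latency profile
)

def handle_request(endpoint: str, method: str, do_work) -> int:
    """Wrap a request handler and emit Rate, Errors, and Duration."""
    start = time.perf_counter()
    code = 200
    try:
        do_work()
    except Exception:
        code = 500
        raise
    finally:
        elapsed = time.perf_counter() - start
        labels = dict(service="checkout", endpoint=endpoint, method=method)
        REQUESTS.labels(code=str(code), **labels).inc()
        if code >= 500:
            ERRORS.labels(code=str(code), **labels).inc()
        DURATION.labels(**labels).observe(elapsed)
    return code

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a Prometheus scrape
    handle_request("/checkout", "POST", lambda: time.sleep(0.05))
```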
Data flow and lifecycle
- Generation: app records metric per request.
- Collection: agent or library pushes/pulls to observability backend.
- Retention: short and long-term retention balanced for incident triage and trend analysis.
- Consumption: dashboards, alerting rules, SLO controllers, and automation consume aggregated SLI values.
Edge cases and failure modes
- Cardinality storms: too many label values create high cardinality and cost.
- Metric loss: buffering/backpressure can drop metrics in outages.
- Aggregation distortions: misconfigured histograms or incorrect units mislead SLOs.
- Sampling pitfalls: aggressive sampling may hide error hotspots.
Typical architecture patterns for RED metrics
- Sidecar metrics exporter: per-pod sidecar collects local metrics and forwards to Prometheus remote write; good for Kubernetes reliability.
- Library instrumentation with OpenTelemetry: standardized SDKs inside services reporting counters and histograms; best for multi-platform consistency.
- Edge-first instrumentation: capture RED at API gateway/ingress for uniformity across downstream services; useful for zero-instrumentation services.
- Serverless integrated metrics: rely on managed platform metrics augmented by function-level instrumentation; best for FaaS.
- Hybrid pipeline with processing: metrics collected centrally, enriched with traces and logs in processing cluster, then written to long-term store; suitable for large organizations with AI/automation.
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | High cardinality | Metrics store spikes and queries slow | Too many label values | Reduce labels, use hashing | Increased scrape failures |
| F2 | Metric loss | Missing data in dashboards | Network/agent failure | Buffering, redundant scraping | Gaps in time series |
| F3 | Incorrect units | Misleading SLO breaches | Wrong instrumentation units | Standardize SDKs, code reviews | Strange percentile values |
| F4 | Sampling hiding errors | No alerts despite problems | Aggressive sampling | Reduce sampling for error cases | Low error counts but high user complaints |
| F5 | Histogram misconfig | Latency percentiles wrong | Poor bucket choices | Reconfigure histograms | Percentiles inconsistent with traces |
| F6 | Aggregation lag | Delayed alerts | Ingestion pipeline backlog | Scale pipeline, increase retention | Alert delays and backlog metrics |
Key Concepts, Keywords & Terminology for RED metrics
- Request — A client-initiated operation — The primary unit RED measures — Mistaking background jobs for requests.
- Throughput — Requests per second — Indicates load and capacity needs — Confusing burst vs sustained.
- Rate — Count of requests over time — Core RED R — Using raw counters without rate normalization.
- Error rate — Fraction of requests that fail — Core RED E — Not all 5xx are equivalent business errors.
- Latency — Time to complete a request — Core RED D — Using averages hides tail behavior.
- Duration histogram — Bucketed distribution of latencies — Enables percentile SLIs — Wrong buckets distort percentiles.
- P95/P99 — 95th/99th percentile latency — Tail performance metric — Over-emphasizing p99 leads to wasted effort.
- SLI — Service Level Indicator — Measurable signal used for reliability — Choosing bad SLIs breaks SLOs.
- SLO — Service Level Objective — Target for an SLI — Over-ambitious SLO causes constant alerts.
- SLA — Service Level Agreement — Contractual promise often with penalties — SLA needs more than RED.
- Error budget — Allowance for SLO breaches — Drives release policy — Not tracking burn causes surprises.
- Burn rate — Speed of consuming error budget — Used for automated response — Mis-calculating window causes false alarms.
- Instrumentation — Code that records metrics — Foundation of RED — Inconsistent instrumentation yields noise.
- Observability pipeline — Transport and storage of telemetry — Critical for signal integrity — Single point of failure risk.
- Prometheus exposition — Common scraping model — Works well for cloud-native — Pull model limitations with serverless.
- OpenTelemetry — Standard instrumentation telemetry API — Enables portability — SDK complexity can lead to fragmentation.
- Remote write — Sending metrics to external store — Enables scaling and AI processing — Adds latency in alerts.
- Cardinality — Number of unique metric label combinations — Affects cost and performance — High cardinality can break backends.
- Label — A metric dimension like endpoint or region — Key for slicing metrics — Over-labeling creates cardinality issues.
- Aggregation window — Time window for SLI computation — Determines sensitivity — Very short windows cause noise.
- Percentile — Value below which X% of samples fall — Useful for tail latency — Misinterpreting percentiles leads to wrong fixes.
- Histogram — Structure to collect distribution data — Enables accurate percentiles — Incorrect boundaries invalidate SLOs.
- Counter — Monotonic incrementing metric — Used for rate and errors — Reset behaviors must be handled.
- Gauge — Metric that can go up and down — Used for current state like concurrency — Not typically part of RED.
- Trace — Distributed record of a single request path — Used to debug RED anomalies — Traces sampled may miss edge cases.
- Log — Text record of system events — Complementary to RED for detailed debugging — Unstructured logs hinder automation.
- Canary — Small controlled deployment to test changes — RED metrics are excellent canary health signals — Canaries require realistic traffic.
- Auto-rollback — Automated rollback triggered by SLO breach — Reduces incident blast radius — Must be carefully tuned to avoid flapping.
- Anomaly detection — Statistical or ML-based change detection — Helps find subtle RED deviations — False positives are common without tuning.
- Alert threshold — Value that triggers alerting — Central to operational signal — Bad thresholds cause pager fatigue.
- Deduplication — Grouping similar alerts — Lowers noise — Over-deduping hides distinct issues.
- Correlation — Linking RED signals to logs/traces — Speeds triage — Correlation errors waste time.
- Cardinality budget — Policy limiting label counts — Protects backend costs — Strict budgets can reduce diagnostic granularity.
- Tail latency — Latency experienced by worst X percent — Business-critical for UX — Fixing tail often more costly.
- Resource saturation — CPU/memory limits reached — Not part of RED but correlates — Ignoring saturation leads to repeated incidents.
- Backpressure — Downstream overload propagating upstream — Causes increased latency and errors — Requires circuit breakers.
- Circuit breaker — Failure containment pattern — Prevents cascading failures — Wrong thresholds cause unnecessary failures.
- Rate limiting — Throttling traffic to protect services — Impacts Rate metric intentionally — Should be visible in metrics.
- Service mesh — Infrastructure layer for service-to-service comms — Adds observability hooks for RED — Sidecar overhead and complexity.
How to Measure RED metrics (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Rate (RPS) | Throughput and load | Count requests per second, aggregated by endpoint | Varies by service scale | Including retries inflates the count |
| M2 | Success rate | Fraction of requests that succeed | 1 − (errors / total) over the window | 99%, or per SLA | Define what counts as an error |
| M3 | Error rate | Failure portion of requests | Count error status codes / total | 0.1% to 1% starting | Transient errors may spike |
| M4 | Latency p95 | Tail latency user sees | Histogram p95 over 5m | 200ms or service dependent | Use correct histogram buckets |
| M5 | Latency p99 | Severe tail latency | Histogram p99 over 5m | 500ms or as needed | p99 noisy at low traffic |
| M6 | Latency median | Typical response time | Histogram p50 | Use to track typical performance | Median hides tail issues |
| M7 | Request duration histogram | Latency distribution shape | Record duration histograms in ms | Bucketed to capture tails | Wrong bucket ranges break percentiles |
| M8 | Timeouts | Requests timed out | Count of client-side or server timeouts | Keep near zero | Proxy vs app timeouts differ |
| M9 | Throttled rate | Rate limited requests | Count of 429 or custom throttle events | Track for capacity planning | Throttles may be normal behavior |
| M10 | Request concurrency | Active requests at time | Gauge of concurrent requests | Use to detect saturation | Must be sampled accurately |
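If the backend is Prometheus, the core SLIs in the table above can be expressed roughly as the PromQL queries below (held in Python string constants). The metric names assume the instrumentation sketch shown earlier; adjust them to whatever your services actually emit.

```python
# Hedged sketch: PromQL for M1 (Rate), M3 (Error rate), and M4 (Latency p95),
# assuming the counter/histogram names from the earlier instrumentation example.

RATE_RPS = 'sum(rate(http_requests_total[5m])) by (service, endpoint)'

ERROR_RATE = (
    'sum(rate(http_requests_total{code=~"5.."}[5m])) by (service) '
    '/ sum(rate(http_requests_total[5m])) by (service)'
)

LATENCY_P95 = (
    'histogram_quantile(0.95, '
    'sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, endpoint))'
)

if __name__ == "__main__":
    for name, query in [("rate", RATE_RPS), ("error_rate", ERROR_RATE), ("p95", LATENCY_P95)]:
        print(f"{name}: {query}")
```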
Best tools to measure RED metrics
Tool — Prometheus
- What it measures for RED metrics: Counters, histograms, and gauges for services; collects and stores time series.
- Best-fit environment: Kubernetes and cloud-native environments.
- Setup outline:
- Instrument apps with client libraries.
- Expose /metrics endpoints.
- Configure Prometheus scrape jobs.
- Use histogram and summary types appropriately.
- Strengths:
- Wide ecosystem and alerting rules.
- Native histogram support for percentiles.
- Limitations:
- Scaling and long-term storage needs remote write.
- Cardinality-sensitive.
Tool — OpenTelemetry + Collector
- What it measures for RED metrics: Standardized metrics export, traces, and logs correlated.
- Best-fit environment: Multi-platform observability and vendor portability.
- Setup outline:
- Instrument apps with OT SDKs.
- Configure collector pipelines.
- Export to metrics backends or APM.
- Strengths:
- Vendor-agnostic and extensible.
- Cross-signal correlation.
- Limitations:
- Complexity in configuration and sampling choices.
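A hedged sketch of the setup outline above using the opentelemetry-python SDK. The console exporter stands in for whatever collector or backend pipeline you actually run, and the instrument and service names are assumptions.

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Export to stdout here; in practice you would point an OTLP exporter at a Collector.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=10_000)
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("payments-service")  # hypothetical service name

requests = meter.create_counter("app.request.count", unit="1", description="Requests handled")
errors = meter.create_counter("app.request.errors", unit="1", description="Failed requests")
duration = meter.create_histogram("app.request.duration", unit="ms", description="Request duration")

def record(endpoint: str, status_code: int, elapsed_ms: float) -> None:
    # One call per request covers all three RED signals.
    attrs = {"endpoint": endpoint, "status_code": status_code}
    requests.add(1, attrs)
    if status_code >= 500:
        errors.add(1, attrs)
    duration.record(elapsed_ms, attrs)

record("/pay", 200, 42.0)
provider.shutdown()  # flush pending metrics before exit
```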
Tool — Managed APM (various vendors)
- What it measures for RED metrics: Request throughput, errors, latency, traces out of the box.
- Best-fit environment: Teams seeking easy trace-backed RED with minimal ops.
- Setup outline:
- Install agent or library.
- Configure service names and environment tags.
- Enable sampling and error capture.
- Strengths:
- Fast time-to-value and UI for traces.
- Limitations:
- Cost at scale and possible vendor lock-in.
Tool — Cloud Provider Metrics (built-in)
- What it measures for RED metrics: Edge and managed service request counts and latencies.
- Best-fit environment: Serverless and managed-PaaS services.
- Setup outline:
- Enable monitoring in platform console.
- Export metrics to enterprise telemetry if needed.
- Strengths:
- Low instrumentation effort for managed services.
- Limitations:
- Limited customization and retention.
Tool — Tracing backends (Jaeger/Zipkin)
- What it measures for RED metrics: Provides traces to investigate duration and error distributions by path.
- Best-fit environment: Distributed systems needing per-request path visibility.
- Setup outline:
- Instrument with tracing library.
- Collect spans and set sampling rates.
- Link traces to metrics and logs.
- Strengths:
- Deep path-level visibility.
- Limitations:
- Requires sampling strategy and storage planning.
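A minimal tracing sketch, assuming OpenTelemetry as the instrumentation library; in production the console exporter would be swapped for an OTLP exporter feeding Jaeger, Zipkin, or a managed backend. Span and attribute names are illustrative.

```python
import time
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # swap for OTLP in production
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("orders-service")  # hypothetical service name

def get_order(order_id: str) -> None:
    # Each request becomes a span; its duration and status line up with RED's D and E.
    with tracer.start_as_current_span("GET /orders/{id}") as span:
        span.set_attribute("http.method", "GET")
        with tracer.start_as_current_span("db.query"):  # child span shows where the time went
            time.sleep(0.03)
        span.set_attribute("http.status_code", 200)

get_order("o-123")
provider.shutdown()  # flush spans before exit
```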
Recommended dashboards & alerts for RED metrics
Executive dashboard
- Panels:
- Service-level RPS trend (1h, 24h): shows traffic changes.
- Error rate aggregated across services: business-level health.
- p95 latency for key customer-facing endpoints: UX signal.
- Error budget consumption: business decision signal.
- Why: high-level stakeholders need health and risk view.
On-call dashboard
- Panels:
- Real-time RPS and error rate per service.
- p50/p95/p99 latency panels.
- Error waterfall by status code and endpoint.
- Recent alerts and ongoing incidents.
- Why: fast triage and root cause isolation.
Debug dashboard
- Panels:
- Per-instance request rate, CPU, memory.
- Latency heatmap by endpoint and region.
- Trace sampling of recent errors.
- DB call latency and error breakdown.
- Why: allows deep investigation to craft fixes.
Alerting guidance
- What should page vs ticket:
- Page for SLO burn-rate thresholds and sustained error rate above critical thresholds.
- Ticket for short-lived spikes or non-urgent degradation that requires scheduled work.
- Burn-rate guidance:
- If burn rate > 4x and projected to exhaust budget within 24h -> page.
- If burn rate is between 1–4x -> create a ticket and notify owners (a burn-rate calculation is sketched at the end of this section).
- Noise reduction tactics:
- Deduplicate similar alerts by grouping labels.
- Suppress during known maintenance windows.
- Use dynamic thresholds based on moving windows and baseline models.
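A small sketch of the burn-rate guidance above: burn rate is the observed error rate divided by the error rate the SLO allows, and the 4x page / 1–4x ticket thresholds mirror the bullets. The numbers are starting points, not universal values.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Burn rate = observed error rate / allowed error rate (1 - SLO target)."""
    allowed = 1.0 - slo_target
    return observed_error_rate / allowed if allowed > 0 else float("inf")

def alert_action(observed_error_rate: float, slo_target: float = 0.999) -> str:
    br = burn_rate(observed_error_rate, slo_target)
    if br > 4:   # budget will be gone far too fast -> wake someone up
        return "page"
    if br > 1:   # budget eroding faster than planned -> ticket and notify owners
        return "ticket"
    return "ok"

# 0.5% errors against a 99.9% SLO burns budget 5x too fast -> page.
print(alert_action(0.005))   # page
print(alert_action(0.002))   # ticket
print(alert_action(0.0005))  # ok
```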
Implementation Guide (Step-by-step)
1) Prerequisites – Ownership defined for services and metrics. – Instrumentation libraries chosen (OpenTelemetry recommended). – Metrics backend and retention policy selected.
2) Instrumentation plan – Identify top-level endpoints and client-facing operations. – Define stable labels (service, endpoint, region, tenant). – Add counters for requests and errors, histograms for duration. – Ensure semantic conventions across services.
3) Data collection – Deploy collectors or configure scraping. – Set retention windows for short-term (90d) and long-term (1+ year) as needed. – Monitor pipeline health metrics for ingestion lag.
4) SLO design – Choose SLIs from RED metrics per service or endpoint. – Calculate SLO windows and error budget sizes (see the sketch after these steps). – Create burn-rate and alerting policies.
5) Dashboards – Build executive, on-call, and debug dashboards. – Add drill-down links to traces and logs.
6) Alerts & routing – Implement alert rules using correct aggregation windows. – Configure routing to on-call teams and escalation policies.
7) Runbooks & automation – Create runbooks for top RED alerts mapping to troubleshooting steps. – Automate common remediation where safe (e.g., restart, rollback).
8) Validation (load/chaos/game days) – Run load and chaos experiments to ensure RED metrics surface issues. – Validate alerting and automation triggers.
9) Continuous improvement – Review SLO breaches in postmortems and iterate instrumentation. – Monitor cardinality and cost, adjust labeling.
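For step 4 (SLO design), a small sketch of how an SLO target and window translate into an error budget, both as a request count and as equivalent minutes of full outage. The 99.9% target and 28-day window are illustrative.

```python
def error_budget_requests(total_requests: int, slo_target: float) -> float:
    """How many failed requests the window can absorb before the SLO is breached."""
    return total_requests * (1.0 - slo_target)

def error_budget_minutes(window_days: int, slo_target: float) -> float:
    """The same budget viewed as 'allowed bad minutes' of full outage."""
    return window_days * 24 * 60 * (1.0 - slo_target)

# Example: 99.9% availability over a 28-day window.
print(error_budget_requests(10_000_000, 0.999))   # 10,000 failed requests allowed
print(round(error_budget_minutes(28, 0.999), 1))  # ~40.3 minutes of full outage
```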
Pre-production checklist
- Instrumentation present for top endpoints.
- Local and staging metrics pipelines validated.
- Alerts and dashboards exist and tested with synthetic traffic.
- Runbooks drafted for high-severity RED alerts.
Production readiness checklist
- Metrics retention and storage capacity verified.
- Alert routing and escalation configured.
- Error budget policies and automated responses in place.
- Observability health dashboards show no ingestion gaps.
Incident checklist specific to RED metrics
- Confirm metric ingestion and timestamps.
- Check global vs service-level rate changes.
- Drill into traces for high-latency or error paths.
- Validate recent deployments or config changes.
- Execute runbook steps and, if necessary, trigger rollback.
Use Cases of RED metrics
1) Canary deployments – Context: New release gradually rolled out. – Problem: Regressions may affect availability. – Why RED helps: Canary RED signals quickly detect regressions. – What to measure: Error rate, p95 latency for canary vs baseline. – Typical tools: CD pipeline, Prometheus, alerting.
2) Multi-tenant fairness – Context: Tenant impact due to noisy neighbor. – Problem: One tenant increases latency for others. – Why RED helps: Per-tenant Rate and Duration reveal noisy tenants. – What to measure: Per-tenant error rate and p99 latency. – Typical tools: Instrumentation with tenant label, analytics.
3) Third-party API failure – Context: Downstream API slows or errors. – Problem: Cascading latency and errors. – Why RED helps: Duration spikes and increased timeouts expose the issue. – What to measure: Downstream call duration and error counts. – Typical tools: Tracing, metrics, circuit breaker logs.
4) Autoscaling tuning – Context: Under/over-provisioned services. – Problem: Latency under high load or wasted resources. – Why RED helps: Concurrency and latency patterns guide scaling. – What to measure: RPS, concurrency, p95 latency, CPU. – Typical tools: Metrics, autoscaler.
5) Serverless cold-start detection – Context: Increased latency due to cold starts. – Problem: Bad UX and SLAs missed. – Why RED helps: Duration distribution shows cold start tail. – What to measure: Invocation duration histogram, per-runtime warm metric. – Typical tools: Cloud metrics, function logs.
6) Incident prioritization – Context: Multiple alerts during an outage. – Problem: Prioritization is unclear. – Why RED helps: Aggregate error rate and traffic determine severity. – What to measure: Global error rate, top offending endpoints. – Typical tools: Incident platform, dashboards.
7) Feature launch monitoring – Context: New feature rollout to users. – Problem: Feature causes slowdowns. – Why RED helps: Focus on endpoints impacted by feature. – What to measure: Rate and latency for new endpoints. – Typical tools: Telemetry, feature flag metrics.
8) Cost-performance trade-offs – Context: Need to balance latency vs spend. – Problem: Overprovisioned resources. – Why RED helps: Identify acceptable SLOs to lower costs. – What to measure: Latency percentiles vs instance counts. – Typical tools: Cloud metrics, cost analytics.
9) Abuse detection – Context: Unexpected traffic patterns. – Problem: DoS or scraping impacting service. – Why RED helps: Sudden Rate spikes trigger security alerts. – What to measure: Rate by IP, error rates, unusual patterns. – Typical tools: WAF, edge metrics.
10) Compliance and reporting – Context: Regulatory obligations for uptime. – Problem: Need auditable SLOs. – Why RED helps: Provides measurable SLIs for compliance. – What to measure: Error budgets, SLO compliance reports. – Typical tools: Observability platform with reporting.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes: P99 latency spike during traffic surge
Context: A web service running on Kubernetes experiences a p99 latency spike during a marketing campaign.
Goal: Detect and mitigate latency to meet SLOs.
Why RED metrics matters here: RED quickly shows duration tail increasing and whether errors or rate also changed.
Architecture / workflow: Ingress -> Service -> Pods -> DB. Prometheus scrapes pod metrics; traces via OpenTelemetry.
Step-by-step implementation:
- Ensure service emits histograms and error counters.
- Scrape pod metrics with Prometheus.
- Dashboard shows p50/p95/p99 by pod and endpoint.
- Alert on p99 > SLO for 5m with a burn-rate check (expression sketched after this scenario).
- On alert, runbook: check pod CPU/memory, check DB latency, review recent deploys, scale up if needed.
What to measure: p99, error rate, pod CPU, DB query latency.
Tools to use and why: Prometheus for metrics, Jaeger for traces, kube-state-metrics for pod health.
Common pitfalls: High-cardinality labels per request causing Prometheus overload.
Validation: Load test with synthetic traffic mimicking the campaign; verify alerts and auto-scaling.
Outcome: Root cause identified as a missing DB index; fix applied and latency reduced.
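A hedged sketch of the alerting step above, with the p99 expressed as a PromQL query string and the paging decision gated on a burn-rate check. The metric name and the 0.5 s threshold are assumptions for this scenario.

```python
# Assumed histogram metric emitted by the service's instrumentation.
P99_BY_ENDPOINT = (
    'histogram_quantile(0.99, '
    'sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))'
)

P99_SLO_SECONDS = 0.5  # illustrative SLO threshold for this scenario

def should_page(p99_seconds: float, burn_rate: float) -> bool:
    """Page only when the tail is over SLO and the error budget is burning fast."""
    return p99_seconds > P99_SLO_SECONDS and burn_rate > 4

print(P99_BY_ENDPOINT)
print(should_page(p99_seconds=0.9, burn_rate=6))    # True
print(should_page(p99_seconds=0.9, burn_rate=0.5))  # False
```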
Scenario #2 — Serverless/managed-PaaS: Cold-start & third-party latency
Context: Function-based API shows intermittent high latencies for a subset of invocations.
Goal: Reduce tail latency and identify cold-start contribution.
Why RED metrics matters here: Duration histograms differentiate cold starts vs warmed invocations.
Architecture / workflow: API Gateway -> Function runtime -> Downstream service. Cloud metrics plus custom telemetry.
Step-by-step implementation:
- Instrument function start and handler durations.
- Emit a tag marking cold-start invocations (see the sketch after this scenario).
- Monitor p95/p99 for both cold and warm invocations.
- Alert if cold-start p99 > threshold or overall p99 increases.
What to measure: Invocation count, cold-start count, p95/p99 durations.
Tools to use and why: Cloud provider metrics for invocations, OpenTelemetry in function for custom labels.
Common pitfalls: Relying solely on provider metrics that aggregate cold/warm together.
Validation: Simulated traffic with varying concurrency to induce cold starts.
Outcome: Adjusted provisioned concurrency reducing cold-start tail; downstream timeouts handled by retries to minimize errors.
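A sketch of the cold-start tagging step above for a Python function runtime. The structured log line stands in for whatever telemetry client the platform provides, and the metric and label names are assumptions.

```python
import json
import time

_COLD_START = True  # module-level state survives across warm invocations in the same runtime

def handler(event, context=None):
    global _COLD_START
    cold = _COLD_START
    _COLD_START = False

    start = time.perf_counter()
    result = {"ok": True}  # real business logic goes here
    elapsed_ms = (time.perf_counter() - start) * 1000

    # Emit duration with a cold-start marker so cold and warm tails can be separated.
    print(json.dumps({
        "metric": "invocation_duration_ms",
        "value": round(elapsed_ms, 2),
        "cold_start": cold,
        "endpoint": "GET /items",  # illustrative label
    }))
    return result

handler({})
handler({})  # second call in the same runtime reports cold_start = False
```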
Scenario #3 — Incident-response/postmortem: Regression after deploy
Context: After deploy, customers reported errors; service did not automatically roll back.
Goal: Use RED metrics in postmortem to trace the regression and improve automation.
Why RED metrics matters here: Error rate increase and rate drop indicate the deployment caused regressions and potential routing issues.
Architecture / workflow: CD pipeline -> Kubernetes -> Service. Metrics and traces captured across deployment.
Step-by-step implementation:
- Check RED trend around deploy timestamp.
- Correlate errors with deployment metadata and trace spans.
- Identify failing endpoint and offending feature flag.
- Create rollback automation triggered by error budget burn > threshold.
What to measure: Error rate delta, per-deploy error attribution.
Tools to use and why: CI/CD pipeline hooks, SLO controller to compute burn rate.
Common pitfalls: Lack of deployment tagging in metrics making correlation slow.
Validation: Replay synthetic deploy in staging with same traffic profile.
Outcome: Automated rollback policy enacted and deploy pipeline revised to include canaries.
Scenario #4 — Cost/performance trade-off: Autoscaling vs latency
Context: Need to reduce cloud spend while keeping latency SLO.
Goal: Find optimal autoscale thresholds.
Why RED metrics matters here: Correlating RPS, concurrency, and latency enables cost-effective scaling.
Architecture / workflow: Load balancer -> services -> autoscaler based on CPU or custom metric.
Step-by-step implementation:
- Collect RED plus concurrency and resource metrics.
- Run controlled load tests to map latency vs instance count.
- Set autoscaler on custom metric tied to latency thresholds.
- Monitor error budgets to ensure SLOs are maintained.
What to measure: RPS, p95/p99 latency, instance count, cost per interval.
Tools to use and why: Metrics pipeline, cost analytics, autoscaler controllers.
Common pitfalls: Autoscaler reacting to CPU, which may not reflect request wait time.
Validation: Gradual traffic ramp and rollback if SLOs are breached.
Outcome: Reduced spend with latency kept within the SLO.
Common Mistakes, Anti-patterns, and Troubleshooting
- Symptom: No alerts despite user complaints -> Root cause: Metrics not instrumented at entry point -> Fix: Add request-level instrumentation and synthetic checks.
- Symptom: High query cost and slow dashboards -> Root cause: High cardinality labels -> Fix: Enforce cardinality budget and reduce labels.
- Symptom: Alerts during deploy window only -> Root cause: No suppression for deploys -> Fix: Add maintenance windows or automated suppression tied to deployments.
- Symptom: p95 stable but users complain -> Root cause: p99 tail issues ignored -> Fix: Add p99 and tail histogram monitoring.
- Symptom: Error SLI drops but business metrics unaffected -> Root cause: Counting non-business-impacting errors -> Fix: Define business-error classification.
- Symptom: Tracing shows missing spans -> Root cause: Sampling too aggressive -> Fix: Adjust sampling to capture error paths.
- Symptom: Metrics gaps during outage -> Root cause: Collector outage -> Fix: Redundant collectors and buffering.
- Symptom: Over-alerting -> Root cause: Short aggregation windows and low thresholds -> Fix: Increase windows and use burn-rate logic.
- Symptom: Alerts not actionable -> Root cause: Poorly written runbooks -> Fix: Update runbooks with clear triage steps.
- Symptom: Misleading percentiles -> Root cause: Wrong histogram buckets -> Fix: Reconfigure instrument buckets and backfill if possible.
- Symptom: Unexpected rate spike -> Root cause: Misrouted traffic or bot abuse -> Fix: Rate-limiting, WAF rules, and traffic analysis.
- Symptom: Metrics explode after feature launch -> Root cause: Per-user labels causing cardinality -> Fix: Aggregate to tenant or bucketing.
- Symptom: Slow cross-service debugging -> Root cause: Lack of trace context propagation -> Fix: Ensure trace IDs propagate via headers.
- Symptom: Error budget burn unnoticed -> Root cause: No SLO controller -> Fix: Implement SLO monitoring and burn notifications.
- Symptom: Incidents reoccur -> Root cause: No postmortem action items or measurement -> Fix: Enforce postmortem and track remediation via metrics.
- Symptom: Resource saturation not linked to increased latency -> Root cause: Missing saturation metrics -> Fix: Add CPU, memory, queue depth metrics correlated with RED.
- Symptom: Alerts fired for known maintenance -> Root cause: No alert suppression -> Fix: Integrate deploy signals with alerting to suppress expected spikes.
- Symptom: Slow query in DB causing latency -> Root cause: Unoptimized queries -> Fix: Add DB monitoring and trace DB spans for slow queries.
- Symptom: Blind spots in serverless -> Root cause: Relying only on provider aggregates -> Fix: Add function-level instrumentation and custom labels.
- Symptom: Incorrect error classification -> Root cause: Counting 3xx or acceptable redirects as errors -> Fix: Define clear error mapping.
- Symptom: Observability pipeline cost blows up -> Root cause: Uncontrolled retention and cardinality -> Fix: Apply retention policies and summarize high-resolution data.
- Symptom: SLO alerts flood pager during traffic surge -> Root cause: Static thresholds not adaptive -> Fix: Use burn-rate and adaptive thresholds.
- Symptom: Misinterpretation of rate changes -> Root cause: Retry storms inflate Rate -> Fix: Track retry counts separately.
- Symptom: Debugging slow due to lack of dashboards -> Root cause: Missing on-call dashboard -> Fix: Build targeted dashboards for common incidents.
Best Practices & Operating Model
Ownership and on-call
- Define metric ownership per service and a primary SLO owner.
- On-call responsibilities include monitoring SLOs and executing runbooks.
Runbooks vs playbooks
- Runbooks: deterministic steps for known alerts.
- Playbooks: higher-level guidance for novel incidents; include escalation paths.
Safe deployments (canary/rollback)
- Use canaries with RED SLI comparison to baseline.
- Automate rollback on burn-rate thresholds and sustained SLO violation.
Toil reduction and automation
- Automate diagnostics for common RED alerts (e.g., gather top traces and DB slow queries).
- Use templates and runbook automation to reduce manual steps.
Security basics
- Monitor unusual Rate spikes or unusual error patterns as security signals.
- Ensure metrics and observability pipelines are access controlled and encrypted.
Weekly/monthly routines
- Weekly: review SLO burn and top alerts, inspect cardinality budget.
- Monthly: review instrumentation coverage and runbook accuracy, cost review for telemetry.
What to review in postmortems related to RED metrics
- Were RED signals sufficient to detect the incident?
- Was instrumentation adequate to diagnose root cause?
- Did SLOs and alerting thresholds operate as intended?
- Action items for improving metrics, dashboards, and automation.
Tooling & Integration Map for RED metrics
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Metrics store | Stores time series metrics | Scrapers, exporters, APM | Choose retention and scale strategy |
| I2 | Tracing backend | Stores distributed traces | OpenTelemetry, APM | Correlates duration and errors |
| I3 | Alerting system | Evaluates rules and routes alerts | Pager, incident platform | Integrate deployment signals |
| I4 | Visualization | Dashboards and panels | Metrics store, traces | Separate exec and on-call views |
| I5 | Collector | Aggregates telemetry | SDKs and agents | Central point to enforce policies |
| I6 | SLO controller | Calculates SLOs and burn-rate | Metrics store, alerting | Drives automated actions |
| I7 | CI/CD integration | Emits deployment metadata to telemetry | CD tools, metrics | Enables alert suppression during deploys |
| I8 | Security monitoring | Uses RED signals for anomaly detection | WAF, SIEM | Correlate with access logs |
| I9 | Cost analytics | Maps telemetry to cost | Cloud billing data | Optimize telemetry spend |
| I10 | Policy engine | Enforces cardinality budgets | Collector, CI checks | Prevents runaway metrics |
Frequently Asked Questions (FAQs)
What are RED metrics best used for?
RED metrics are best for request-driven applications to provide fast, actionable SLIs for SLOs and incident triage.
Are RED metrics enough for all services?
No. For batch, streaming, or internal libraries, other SLIs like job success rate, lag, or throughput metrics are more appropriate.
How do I choose p95 vs p99?
Choose p95 for common user experience and p99 for tail UX criticality; both can be used with different SLOs depending on customer expectations.
How do I avoid cardinality issues?
Limit labels to stable dimensions, avoid per-request identifiers, and enforce a cardinality budget in CI/CD.
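One hedged way to enforce that in code is to normalize raw request paths into a bounded set of route templates before they become label values; the patterns below are illustrative.

```python
import re

# Illustrative route patterns; anything unmatched collapses into one bucket
# so a scanner hitting random URLs cannot explode label cardinality.
ROUTES = [
    (re.compile(r"^/users/\d+$"), "/users/{id}"),
    (re.compile(r"^/orders/[0-9a-f-]+$"), "/orders/{uuid}"),
    (re.compile(r"^/healthz$"), "/healthz"),
]

def endpoint_label(raw_path: str) -> str:
    for pattern, template in ROUTES:
        if pattern.match(raw_path):
            return template
    return "other"

print(endpoint_label("/users/42"))        # /users/{id}
print(endpoint_label("/orders/ab12-cd"))  # /orders/{uuid}
print(endpoint_label("/wp-admin.php"))    # other
```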
Can RED metrics be used for serverless?
Yes, but rely on function-level instrumentation and provider metrics; capture cold-start markers and invocation context.
How long should metric retention be?
Keep high-resolution data short term (for example, 90 days) and downsampled data long term (1+ years), depending on compliance and trend-analysis needs.
How to handle retries in Rate?
Track retry counts separately and deduplicate or mark retries so Rate reflects user-originated traffic if desired.
What aggregation window to use for alerts?
Use a balance like 5m or 10m for detection and longer windows for burn-rate evaluation to reduce noise.
How to correlate RED with business metrics?
Map critical endpoints to business transactions and ensure business SLIs are exposed alongside RED metrics.
Should error budgets trigger automatic rollbacks?
They can, with careful tuning and safety checks, but ensure rollback automation is tested to avoid flapping.
How to measure errors accurately?
Define what an error is (HTTP 5xx, application error codes, business failures) and consistently record them.
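A small sketch of one possible classification, consistent with the answer above; the mapping of status codes and business failures is an assumption to adapt per service.

```python
def is_error(status_code: int, business_failure: bool = False) -> bool:
    """Server faults (5xx) and explicit business failures count as errors.
    Redirects (3xx), client errors (4xx), and throttling (429) are tracked
    separately and not counted against the error SLI here."""
    return business_failure or 500 <= status_code <= 599

print(is_error(503))                         # True: server fault
print(is_error(302))                         # False: redirect is not an error
print(is_error(200, business_failure=True))  # True: e.g. a payment that silently failed
```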
What if latency is high but errors are low?
Investigate downstream slow calls, queueing, and resource saturation; use traces to find blocking operations.
How to prevent alert storms during deploys?
Integrate deploy signals with alerting to suppress or lower sensitivity during validated deployment windows.
What role does AI/automation play with RED?
AI can surface anomalies, group related alerts, and suggest remediation, but human validation is required for automation of critical actions.
Can RED help with security incidents?
Yes, anomalous rate spikes or error patterns can be early indicators of abuse or attacks and should feed security workflows.
How to set initial targets for SLOs?
Use historical performance as baseline and meet customer expectations; iterate with error budgets and gradual tightening.
How to instrument libraries used across services?
Expose sanitized metrics from libraries and provide configuration to tenant services to avoid label explosion.
How to test RED metric alerts?
Run synthetic traffic and chaos experiments to validate alert behavior and ensure runbooks work as intended.
Conclusion
RED metrics are a pragmatic, request-focused SLI pattern that supports fast triage, SLO-driven operations, and safer deployments in cloud-native environments. They are not a complete observability solution but a high-leverage starting point for SRE practice, automation, and cost-effective monitoring.
Next 7 days plan
- Day 1: Inventory services and identify top endpoints to instrument.
- Day 2: Add or validate request counters, error counters, and duration histograms.
- Day 3: Configure metrics pipeline and build on-call and debug dashboards.
- Day 4: Define SLOs for critical services and set initial alert rules.
- Day 5–7: Run synthetic tests, adjust thresholds, document runbooks, and schedule a game day.
Appendix — RED metrics Keyword Cluster (SEO)
- Primary keywords
- RED metrics
- RED metrics guide
- RED metrics SRE
- Rate Errors Duration
- RED SLI SLO
- Secondary keywords
- request-centric monitoring
- RED metrics Kubernetes
- RED metrics serverless
- RED metrics Prometheus
- RED metrics best practices
- Long-tail questions
- what are RED metrics in SRE
- how to implement RED metrics in Kubernetes
- RED metrics vs golden signals
- can RED metrics detect DoS attacks
- measuring RED metrics with OpenTelemetry
- how to create SLOs from RED metrics
- RED metrics for serverless cold starts
- alerting strategies for RED metrics
- common RED metrics mistakes to avoid
- how to reduce cardinality in RED metrics
- Related terminology
- service level indicator
- service level objective
- error budget burn
- p95 latency
- p99 latency
- latency histogram
- request throughput
- rate limiting
- autoscaling policies
- canary deployment
- distributed tracing
- OpenTelemetry
- Prometheus histogram
- metrics cardinality
- observability pipeline
- burn-rate alerting
- synthetic monitoring
- traceroute for web apps
- runtime metrics
- telemetry retention
- remote write metrics
- sidecar exporter
- collector pipeline
- incident response runbook
- postmortem reliability
- feature flag monitoring
- error classification
- percentile estimation
- tail latency troubleshooting
- resource saturation indicators
- chaos game day
- automated rollback policy
- deployment tagging
- metrics ingestion lag
- trace sampling strategy
- histogram bucket design
- cardinality budget policy
- tenant-level SLIs
- business-level SLOs
- observability cost optimization
- AI anomaly detection for metrics
- security monitoring via RED
- WAF and RED signals
- DB call latency
- request concurrency gauge
- throttling metrics
- cold-start identifier
- cloud provider metrics
- APM integration for RED