From Social Outage to Disaster Recovery: Building an Incident Runbook for Public-Facing Services
A practical runbook template for public-facing services: public comms, DNS failover, cached-content fallback, and a tight postmortem.
When the CDN that serves millions of customers hiccups, your users don’t care about provider names — they care that your product is down. If you run a public-facing consumer service that depends on third-party CDNs, you need a runbook that covers public communication, DNS failover, cached-content strategies and a pragmatic postmortem workflow. This article gives a ready-to-use, operational runbook template and practical commands to reduce customer pain during CDN outages and accelerate reliable recovery.
Executive summary — why this matters in 2026
High-profile CDN and edge outages in late 2025 and early 2026 reminded teams that one vendor incident can cascade into a brand crisis. Organizations are shifting toward multi-CDN, origin fallback patterns, and clearer public communication. In this climate, an incident runbook must be more than checkboxes: it must include templates for status updates, scripted DNS failover actions, cached-content fallbacks, verification steps, and a tight postmortem that produces measurable preventative work.
What you’ll get
- A prioritized incident runbook template for consumer-facing services
- Practical commands and verification checks for DNS failover and caches
- Public communication templates for status pages and social channels
- A compact but effective postmortem structure
Incident classification and responsibilities
Start with a short incident classification so responders share a common language. A simple scale works:
- P0 — Brand-impacting outage: Site or primary flows down for a large percentage of users, social noise, media attention.
- P1 — Major degradation: Significant partial outages, performance severe enough to cause user churn.
- P2 — Functional issue: Edge/feature broken but with workarounds or low impact.
Core roles (RACI-lite):
- Incident Lead: coordinates triage, public comms, and the timeline.
- Infra Lead: owns DNS failover, origin health, and CDN configuration.
- Comm Lead: owns status-page and social messages, plus legal/PR escalation.
- SRE/Dev: runs verification checks and applies hotfixes.
Quick “first 15 minutes” checklist
- Confirm scope: Are all endpoints affected? Is it static assets only or the application backend?
- Open an incident channel (Slack/Mattermost) and document the start time.
- Publish a short status page entry: "We are investigating reports of service disruption." (See templates below.)
- Run basic reachability checks: curl, dig, and compare vantage points.
- Decide: apply CDN-level mitigation (purge / reconfigure) or trigger DNS failover to alternate origin or CDN.
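The first-15-minutes checks above can be scripted so responders paste one command instead of five. A minimal sketch, assuming your public hostname is passed as the first argument (the hostname and log filename here are placeholders):

```shell
#!/bin/sh
# First-15-minutes triage sketch. Logs every check with a UTC timestamp
# so the output doubles as the start of the incident timeline.
HOST="${1:-www.example.com}"
LOG="incident-$(date -u +%Y%m%dT%H%M%SZ).log"

ts() {
  # Prefix a message with a UTC timestamp.
  echo "$(date -u +%H:%M:%SZ) $*"
}

triage() {
  ts "triage start for $HOST"
  ts "HTTP headers:";   curl -sI "https://$HOST" | head -5
  ts "DNS (8.8.8.8):";  dig +short "$HOST" @8.8.8.8
  ts "DNS (1.1.1.1):";  dig +short "$HOST" @1.1.1.1
}

# Usage during an incident (tee preserves the timeline for the postmortem):
#   ./triage.sh www.example.com | tee "$LOG"
```

Because the output is timestamped, it can be pasted directly into the incident channel and later into the postmortem timeline.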
Detection & verification — commands you should run immediately
Run these from multiple geographic vantage points (local machine, cloud VM in different region, and a trusted remote test host).
curl -I https://www.example.com # quick HTTP response headers
curl https://www.example.com --resolve "www.example.com:443:203.0.113.5" -v # test specific IP
dig +short www.example.com @8.8.8.8
dig +trace www.example.com
Check CDN/edge response headers: look for headers that identify the CDN (Server, via, cf-ray, x-cache). If edge headers are missing or show 503/524, note the pattern.
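The header inspection above can be wrapped in a small classifier. The markers below (cf-ray for Cloudflare, x-amz-cf-id for CloudFront, x-served-by for Fastly/Varnish) are common but not guaranteed, so treat the result as a hint, not proof:

```shell
#!/bin/sh
# Identify which CDN answered, based on response headers read from stdin.
# Header markers are heuristics; providers can change or suppress them.
classify_cdn() {
  headers=$(tr 'A-Z' 'a-z')   # normalize header case before matching
  case "$headers" in
    *cf-ray*)       echo "cloudflare" ;;
    *x-amz-cf-id*)  echo "cloudfront" ;;
    *x-served-by*)  echo "fastly-or-varnish" ;;
    *x-cache*)      echo "unknown-cdn-with-cache-header" ;;
    *)              echo "no-cdn-markers" ;;
  esac
}

# Usage:
#   curl -sI https://www.example.com | classify_cdn
```

If the classifier reports "no-cdn-markers" when you expect a CDN, that is itself a useful signal: traffic may already be bypassing the edge.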
Public communication — templates and cadence
Transparent, factual and frequent updates reduce brand damage. Use short, timestamped messages and escalate tone as the incident evolves.
Initial status (under 15 minutes)
Headline: Investigating service disruption (start-time UTC)
Body: We’re aware some customers can’t access the site or are seeing errors. Our engineers are investigating. We’ll post updates here every 30 minutes. No action is required from users at this time.
15–60 minute follow-up (if confirmed CDN/edge issue)
Headline: Service degraded due to CDN outage (start-time UTC)
Body: Our telemetry shows errors from our CDN provider affecting asset delivery and some page loads. We’re working on origin fallbacks and a DNS failover plan. Estimated next update: +30 minutes.
Workaround: Retry after 1–2 minutes. We’ll post a confirmed workaround when available.
Pinned update (when failing over)
Headline: Failing over traffic to backup origin/CDN (time UTC)
Body: We are routing a portion of traffic to a backup CDN and enabling cached content fallback. You may experience short spikes in latency as DNS propagates. We will confirm when normal service resumes.
DNS failover — strategies, tradeoffs, and scripted actions
DNS failover is powerful but imperfect. DNS caching, DoH/DoT, ISP resolvers and client caches cause variable propagation delays. Use DNS failover when you need an automated switch to a healthy origin or a secondary CDN, and pair it with health checks and traffic steering where possible.
Design principles
- Low TTLs before incidents: set a reasonable low TTL for services where fast failover is business-critical. Suggested baseline: 60–300 seconds for service endpoints; 3600s for assets that benefit from CDN caching.
- Secondary authoritative DNS: use a DNS provider that supports API-driven changes and has low latency for updates. Consider multi-authoritative DNS to avoid single-vendor failure.
- Health checks: automate origin and CDN health checks; tie weighted or failover DNS records to health state.
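It helps to know what TTL resolvers currently hold before an incident forces your hand. A small sketch that extracts the remaining TTL from standard dig answer lines (hostname is a placeholder):

```shell
#!/bin/sh
# Print the remaining TTL a resolver holds for a record.
# Parses standard dig answer lines of the form: "name. TTL IN A addr".
ttl_of() {
  awk '$3 == "IN" && ($4 == "A" || $4 == "AAAA" || $4 == "CNAME") {print $2; exit}'
}

# Usage (live):
#   dig +noall +answer www.example.com @8.8.8.8 | ttl_of
```

Run this against several public resolvers; if the reported TTL is far above your configured value, that resolver is caching longer than you asked and failover will reach its users late.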
Simple failover workflow (manual via DNS API)
- Identify authoritative zone and current A/AAAA/CNAME that points to CDN.
- Prepare alternate record (backup origin IP / secondary CDN CNAME).
- Execute API update to swap records; set low TTL temporarily if possible.
- Verify propagation with dig from multiple resolvers.
# verification examples (no vendor-specific API calls)
dig +short www.example.com @8.8.8.8
dig +short www.example.com @1.1.1.1
curl -I https://www.example.com --resolve "www.example.com:443:203.0.113.5"
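After swapping records, the verification step can be automated as a polling loop that asks several public resolvers whether they have picked up the new target. A sketch, assuming a placeholder backup IP and resolver list:

```shell
#!/bin/sh
# Poll public resolvers until all return the expected failover target.
EXPECTED="203.0.113.99"          # backup origin IP (placeholder)
HOST="www.example.com"
RESOLVERS="8.8.8.8 1.1.1.1 9.9.9.9"

matches_expected() {
  # True when the resolver's answer contains the expected address.
  echo "$1" | grep -q "^$2$"
}

poll_once() {
  for r in $RESOLVERS; do
    answer=$(dig +short "$HOST" @"$r")
    if matches_expected "$answer" "$EXPECTED"; then
      echo "$r: OK"
    else
      echo "$r: still old ($answer)"
    fi
  done
}

# Usage: run every 30 seconds until all resolvers report OK:
#   while true; do poll_once; sleep 30; done
```

Expect a mixed picture for several minutes: some resolvers flip immediately, others hold the old answer until their cached TTL expires.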
Practical notes and gotchas
- TTL is a suggestion: Many resolvers ignore low TTLs and cache longer. Expect partial coverage during the first few minutes.
- SSL/TLS: If you point clients to a backup origin IP, ensure the TLS certificate covers that hostname or use SNI-friendly endpoints.
- Cookies and session affinity: Switchover can break sticky sessions. Use shared session stores or token-based auth to reduce impact.
- DoH/DoT cache behavior: DNS-over-HTTPS implementations may cache results in client or upstream resolvers longer than classic TTL semantics.
Cached content and origin-fallback strategies
When a CDN edge fails, cached assets and intelligent fallbacks can keep the site usable. Plan for three layers of fallback:
- Edge cache: use stale-while-revalidate and stale-if-error to allow edges to serve slightly stale content during upstream failures.
- Origin cached assets: keep a CDN-agnostic object store (S3, MinIO) with cross-region replication for serving via alternate endpoints.
- Client-side fallback: service-worker or app-level cache that serves a shell UX when remote assets are unavailable.
Cache headers you should adopt
# Minimal example headers to enable graceful cache fallbacks (set by origin)
Cache-Control: public, max-age=3600, stale-while-revalidate=60, stale-if-error=86400
ETag: W/"v12345"
Where your provider supports it, set Surrogate-Control headers for the edge TTL and a shorter max-age for browsers. This gives you independent control over edge and browser caching.
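One way to emit these headers at the origin is an nginx config fragment like the following. The location path is a placeholder, and Surrogate-Control support varies by CDN (Fastly honors it; check your provider's documentation):

```nginx
# Origin config sketch: browsers get a short max-age plus stale fallbacks,
# while the edge gets a longer TTL via Surrogate-Control.
location /assets/ {
    add_header Cache-Control "public, max-age=300, stale-while-revalidate=60, stale-if-error=86400";
    add_header Surrogate-Control "max-age=86400";
    etag on;
}
```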
Static site emergency hosting recipe
- Keep static builds in a versioned object bucket that you can expose over HTTPS from a backup CDN or directly via static hosting (S3 + CloudFront, or any S3-compatible public endpoint).
- Pre-build a minimal offline shell (index.html) that uses local assets and lazy-loads heavy assets when available.
- When CDN outage happens, flip DNS CNAME for your static hostname to the backup origin/bucket.
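The pre-staging step in the recipe above can be sketched with the AWS CLI (or any S3-compatible tool). Bucket name and directory are placeholders, and the script assumes credentials are already configured:

```shell
#!/bin/sh
# Pre-stage a versioned static build for emergency hosting.
BUCKET="s3://example-static-failover"   # placeholder bucket

release_path() {
  # Versioned, immutable path: releases/<UTC date>-<short git sha>
  echo "releases/$(date -u +%Y%m%d)-$1"
}

publish() {
  sha="$1"; dir="$2"
  # Keep every release addressable for rollback...
  aws s3 sync "$dir" "$BUCKET/$(release_path "$sha")" --delete
  # ...and point a stable "current" prefix at the newest one, so the
  # backup CNAME always serves a known path.
  aws s3 sync "$dir" "$BUCKET/current" --delete
}

# Usage after each production build:
#   publish "$(git rev-parse --short HEAD)" ./dist
```

Running publish on every deploy means the emergency bucket is never stale when you flip the CNAME.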
Verification checklist after failover
- Confirm the authoritative DNS change: dig +trace plus queries against target resolvers.
- Confirm TLS handshake and certificate chain from multiple locations.
- Load several critical user journeys and synthetic checks from multiple regions.
- Monitor error rates and latency for 30m–2h for stability.
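The journey checks in this list can be scripted as a minimal synthetic probe. The paths below are placeholders for your own critical journeys:

```shell
#!/bin/sh
# Post-failover synthetic check sketch: fetch critical journeys and
# flag anything outside 2xx/3xx.
HOST="www.example.com"
PATHS="/ /login /api/health"   # placeholder critical journeys

status_ok() {
  # True for 2xx and 3xx HTTP status codes.
  case "$1" in 2??|3??) return 0 ;; *) return 1 ;; esac
}

check_journeys() {
  for p in $PATHS; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "https://$HOST$p")
    if status_ok "$code"; then
      echo "OK   $p ($code)"
    else
      echo "FAIL $p ($code)"
    fi
  done
}

# Usage: run from several regions and keep watching for 30m–2h:
#   check_journeys
```

Running the same probe from multiple regions catches partial propagation: one region may be healthy on the backup while another still hits the failed edge.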
Communication cadence during the incident
- Initial (0–15m): publish investigation notice.
- Ongoing (every 30–60m): updates about mitigations and next steps.
- Resolved: short note describing restoration and whether it’s temporary or permanent.
- Postmortem window: announce when a full postmortem will be published (usually within 48–72 hours for P0s).
Postmortem — a templated structure that drives remediation
High-quality postmortems avoid blame and focus on facts, impact, and measurable fixes. Use this compact template and publish it externally for transparency (when appropriate).
Postmortem template
- Title and summary: One-paragraph summary of what happened and impact.
- Timeline: minute-level sequence from first alert to full recovery; include commands run and config changes.
- Root cause: concise technical root cause and contributing factors.
- Impact: concrete metrics — uptime lost, user sessions affected, error spikes, revenue/customer support load.
- What went well: defensive measures that reduced impact.
- What went poorly: where the runbook or automation failed.
- Action items: prioritized, assigned, and with due dates. Include verification criteria for each fix.
- Follow-up review: schedule a 2-week check-in on action item status.
Sample postmortem action items
- Implement multi-CDN for static assets and test monthly failover (owner: Infra, due: 30 days).
- Lower DNS TTL to 60s for critical hostnames and document rollback steps (owner: DNS team, due: 7 days).
- Publish a public postmortem draft and FAQ for impacted customers (owner: Comm, due: 48 hours).
- Create synthetic checks for CDN error patterns and add to PagerDuty alerting (owner: SRE, due: 14 days).
Automation snippets and runbook artifacts (examples)
Keep these snippets in your runbook repository (Git) so responders can copy/paste in an incident.
DNS verification script (unix shell)
#!/bin/sh
# DNS verification: compare answers from two public resolvers, then
# fetch response headers to confirm the edge is serving.
set -eu
HOST="${1:-www.example.com}"
echo "Public DNS (8.8.8.8):"
dig +short "$HOST" @8.8.8.8
echo "Cloudflare DNS (1.1.1.1):"
dig +short "$HOST" @1.1.1.1
echo "HTTP headers:"
curl -sI "https://$HOST"
Minimal status page checklist
- Headline (short)
- Start time
- Impact summary
- What we’re doing
- Next update ETA
Testing and exercises — don’t wait for the real outage
Run regular tabletop exercises and scheduled failovers. In 2026, teams increasingly rely on planned multi-CDN failover drills and “switch the DNS” rehearsals that validate both technical steps and comms templates. Practical frequency:
- Monthly runbook walkthroughs with on-call rotation
- Quarterly failover drills (DNS/backup origin)
- Annual core business continuity test (includes comms and support teams)
Trends and considerations for 2026 and beyond
Late 2025 and early 2026 outages accelerated a few durable trends:
- Multi-CDN and orchestration: More teams adopt multi-CDN to reduce single-vendor blast radius; orchestration tooling is maturing to make switching fast and auditable.
- Edge compute sprawl: With more logic at the edge, origin fallback and coherent cache policies are even more important.
- DNS behavior complexity: DoH/DoT and ISP resolver caching behavior mean DNS-based failover will never be instant for all users; pair DNS with origin-level backstops.
- Expectation of transparency: Users now expect fast public updates. A calm, factual status page reduces social amplification and support volume.
"Fast, frequent, factual." — the three principles for customer-facing incident communication in 2026.
Final checklist — keep this pinned in your incident channel
- Have a single Incident Lead and keep updates brief and factual.
- Run DNS and cache verification from multiple locations immediately.
- Decide early: DNS failover, origin fallback or provider-side mitigation.
- Push public updates every 30–60 minutes until stable.
- Publish a public postmortem with clear action items within 72 hours for P0 incidents.
Actionable takeaways
- Embed this runbook template in your on-call handbook and test it quarterly.
- Store pre-built static shells and versioned objects to enable fast static failover.
- Automate DNS and CDN changes via provider APIs and pre-approved scripts to avoid manual errors under pressure.
- Prepare comms templates and the postmortem outline ahead of time — publishing within 72 hours increases trust.
Call to action
Download the ready-to-run incident runbook (YAML + scripts + status templates) we use at solitary.cloud, import it into your incident management repo, and run a simulated CDN failover this month. If you prefer a hands-off approach, contact our team to run a failover drill and harden your DNS and cached-content fallbacks — we’ll provide a tailored runbook and a 30-day remediation plan.