Website Downtime Tracking: How to Measure and Report Outages

Learn how to track website downtime, measure availability, and create outage reports. Practical guide for SREs and DevOps teams.

· Project Helena · 4 min read ·

Knowing your website went down is step one. Tracking when, how long, and how often is what turns incidents into actionable data. Here’s how to build a downtime tracking practice that improves reliability over time.

What is Downtime Tracking?

Downtime tracking is the systematic recording of every period when your service is unavailable or degraded. It goes beyond uptime monitoring (which detects incidents in real-time) to provide historical data for SLA reporting, trend analysis, and capacity planning.

A complete downtime record includes:

  • Start time — When the outage began (detected by monitoring)
  • End time — When the service recovered
  • Duration — Total downtime in minutes/seconds
  • Impact — Which services/users were affected
  • Root cause — What caused the outage
  • Category — Infrastructure, deployment, dependency, etc.
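As a rough sketch, the fields above could be held in a small record type. The class name, the `planned` flag, and the sample values below are illustrative assumptions, not part of any particular tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DowntimeRecord:
    start: datetime        # when monitoring first detected the outage
    end: datetime          # when the service recovered
    impact: str            # which services/users were affected
    root_cause: str        # what caused the outage
    category: str          # e.g. "infrastructure", "deployment", "dependency"
    planned: bool = False  # hypothetical flag for scheduled maintenance

    @property
    def duration(self) -> timedelta:
        # Derived from the timestamps so it can never disagree with them.
        return self.end - self.start


# Made-up example incident, for illustration only.
record = DowntimeRecord(
    start=datetime(2024, 3, 5, 14, 2, 0),
    end=datetime(2024, 3, 5, 14, 6, 20),
    impact="Checkout API, all regions",
    root_cause="Bad config pushed in deploy",
    category="deployment",
)
print(record.duration.total_seconds())  # 260.0
```

Deriving the duration instead of storing it keeps the record precise to the second, which matters later for SLA calculations.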

How to Measure Downtime

Method 1: External Monitoring

External uptime monitoring tools check your service from outside your network and record every failure. This is usually the most reliable primary method because it measures availability from the user's side of the network.

The measurement formula:

Downtime = Number of failed checks x Check interval

At 1-minute check intervals, 5 consecutive failed checks mean approximately 5 minutes of downtime. The estimate can be off by up to one check interval at each end of the outage, so higher-frequency checks (for example, every 10 seconds) give more precise measurements.
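A minimal sketch of that formula, assuming check results arrive as time-ordered `(timestamp, ok)` pairs. The function name and the sample data are hypothetical; real monitoring tools expose this data in their own formats:

```python
from datetime import datetime, timedelta


def outages_from_checks(checks, interval):
    """Group consecutive failed checks into (start, estimated_duration) pairs."""
    outages = []
    start = None
    failed = 0
    for ts, ok in checks:
        if not ok:
            if start is None:
                start = ts  # first failed check marks the outage start
            failed += 1
        elif start is not None:
            # Downtime = number of failed checks x check interval
            outages.append((start, failed * interval))
            start, failed = None, 0
    if start is not None:
        # Outage still ongoing at the end of the data
        outages.append((start, failed * interval))
    return outages


# Made-up data: eight checks one minute apart, five consecutive failures.
t0 = datetime(2024, 3, 5, 14, 0)
results = [True, True, False, False, False, False, False, True]
checks = [(t0 + timedelta(minutes=i), ok) for i, ok in enumerate(results)]

outages = outages_from_checks(checks, timedelta(minutes=1))
print(outages[0][1])  # 0:05:00
```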

Method 2: Log Analysis

Parse your server access logs and error logs to identify periods with elevated error rates or zero traffic. This catches issues monitoring might miss but requires log infrastructure.
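One way to sketch this, assuming access logs in the common/combined log format with a bracketed timestamp and a three-digit status code. The regex, the 50% threshold, and the sample lines are assumptions chosen for illustration. This sketch flags minutes with elevated 5xx rates only; detecting zero-traffic minutes would need a separate pass over the expected time range:

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches: [05/Mar/2024:14:02:05 +0000] "GET / HTTP/1.1" 502
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3})')


def error_minutes(lines, threshold=0.5):
    """Return the minutes in which the share of 5xx responses met the threshold."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines that do not look like access-log entries
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        minute = ts.replace(second=0)
        totals[minute] += 1
        if m.group("status").startswith("5"):
            errors[minute] += 1
    return sorted(m for m in totals if errors[m] / totals[m] >= threshold)


# Made-up log lines for illustration.
lines = [
    '10.0.0.1 - - [05/Mar/2024:14:01:10 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [05/Mar/2024:14:02:05 +0000] "GET / HTTP/1.1" 502 0',
    '10.0.0.3 - - [05/Mar/2024:14:02:40 +0000] "GET /api HTTP/1.1" 503 0',
    '10.0.0.1 - - [05/Mar/2024:14:03:15 +0000] "GET / HTTP/1.1" 200 512',
]
bad = error_minutes(lines)
print([m.strftime("%H:%M") for m in bad])  # ['14:02']
```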

Method 3: Real User Monitoring (RUM)

Collect availability data from actual user sessions. This shows real user impact but can’t detect outages during low-traffic periods (middle of the night, holidays).

Combining Methods

The best practice is to use external monitoring as your primary source of truth and supplement it with log data and RUM for context. The monitoring tool defines when downtime starts and ends; logs and RUM explain the impact.

Calculating Availability

The standard formula:

Availability % = ((Total minutes - Downtime minutes) / Total minutes) x 100

For a 30-day month (43,200 minutes) with 45 minutes of downtime:

((43,200 - 45) / 43,200) x 100 = 99.896%
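The same calculation as a small function, reproducing the worked example above. The function name is just an illustrative choice:

```python
def availability_pct(total_minutes, downtime_minutes):
    """Availability % = ((total - downtime) / total) x 100"""
    return (total_minutes - downtime_minutes) / total_minutes * 100


# 30-day month (43,200 minutes) with 45 minutes of downtime
result = availability_pct(43_200, 45)
print(round(result, 3))  # 99.896
```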

Use the uptime calculator to convert between downtime duration and availability percentage.

Creating Downtime Reports

Monthly SLA Report

Include:

  1. Overall availability — The headline number (e.g., 99.95%)
  2. Incident summary — Each outage with duration, impact, and root cause
  3. Trend chart — Monthly availability over the last 12 months
  4. Error budget status — How much of your error budget was consumed
  5. Action items — What you’re doing to prevent recurrence
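For the error budget line, a hedged sketch: the budget is taken here as the downtime an availability target allows over the reporting period. The 99.9% target is an assumed example; substitute your own SLO or SLA figure:

```python
def error_budget_minutes(total_minutes, target_pct):
    """Downtime allowed by an availability target over the period."""
    return total_minutes * (100 - target_pct) / 100


def budget_consumed_pct(downtime_minutes, total_minutes, target_pct):
    """Share of the error budget used up by the recorded downtime."""
    return downtime_minutes / error_budget_minutes(total_minutes, target_pct) * 100


# Assumed example: 30-day month, 99.9% target, 45 minutes of downtime.
budget = error_budget_minutes(43_200, 99.9)
consumed = budget_consumed_pct(45, 43_200, 99.9)
print(round(budget, 1))    # 43.2
print(round(consumed, 1))  # 104.2
```

A value above 100% means the month's downtime exceeded what the target allows.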

Incident Report (Per Outage)

For each significant outage, create a postmortem:

  • Timeline of events
  • Root cause analysis
  • Impact assessment (users affected, financial cost)
  • Corrective actions with owners and deadlines

Tracking Downtime Over Time

The most useful metric isn’t a single month’s availability; it’s the trend. Track:

  • Incidents per month — Is frequency increasing or decreasing?
  • Mean Time Between Failures (MTBF) — Average time between incidents. Higher is better
  • Mean Time To Detect (MTTD) — How fast you find problems. Largely determined by check interval and alerting delay
  • Mean Time To Resolve (MTTR) — How fast you fix problems. Improved by runbooks and automation
  • Availability trend — Month-over-month and quarter-over-quarter

A service with 99.9% availability and improving MTTR is in better shape than a service with 99.99% availability and worsening incident frequency.
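MTTR and MTBF can be estimated from a plain list of incident start and end times. This sketch assumes MTTR is the mean outage duration and MTBF is the mean gap from one recovery to the next outage start; teams define both slightly differently, so treat these definitions and the sample incidents as assumptions:

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean outage duration across (start, end) pairs."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents):
    """Mean gap between one recovery and the next outage (needs 2+ incidents)."""
    ordered = sorted(incidents)
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return sum(gaps, timedelta()) / len(gaps)


# Made-up incidents for illustration.
incidents = [
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 20)),
    (datetime(2024, 3, 11, 10, 20), datetime(2024, 3, 11, 10, 30)),
    (datetime(2024, 3, 21, 10, 30), datetime(2024, 3, 21, 11, 0)),
]
print(mttr(incidents))  # 0:20:00
print(mtbf(incidents))  # 10 days, 0:00:00
```

Computing both per month or per quarter gives the trend lines rather than a single snapshot.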

Tools for Downtime Tracking

  1. Uptime monitoring tools — Warden, Uptime Robot, Better Uptime. These are your primary data source
  2. Status page platforms — Public-facing incident history for customers
  3. Incident management tools — PagerDuty, Opsgenie. Track incident lifecycle
  4. Spreadsheets — Surprisingly effective for small teams. Log every incident manually
  5. Custom dashboards — Grafana, Datadog. Visualize trends from monitoring data

Common Mistakes

  • Not tracking scheduled maintenance — Even planned downtime should be recorded (but flagged as planned)
  • Rounding generously — 4 minutes and 20 seconds of downtime is not “about 4 minutes.” Precision matters for SLA calculations
  • Ignoring partial outages — If 30% of users can’t access your API, that’s partial downtime worth tracking
  • No root cause categorization — Without categories, you can’t identify systemic issues (e.g., “50% of outages are deployment-related”)
  • Tracking without action — Data without follow-up actions is just noise. Every report should include improvement actions
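Once each incident carries a category, spotting systemic issues takes only a few lines. The category labels below are made-up sample data:

```python
from collections import Counter


def category_share(categories):
    """Return each category's percentage share of incidents, most common first."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.most_common()}


# Made-up incident categories for one quarter.
share = category_share(["deployment", "infrastructure", "deployment", "dependency"])
print(share["deployment"])  # 50.0
```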

Start tracking today. Even a simple spreadsheet with date, duration, and cause will reveal patterns within a few months.

