Research-based methodology. This guide draws on Atlassian Statuspage public docs, Better Stack’s engineering blog, the Cronitor and OpenStatus open-source codebases, Vercel Cron documentation, and our own builds with Claude. Where we have first-person experience we say so; otherwise we’re working from public sources.

Why a status page SaaS in 2026

Statuspage (now Atlassian) costs $29 per month for the smallest team plan with no monitoring included — you pay for the page, you pay separately for the uptime checks. Better Stack starts at $29/month and is excellent but priced for funded startups. Cronitor and OpenStatus are great products, but they’ve drifted upmarket. Underneath all of them are tens of thousands of indie SaaS builders, API-first products, and tiny teams that want a real status page with built-in monitoring, sub-$15/month, and one deep Slack or Discord integration. That’s the wedge for a solo founder in 2026.

This guide is for someone who wants to ship a paid status-page-plus-monitoring product in 4–6 weeks using Claude as their primary thinking partner. The hard work is not the public page — that’s a few hundred lines of React. The hard work is the monitoring engine that runs reliably forever, and the incident state machine that gives operators a place to actually communicate during a fire. If you want the broader build playbook first, our How to build a SaaS with Claude guide covers the general scaffolding workflow this one builds on top of.

Why status pages are harder than they look

The visible product is one page with a green checkmark and a 90-day uptime bar. The actual product is a small distributed system that has to run forever, hit your customers’ APIs from at least one external location, distinguish between “the internet is broken” and “the customer’s API is broken,” and never miss an alert.

Monitoring is the engine, not a feature

If you outsource monitoring to UptimeRobot and just render their data, you’re not a status page company — you’re a UI on top of UptimeRobot. The check engine is the moat. It needs to run every 1–5 minutes per monitor, store every result, and trigger incidents on configurable failure thresholds.

Timeseries data growth is real

One monitor at 1-minute frequency = 525,600 rows per year. 100 customers with 10 monitors each = 525M rows per year. If you put this in your main Postgres without a partitioning strategy, your incident dashboard query times out at month four. The fix is either Postgres time-partitioned tables, TimescaleDB hypertables, or a separate ClickHouse instance for check results.
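
To make the partitioning concrete, here is a minimal sketch of the check_runs parent table written as a node-postgres migration to match the worker stack. The column list follows the schema in Prompt 1 below; the pool setup and the example partition bounds are illustrative.

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// check_runs is a range-partitioned parent: each month's rows live in their own
// child table, so retention becomes DROP TABLE instead of a giant DELETE.
export async function migrateCheckRuns(): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS check_runs (
      id            uuid        DEFAULT gen_random_uuid(),
      monitor_id    uuid        NOT NULL,
      region        text        NOT NULL,
      started_at    timestamptz NOT NULL,
      latency_ms    integer,
      status_code   integer,
      success       boolean     NOT NULL,
      error_text    text,
      response_size integer,
      PRIMARY KEY (id, started_at)  -- the partition key must be part of the key
    ) PARTITION BY RANGE (started_at);

    -- one child table per month (example bounds)
    CREATE TABLE IF NOT EXISTS check_runs_2026_01
      PARTITION OF check_runs
      FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

    CREATE INDEX IF NOT EXISTS check_runs_2026_01_monitor_started_idx
      ON check_runs_2026_01 (monitor_id, started_at DESC);
  `);
}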

False positives are worse than false negatives

If your check fails because your network blipped, you fire an alert that wakes the customer at 3am, and you eat the trust permanently. The fix is multi-region checking: if a monitor fails from one region, retry from another before declaring an incident. Statuspage and Better Stack both do this. You must too.
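
A sketch of that rule in worker code, under the assumption that a runCheck(region) helper performs the actual HTTP request. The helper and type names here are illustrative, not from the prompts below.

type CheckResult = { success: boolean; latencyMs: number; error?: string };

// One region failing is a retry, two regions failing is a real failure.
async function checkWithFailover(
  regions: string[],
  runCheck: (region: string) => Promise<CheckResult>
): Promise<CheckResult & { region: string }> {
  const [primary, secondary] = regions;
  const first = await runCheck(primary);
  if (first.success || !secondary) return { ...first, region: primary };
  // Primary failed: confirm from a second vantage point before alerting anyone.
  const second = await runCheck(secondary);
  return { ...second, region: secondary };
}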

Incidents are a state machine, not a row

An incident moves through investigating → identified → monitoring → resolved. Each transition is a public update. Each update emails subscribers, posts to Slack, hits the RSS feed. Modeling this as a single “current_status” column on the incident row will paint you into a corner. You need an incident_updates table where each row is one published transition.
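
The forward-only rule is small enough to sketch directly (status names come from the schema; the function name is illustrative). Prompt 4 below asks Claude to enforce the same thing server-side.

// Forward-only transitions: skipping ahead is allowed, going backward is not.
const INCIDENT_STATUSES = ["investigating", "identified", "monitoring", "resolved"] as const;
type IncidentStatus = (typeof INCIDENT_STATUSES)[number];

function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return INCIDENT_STATUSES.indexOf(to) > INCIDENT_STATUSES.indexOf(from);
}

// Each accepted transition is appended as a new incident_updates row rather than
// overwriting a single current_status column.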

Step 1 — Timeseries-friendly data model

The schema for a status page SaaS is bigger than most because the timeseries side wants to live differently from the relational side. The core tables: workspaces, services (logical components on the page), monitors (the actual checks), check_runs (the timeseries), incidents, incident_updates, and subscribers.

Prompt 1 — Timeseries-friendly data model
I'm building a status page SaaS targeted at indie SaaS founders. The
core data is timeseries (every check result for every monitor) so the
schema needs to keep that hot path fast forever.

Design Postgres tables for:

- workspaces: id, slug, name, owner_id, plan, created_at
- services: id, workspace_id, name, slug, order_index
  (these are the rows that appear on the public status page)
- monitors: id, workspace_id, service_id, name, monitor_type
  (http_get, http_post, tcp, dns), config jsonb (url, expected_status,
  expected_body_substring, headers, timeout_ms), check_interval_seconds
  (60, 180, 300, 900), regions text[] (us-east, eu-west, asia)
- check_runs: id (uuid), monitor_id, region, started_at, latency_ms,
  status_code, success boolean, error_text, response_size
  THIS TABLE IS PARTITIONED BY started_at (monthly partitions)
- incidents: id, workspace_id, title, current_status (investigating,
  identified, monitoring, resolved), severity (minor, major, critical),
  affected_service_ids uuid[], started_at, resolved_at, public boolean
- incident_updates: id, incident_id, status, body markdown, posted_at,
  posted_by user_id
- subscribers: id, workspace_id, channel (email, sms, webhook, slack,
  discord), endpoint, confirmed boolean, created_at

For check_runs, write the partition setup:
- Parent table check_runs partitioned BY RANGE (started_at)
- A function to create the next month's partition (callable from cron)
- Indexes on (monitor_id, started_at DESC) per partition
- A function `purge_old_check_runs(retention_days)` that drops old
  partitions instead of DELETE (way faster)

Add Supabase RLS so:
- Workspace members can read all their workspace data
- Public anonymous reads are limited to: services + last 90 days of
  daily aggregated uptime + public incidents + public incident_updates
- Subscribers table is workspace-write-only (public can INSERT via RPC
  with email confirmation flow)

Output one SQL file ready to run.

The single most important detail in Claude’s output is the partitioning. Without it, you will be migrating to ClickHouse at month six and explaining to your customers why their dashboard is suddenly slow. With monthly partitions, the same schema absorbs hundreds of millions of check_runs rows before the design needs to change.
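
As a reference point, the two maintenance functions the prompt asks for could look roughly like this from the worker side, assuming node-postgres. Claude may just as reasonably hand you PL/pgSQL functions called from pg_cron; the names here are illustrative.

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Create next month's partition ahead of time, callable from a daily cron.
export async function ensureNextMonthPartition(): Promise<void> {
  const first = new Date();
  first.setUTCDate(1);
  first.setUTCMonth(first.getUTCMonth() + 1);   // first day of next month
  const next = new Date(first);
  next.setUTCMonth(next.getUTCMonth() + 1);     // first day of the month after
  const from = first.toISOString().slice(0, 10);
  const to = next.toISOString().slice(0, 10);
  const name = `check_runs_${from.slice(0, 7).replace("-", "_")}`;
  await pool.query(`
    CREATE TABLE IF NOT EXISTS ${name}
      PARTITION OF check_runs
      FOR VALUES FROM ('${from}') TO ('${to}')
  `);
}

// Retention by dropping whole partitions: no DELETE, no vacuum debt.
export async function purgeOldCheckRuns(retentionDays: number): Promise<void> {
  const cutoff = new Date(Date.now() - retentionDays * 86_400_000);
  const { rows } = await pool.query(`
    SELECT c.relname AS name
      FROM pg_inherits i
      JOIN pg_class c ON c.oid = i.inhrelid
      JOIN pg_class p ON p.oid = i.inhparent
     WHERE p.relname = 'check_runs'
  `);
  for (const { name } of rows) {
    const [y, m] = name.replace("check_runs_", "").split("_").map(Number);
    // Month is 1-indexed in the name, so Date.UTC(y, m, 1) is the partition's end.
    if (new Date(Date.UTC(y, m, 1)) < cutoff) {
      await pool.query(`DROP TABLE IF EXISTS ${name}`);
    }
  }
}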

Step 2 — The monitor scheduler and worker

This is the heart of the product. Every minute, your system needs to know which monitors are due, run their checks (in parallel, with retries from a second region on failure), write the results to check_runs, and trigger incident creation on configured failure conditions.

Two viable hosting patterns: Vercel Cron + Edge Functions, or a long-running worker on Railway/Fly.io. Vercel Cron has a 1-minute minimum and a 60-second timeout per invocation, which is fine for the dispatcher but tight for high check counts. A persistent worker on Railway gives you better latency control. For deployment tradeoffs, our Vercel vs Railway comparison covers this exact decision, and Fly.io vs Railway is worth reading if you want multi-region check origins.

Prompt 2 — Monitor scheduler and worker
Build the monitoring engine for my status page SaaS. Stack: Next.js on
Vercel for the dashboard + a long-running Node worker on Railway for the
check engine.

The worker needs:

1. A scheduler loop that ticks every 10 seconds:
   - SELECT monitors WHERE
     last_checked_at IS NULL OR
     last_checked_at < now() - (check_interval_seconds * interval '1 second')
   - LIMIT 200 per tick to avoid thundering herd
   - LOCK rows with FOR UPDATE SKIP LOCKED so multiple worker replicas
     don't double-run the same check

2. For each due monitor, dispatch an async check:
   - Pick the FIRST configured region; run the http(s) request with
     fetch + AbortController for the timeout
   - If success === false, retry from the SECOND configured region
     before declaring failure
   - Insert a row into check_runs with the result
   - UPDATE monitors SET last_checked_at = now()

3. Failure -> incident creation logic:
   - Track consecutive_failures per monitor in a small in-memory map
     (rebuild from DB on worker startup)
   - If consecutive_failures crosses the configured threshold (default 2)
     AND no open incident exists for this monitor's affected services,
     OPEN a new incident with status='investigating', severity inferred
     from monitor config
   - If a check succeeds and an open incident exists, mark current_status
     as 'monitoring' and post an auto-update

4. Health endpoints:
   - GET /health returns 200 if the loop ticked within the last 30 seconds
   - GET /metrics returns counts of checks-run-last-minute, error rate,
     queue depth (Prometheus format)

5. Handling Vercel Cron as a fallback:
   - A /api/cron/sweep route that runs every minute and dispatches any
     overdue checks if the worker is unreachable

Use undici for HTTP fetches (faster than node-fetch). Log structured
JSON to stdout. Use a single Postgres connection pool with max=20.

The non-obvious detail is FOR UPDATE SKIP LOCKED. Without it, two worker replicas will both pick up the same monitor at the same tick and you’ll double-write rows. With it, you can scale the worker horizontally with zero coordination code.
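
In worker code the claim query can look roughly like this, assuming node-postgres. This variant stamps last_checked_at at claim time to keep the example to a single statement, rather than after the check completes as the prompt specifies.

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Runs inside the 10-second scheduler tick. SKIP LOCKED makes concurrent worker
// replicas skip rows another replica has already claimed.
async function claimDueMonitors() {
  const { rows } = await pool.query(`
    UPDATE monitors
       SET last_checked_at = now()
     WHERE id IN (
       SELECT id
         FROM monitors
        WHERE last_checked_at IS NULL
           OR last_checked_at < now() - (check_interval_seconds * interval '1 second')
        ORDER BY last_checked_at ASC NULLS FIRST
        LIMIT 200
          FOR UPDATE SKIP LOCKED
     )
     RETURNING id, monitor_type, config, regions
  `);
  return rows;
}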

Step 3 — Uptime calculation with downtime exclusions

The 90-day uptime graph is the single most-looked-at element on a status page. It needs to be honest, fast, and account for scheduled maintenance windows that don’t count against your customer’s SLA.

Prompt 3 — Uptime calculation with maintenance exclusions
Write a Postgres function `calculate_uptime(monitor_id uuid,
window_start timestamptz, window_end timestamptz) returns jsonb`.

The function must:

1. Sum total seconds in the window
2. Sum seconds-of-downtime: for each contiguous run of failed check_runs,
   the downtime span is from the FIRST failure timestamp to the FIRST
   subsequent success timestamp
3. Subtract any overlap with scheduled maintenance windows from the
   downtime total. Maintenance windows live in a `maintenance_windows`
   table (workspace_id, monitor_ids, starts_at, ends_at)
4. Return:
   {
     total_seconds: int,
     downtime_seconds: int,
     uptime_pct: numeric (4 decimal places),
     incidents: int,
     longest_outage_seconds: int,
     daily_breakdown: [
       { date: 'YYYY-MM-DD', uptime_pct: numeric, status: 'operational'
         | 'degraded' | 'down' | 'no_data' }
     ]
   }

Daily status thresholds:
- operational: uptime >= 99.9%
- degraded: 95 <= uptime < 99.9
- down: uptime < 95
- no_data: 0 check_runs that day

Optimization rules:
- Use a GENERATE_SERIES on dates and LEFT JOIN, do not loop
- Index hint: rely on (monitor_id, started_at) on check_runs partitions
- The function must complete in under 200ms for a 90-day window with
  1-minute check intervals (~129,600 rows)

Also write a materialized view `monitor_daily_uptime` that pre-computes
the daily_breakdown for all monitors, refreshed every 15 minutes via
Vercel Cron. The status page should read from the view, not call the
function on each pageload.

The materialized view is what makes the public status page fast under load. Reading it is one indexed query; calling the function on every public pageview is how you melt your database when one of your customers goes viral on Hacker News during their own outage.
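
A sketch of the refresh path, assuming the view from the prompt exists and that Vercel Cron hits a Next.js route handler every 15 minutes. The route path, CRON_SECRET check, and view column names are illustrative.

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// app/api/cron/refresh-uptime/route.ts — CONCURRENTLY keeps public-page reads
// unblocked while the view rebuilds (it requires a unique index on the view).
export async function GET(req: Request) {
  if (req.headers.get("authorization") !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response("unauthorized", { status: 401 });
  }
  await pool.query("REFRESH MATERIALIZED VIEW CONCURRENTLY monitor_daily_uptime");
  return Response.json({ refreshed: true });
}

// The public page then runs one indexed read per monitor, e.g.:
//   SELECT day, uptime_pct, status FROM monitor_daily_uptime
//    WHERE monitor_id = $1 AND day >= current_date - 90 ORDER BY day;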

Step 4 — Incident state machine and communication

An incident is the moment your customer is using your product the most intensely. The UX has to be three things: fast (because they’re panicked), opinionated (because they don’t want to make decisions), and integrated (because their team is in Slack, not in your dashboard).

Prompt 4 — Incident communication and subscriber notification
Build the incident posting flow.

Backend:
1. POST /api/incidents creates a new incident with starting status
   'investigating' and an initial incident_updates row. Body schema:
   { title, severity, affected_service_ids, body_markdown, public }
2. POST /api/incidents/:id/updates appends a new incident_updates row
   with body and new status. Status transitions enforced server-side:
   investigating -> identified -> monitoring -> resolved (skipping
   forward is OK, going backward is not)
3. When status moves to 'resolved', set incidents.resolved_at = now()

Notification fan-out (run as a queue job, not in the request):
On each incident_updates insert where incidents.public = true:
- For each subscriber where channel='email': render an HTML email with
  Resend. Subject: "[<new status>] <incident title>"
  Body includes: status badge, body markdown rendered, link to status
  page, unsubscribe link
- For each subscriber where channel='webhook': POST JSON
  { incident_id, status, title, body_markdown, severity, posted_at,
    affected_services: [...] }
  with an HMAC-SHA256 signature header
- For each subscriber where channel='slack': POST to the stored Slack
  webhook URL with Block Kit formatting (header, status badge, body,
  affected services, link)
- For each subscriber where channel='discord': POST to the stored
  Discord webhook URL with an embed (title, color by severity, body)
- Public RSS/Atom feed rebuild at /status/[workspace-slug]/rss.xml

Slack/Discord formatting:
- Investigating = yellow circle emoji + yellow embed color
- Identified = orange triangle emoji + orange
- Monitoring = blue circle emoji + blue
- Resolved = green checkmark emoji + green

Add an undo grace period: an incident_updates row created less than 60
seconds ago can be deleted by the author (don't fan out yet, queue with
60s delay). After 60s, it's permanent and any correction must be a new
update.

The 60-second undo is what saves you the night you accidentally post “all systems down” instead of “all systems up.” A status page where every slip has to be walked back with a public corrective update is a status page that doesn’t survive its own first incident.
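
One way to sketch the grace period without committing to a queue library: delay the fan-out and re-read the row at send time, so a deleted update simply never goes out. setTimeout is a stand-in here (a durable delayed job survives redeploys; this doesn't), and fanOutToSubscribers is a hypothetical helper.

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const UNDO_WINDOW_MS = 60_000;

declare function fanOutToSubscribers(update: Record<string, unknown>): Promise<void>;

// Nothing is delivered until the undo window has passed; if the author deleted
// the update in the meantime, the re-read finds no row and nothing is sent.
function scheduleFanOut(updateId: string): void {
  setTimeout(async () => {
    const { rows } = await pool.query(
      "SELECT id, incident_id, status, body FROM incident_updates WHERE id = $1",
      [updateId]
    );
    if (rows.length === 0) return; // undone within the grace period
    await fanOutToSubscribers(rows[0]);
  }, UNDO_WINDOW_MS);
}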

Step 5 — Public status page rendering

The public page is what your customer’s customers see. It needs to be fast under load (especially during incidents), readable on mobile, and customizable enough that the customer feels it’s theirs without you having to build a full theming engine.

Prompt 5 — Public status page rendering
Build /status/[workspace-slug] using Next.js App Router.

Performance requirements:
- Page must render in under 200ms TTFB on Vercel
- Read ONLY from the monitor_daily_uptime materialized view + the
  incidents/incident_updates tables, never call calculate_uptime() live
- Set Cache-Control: public, s-maxage=30, stale-while-revalidate=300
- Add a streaming /api/realtime endpoint (Server-Sent Events) that
  pushes new incident_updates so the page updates without reload

Layout (mobile-first, single column under 768px):

1. Top: workspace logo, name, overall current status badge
   - Overall status = worst current incident severity, OR
     'All systems operational' if no open incidents

2. Active incidents section (only if any open):
   - Card per incident with title, current status, severity color,
     latest update body, "View details" link to incident detail page

3. Services list:
   - One row per service
   - Service name, current operational status, 90-day uptime bar
     (90 small vertical bars, one per day, colored by daily status)
   - Hovering a bar shows the date + uptime% in a tooltip
   - Click expands to show component monitors and their current status

4. Recent incidents section:
   - Last 14 days of resolved incidents
   - Each is a small card: title, severity, started/resolved times
   - Click goes to incident detail page

5. Footer:
   - Subscribe button (opens modal: email, slack webhook, RSS link)
   - "Powered by [your product]" with subtle attribution

Theming:
- Workspace settings stores: primary_color, logo_url, font (system,
  inter, ibm-plex-mono), favicon_url
- Generate CSS variables from these on the server, no client-side
  theming JS

Custom domain support:
- If workspace has custom_domain configured, status page must be
  reachable at status.customerdomain.com via Vercel domain alias
- Detect via Host header in middleware, look up workspace by domain,
  rewrite to /status/[slug]

One detail Claude won’t give you unprompted: the page must work without JavaScript. A lot of status-page visits happen from corporate networks that block scripts, and from emergency war rooms with old browsers. Server-render everything; the SSE realtime layer is enhancement, not requirement.
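
For the SSE layer itself, a minimal sketch of the route handler: App Router, polling incident_updates every few seconds. The real query would also filter to the workspace's public incidents, which is omitted here.

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// app/api/realtime/route.ts — streams new incident updates as they are posted.
// Progressive enhancement only: the server-rendered page is complete without it.
export async function GET(req: Request) {
  const encoder = new TextEncoder();
  let closed = false;
  req.signal.addEventListener("abort", () => { closed = true; });

  const stream = new ReadableStream({
    async start(controller) {
      let since = new Date();
      while (!closed) {
        const { rows } = await pool.query(
          `SELECT id, incident_id, status, body, posted_at
             FROM incident_updates
            WHERE posted_at > $1
            ORDER BY posted_at`,
          [since]
        );
        for (const row of rows) {
          controller.enqueue(encoder.encode(`data: ${JSON.stringify(row)}\n\n`));
          since = row.posted_at;
        }
        await new Promise((r) => setTimeout(r, 5000)); // poll every 5 seconds
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
}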

Pricing and monetization

Per-monitor tiered pricing is the dominant pattern in this category and it works:

  • Starter — $9/mo for 5 monitors, 5-minute frequency, email + Slack subscribers, single region.
  • Pro — $29/mo for 25 monitors, 1-minute frequency, all subscriber channels, multi-region, custom domain.
  • Business — $99/mo for 100 monitors, 30-second frequency, SSO, audit log, status page password protection.

Avoid free-forever plans with even 1 monitor. The cost to serve a free monitor (compute + storage) is real, and free users almost never convert in this category. A 14-day full-feature trial converts much better, and at the price points above the LTV justifies the trial cost.

Where solo founders win against Statuspage

You will not out-feature Statuspage or Better Stack on enterprise concerns — multi-tenancy guarantees, SOC 2, dedicated infrastructure. You also can’t out-engineer them on raw probe-network breadth. Solo founders win in three places:

  • Price floor. The $9/mo starter tier with monitoring INCLUDED is the wedge. Statuspage charges $29 for the page and you still need monitoring. Pricing yourself at one-third of the incumbent is real differentiation when 80% of the market is sub-$50/mo customers.
  • Opinionated for indie SaaS. Statuspage is designed for Atlassian-sized teams. Your product is designed for the founder + two engineers running a $30k MRR API. That means smarter defaults (1 monitor for the API, 1 for the dashboard, 1 for the marketing site), Slack-first incident UX, and embeddable status widgets for the customer’s docs.
  • Deeper Slack/Discord integration. Most status pages treat Slack as a one-way webhook. You can do better: Slack slash commands to post incident updates without leaving Slack, threaded follow-ups that sync to incident_updates, and Discord parity (which most incumbents skip entirely).

Each of these maps to a real AI SaaS idea that compounds with the core build. Pick the niche — API-first products, Shopify apps, indie SaaS — and ship the public status page they’ll embed before you build the workspace settings.

Status page SaaS, in one paragraph
Partition the timeseries. Schedule with SKIP LOCKED. Pre-compute uptime. Slack-first incidents.

A status page SaaS that owns the monitoring engine, partitions check_runs from day one, and ships a Slack-first incident workflow at one-third Statuspage’s price has a real shot at the indie SaaS market. Build the worker before the dashboard.
