SMS Deliverability Monitoring & Alerting: KPI Playbook with Dashboard Templates

Introduction: treat deliverability like uptime, not a vanity metric

Most teams look at SMS deliverability once a month as a single percentage.

“Looks good, we’re around 95%.”

Meanwhile:

  • One US carrier silently starts filtering a new promo flow.
  • A high-value OTP sequence begins failing at 2 a.m.
  • A burner pool gets tired and error codes quietly climb.

By the time anyone notices, you’ve:

  • Lost 5–6 figures in revenue from abandoned checkouts or deposits.
  • Damaged trust (“I never got the code, your app is broken.”).
  • Trained carriers to treat your brand as noisy or risky.

In our work triaging hundreds of deliverability incidents, the pattern is clear: teams that treat deliverability as a site reliability (SRE) problem recover fast. Teams that treat it as a weekly vanity metric get blindsided.

This guide shows you how to:

  • Pick the right KPIs (and ignore misleading ones).
  • Slice data by carrier, sender pool, route, and campaign.
  • Build a dashboard and alerting system that catches issues early.
  • Use monitoring to improve deliverability, not just report it.

Section 1: The core SMS deliverability KPIs that actually matter

You don’t need 40 metrics. You need a small set of KPIs that map directly to incidents and recovery.

1. Delivered rate (by carrier, pool, campaign)

Definition:

  • Delivered rate = messages with positive “delivered” receipts ÷ total send attempts

Best practice:

  • Always slice by:
    • Carrier (Verizon, AT&T, T-Mobile, international operators)
    • Sender pool / grid
    • Campaign / flow (OTP, promos, transactional)
    • Country / region

What “good” looks like (US A2P, properly configured):

  • Core transactional flows: 99%+
  • High-volume promos: 98–99%+
  • Anything consistently under 97–98% needs investigation.
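As a minimal sketch of the KPI above, the following computes delivered rate sliced by carrier from a list of message records. The field names (`carrier`, `status`) are assumptions about your delivery-receipt log, not any specific provider's API:

```python
from collections import defaultdict

def delivered_rate_by_carrier(messages):
    """Delivered rate = positive 'delivered' receipts / total send attempts,
    computed per carrier. Field names are illustrative assumptions."""
    totals = defaultdict(int)
    delivered = defaultdict(int)
    for msg in messages:
        totals[msg["carrier"]] += 1
        if msg["status"] == "delivered":
            delivered[msg["carrier"]] += 1
    return {carrier: delivered[carrier] / totals[carrier] for carrier in totals}
```

The same grouping key extends naturally to pool, campaign, and country; carrier is just the first slice you should never skip.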

2. Hard-fail / error rate

Definition:

  • Percentage of messages with definitive failure codes:
    • Invalid number
    • Unknown subscriber
    • Permanent carrier rejection

Why it matters:

  • Rising hard-fails often mean:
    • Poor list hygiene.
    • Carrier-level blocking of specific senders or content.
    • A tired or burned-out number pool.

Watch for:

  • Sudden jumps on a single carrier.
  • Specific routes or pools with >1–2% persistent hard-fail rate.
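A sketch of the hard-fail watch above, flagging (carrier, pool) slices whose hard-fail rate crosses the 1–2% line. The error-code strings and record fields are hypothetical placeholders for whatever your gateway actually returns:

```python
from collections import defaultdict

# Illustrative failure codes; map these to your provider's actual codes.
HARD_FAIL_CODES = {"invalid_number", "unknown_subscriber", "carrier_rejected"}

def hard_fail_alerts(messages, threshold=0.02):
    """Return (carrier, pool) slices whose hard-fail rate exceeds `threshold`."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for m in messages:
        key = (m["carrier"], m["pool"])
        totals[key] += 1
        if m.get("error_code") in HARD_FAIL_CODES:
            fails[key] += 1
    return {k: fails[k] / totals[k] for k in totals if fails[k] / totals[k] > threshold}
```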

3. Soft-fail / retry rate

Definition:

  • Temporary failures:
    • Network issues
    • Congestion
    • Rate limiting / throttling

Why it matters:

  • High soft-fails = you’re pushing carriers too hard or hitting congested routes.
  • Shows whether your retry strategy is working or just hammering.
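The difference between a working retry strategy and "just hammering" is backoff. A minimal sketch, assuming a hypothetical `TransientError` that your sending layer raises on soft failures:

```python
import random
import time

class TransientError(Exception):
    """Soft failure: congestion, throttling, temporary network issue."""

def retry_soft_fail(send_fn, max_attempts=4, base_delay=1.0):
    """Retry transient failures with exponential backoff plus jitter,
    instead of immediately re-hitting the same congested route."""
    for attempt in range(max_attempts):
        try:
            return send_fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface to the caller as a real failure
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

If your soft-fail rate stays high even with backoff in place, the problem is volume or route quality, not retry logic.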

4. Unknown / filtered / “fake delivered” indicators

Carriers don’t always give a “filtered” code. Some:

  • Return generic errors.
  • Claim “delivered” while devices get nothing (shadow filtering).

Proxies to monitor:

  • Drops in downstream behavior (clicks, logins) despite “OK” receipts.
  • Sampling tests: seed numbers on each carrier that you log separately.
  • Sudden performance drops on new campaigns while others stay stable.
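The first proxy can be automated: flag slices where receipts look fine but downstream behavior craters. This sketch assumes each slice record carries a current conversion rate and a trailing-7-day baseline (names are illustrative):

```python
def shadow_filter_suspects(slices, drop_factor=0.5):
    """Flag slices whose delivered rate looks healthy while conversions
    fell below `drop_factor` of baseline — a shadow-filtering signal."""
    suspects = []
    for s in slices:
        receipts_ok = s["delivered_rate"] >= 0.97
        converting = s["conversion_rate"] >= drop_factor * s["baseline_conversion_rate"]
        if receipts_ok and not converting:
            suspects.append(s["name"])
    return suspects
```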

5. Pool and grid health

If you use:

  • Burner Number Pools
  • Private Pool Grids
  • Or even simple dedicated numbers

…you should track, per pool/grid:

  • Delivered rate
  • Hard-fail rate
  • Complaint / opt-out rate
  • Daily messages per sender

Healthy patterns:

  • Steady performance over time.
  • No sender crossing:
    • >1% hard-fail in a 24-hour window.
    • >0.3–0.5% complaint / opt-out on promos.
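The thresholds above translate directly into a per-pool health check. A sketch, assuming a simple per-pool stats dict (`sent`, `hard_fails`, `complaints` over the window you care about):

```python
def pool_health(stats, hard_fail_max=0.01, complaint_max=0.005):
    """Classify each pool against the guideline thresholds: >1% hard-fails
    in the window, or >0.5% complaints/opt-outs, marks it unhealthy."""
    report = {}
    for pool, s in stats.items():
        hard_fail_rate = s["hard_fails"] / s["sent"]
        complaint_rate = s["complaints"] / s["sent"]
        healthy = hard_fail_rate <= hard_fail_max and complaint_rate <= complaint_max
        report[pool] = "healthy" if healthy else "unhealthy"
    return report
```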

Section 2: The “deliverability cube” and how to segment your data

A single global “delivery rate” hides everything.

You need a deliverability cube:

  • Carrier (Verizon, AT&T, T-Mobile, etc.)
  • Sender (pool, grid, individual number)
  • Route / product (gateway, region)
  • Campaign / flow (OTP, promos, transactional)
  • Content risk level (mainstream, high-risk, SHAFT)
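In code, the cube is just a multi-dimensional group-by over your DLR log. A sketch with assumed field names; rolling up to coarser slices (say, carrier × campaign only) is a matter of summing cells:

```python
from collections import defaultdict

# The five cube axes from the list above; field names are assumptions
# about your delivery-receipt log schema.
DIMENSIONS = ("carrier", "pool", "route", "campaign", "risk")

def deliverability_cube(messages):
    """Delivered rate for every fully-specified cell of the cube."""
    totals = defaultdict(int)
    delivered = defaultdict(int)
    for m in messages:
        key = tuple(m[d] for d in DIMENSIONS)
        totals[key] += 1
        delivered[key] += m["status"] == "delivered"
    return {k: delivered[k] / totals[k] for k in totals}
```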

Example slice that catches real issues

  1. Verizon × Promo × Grid A:

    • Delivered rate drops from 99.1% → 94.4% over 48 hours.
    • Hard-fails and soft-fails slightly up.
    • Other carriers are stable.
  2. Action:

    • Shift promos from Grid A to Grid B for Verizon.
    • Inspect recent content changes and velocity patterns.
    • Temporarily reduce volume to baseline + 20% while you test.

Without segmentation, you’d only see:

  • Global delivered: 97.8% → 96.9% (shrug).

With segmentation, you see:

  • One cell in the matrix is burning out while the others stay healthy.

Section 3: Alert thresholds and what to do when they fire

1. Carrier-specific delivered rate alerts

Recommended thresholds (adjust per baseline):

  • Alert if delivered rate on any major carrier:
    • Drops >2 points from 7‑day median.
    • Or falls below 97% for longer than 30–60 minutes on active traffic.
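Both trigger conditions fit in a few lines. A sketch of the alert predicate; the sustained-duration and minimum-volume checks are left to the caller, since those depend on your alerting stack:

```python
import statistics

def carrier_alert(recent_rate, last_7d_rates, floor=0.97, drop_pts=0.02):
    """True when a carrier's current delivered rate falls more than 2 points
    below its 7-day median, or drops under the 97% floor."""
    median = statistics.median(last_7d_rates)
    return recent_rate < floor or (median - recent_rate) > drop_pts
```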

Runbook:

  1. Confirm it’s not a data glitch (dashboards, raw logs).
  2. Check:
    • Recent deploys (content changes, routing changes).
    • New campaign launches.
    • Volume spikes.
  3. Mitigate:
    • Temporarily reduce sending velocity on that carrier.
    • Switch to alternative pool / grid if available.
    • Pause new risky campaigns for that carrier.

2. Pool / grid health alerts

Alert when:

  • Any pool or grid’s hard-fail rate exceeds 1–2% for >1 hour on meaningful volume.
  • Complaint / opt-out rates exceed 0.3–0.5% on promos.

Runbook:

  1. Stop sending new campaigns on that pool / grid.
  2. Shift some traffic to healthier pools.
  3. Investigate:
    • Did you mix higher-risk content onto a formerly clean pool?
    • Did carrier policies change (e.g., new rule on SHAFT keywords)?

3. Shadow filtering & “fake delivery” alerts

Because you won’t always see clear error codes:

  • Compare:
    • Delivered messages → expected conversions (clicks, logins, OTP uses).
  • Alert when:
    • Deliverability stays “good” but downstream conversion falls sharply for one carrier or campaign.

This is where:

  • Seed numbers per carrier are invaluable.
  • Periodic live tests (manual + automated) catch reality vs receipts.

Section 4: Designing the SMS deliverability dashboard

Your dashboard doesn’t have to be fancy. It has to be useful under pressure.

Layout 1: Executive overview

Top-level tiles:

  • Global delivered rate (last 24h, 7d)
  • Per-carrier delivered rate (Verizon, AT&T, T-Mobile, top 3–5 internationals)
  • % messages by:
    • Transactional vs marketing
    • Mainstream vs high-risk

Trends:

  • Line charts:
    • Delivered rate by carrier over time.
    • Volume by carrier.

Use this to answer: “Are we on fire, yes or no?”

Layout 2: Ops / SRE view

Tables and charts by:

  • Carrier × Pool × Campaign
  • Pool health metrics (delivered, hard-fail, soft-fail, complaints)

Examples:

  • Heatmap: delivered rate by carrier (columns) and pool/grid (rows).
  • Table with sorting:
    • “Show pools with highest hard-fail rate today.”

Use this when an alert fires.
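The heatmap doesn't need a charting library to be useful under pressure. A plain-text sketch of the carrier (columns) × pool/grid (rows) layout; a real dashboard would color the cells, but the shape is the same:

```python
def heatmap_table(cell_rates, carriers, pools):
    """Render delivered rates as a carrier-by-pool text table.
    `cell_rates` maps (carrier, pool) -> rate (assumed input shape)."""
    header = "pool".ljust(10) + "".join(c.ljust(10) for c in carriers)
    rows = [header]
    for p in pools:
        cells = "".join(
            f"{cell_rates.get((c, p), float('nan')):.1%}".ljust(10)
            for c in carriers
        )
        rows.append(p.ljust(10) + cells)
    return "\n".join(rows)
```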

Layout 3: Analytics / marketing view

Focus on:

  • Campaign performance:
    • Delivered rate vs CTR vs conversion.
  • A/B tests:
    • Content variants vs deliverability.

This view bridges deliverability and revenue, making it easier to justify infra decisions.


Section 5: Diagnosing common issues with your metrics

Scenario 1: One carrier tanks, others are stable

Likely causes:

  • Carrier‑specific filtering on:
    • Content pattern.
    • URL domain.
    • Sender pool reputation.

What to check:

  • Any recent content or template changes?
  • New URLs being used? (e.g., changed link shortener)
  • Volume ramp: did you spike too fast on that carrier?

Scenario 2: All carriers degrade at once

Likely causes:

  • Global content change (e.g., more aggressive promos).
  • Aggressive volume ramp across the board.
  • Platform-level change (routing, pool logic).

What to check:

  • Last few deployments.
  • New high-risk campaigns.
  • Whether controls (burner logic, per-carrier caps) are actually enforced.

Scenario 3: Metrics look fine, but support inbox fills with “I didn’t get it”

Likely causes:

  • Device-level filtering (spam folders).
  • Shadow filtering at carrier level with misleading receipts.
  • Regional pockets affected (e.g., specific area codes).

What to check:

  • Seed device tests on each carrier.
  • Region / area-code breakdowns.
  • Presence of sensitive keywords or patterns.

Section 6: How deliverability monitoring changes your infrastructure choices

Once you see:

  • Which pools degrade fastest
  • Which carriers are most sensitive
  • How content and volume affect outcomes

…it becomes obvious why infrastructure matters.

Teams that move to:

  • Private Pool Grids (100+ multi‑carrier SIMs per grid)
  • Carrier‑matching algorithms (Verizon→Verizon, AT&T→AT&T)
  • Burner Number Pools with automated retirement

…can use their dashboards to:

  • Proactively rotate and cool down senders.
  • A/B test routing strategies, not just content.
  • Create per‑carrier playbooks instead of generic fixes.

We regularly see:

  • 40–60% fewer incidents after deploying proper monitoring and grid‑based routing.
  • Faster RCA (root cause analysis) because logs and metrics line up.
  • Better risk conversations with compliance and legal (“here’s exactly how we’re controlling abuse and monitoring complaints”).

FAQ: SMS deliverability metrics & dashboards

1. What’s a “good” global delivery rate?

For a healthy, well‑architected program:

  • Transactional flows: 99%+
  • High-volume marketing: 98–99%

Anything under 97–98% on core flows is a red flag.

2. How often should we check deliverability?

  • Dashboards: daily (or more during launches).
  • Alerts: real-time for significant drops.
  • Deep reviews: weekly or monthly with trend analysis.

3. Do I really need per‑carrier data?

Yes. Most serious incidents are carrier-specific. Without per‑carrier slices, you’re flying blind.

4. What about small senders? Is this overkill?

If you:

  • Send low volume.
  • Operate in low‑risk verticals.
  • Don’t drive mission‑critical revenue via SMS.

…you can get away with simpler monitoring. But the moment SMS is core revenue, you’ll wish you had this in place.

5. How do I start if my current provider doesn’t expose good metrics?

Options:

  • Pull CDRs / logs and build your own aggregation.
  • Use webhooks to log DLRs into your data warehouse.
  • Consider a gateway that exposes carrier‑level data by design.
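For the webhook route, the core job is normalizing each incoming DLR into a consistent warehouse row. A sketch; the inbound field names (`msg_id`, `carrier`, `status`, `error`) are assumptions to be mapped onto whatever your provider actually sends:

```python
import json
from datetime import datetime, timezone

def normalize_dlr(raw_body):
    """Turn a raw DLR webhook body into the row shape the slicing
    examples above expect. Inbound keys are hypothetical."""
    payload = json.loads(raw_body)
    return {
        "message_id": payload["msg_id"],
        "carrier": payload.get("carrier", "unknown"),
        "status": payload["status"],
        "error_code": payload.get("error"),
        "received_at": datetime.now(timezone.utc).isoformat(),
    }
```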

6. How does this relate to A2P 10DLC registration?

10DLC compliance affects:

  • Allowed volume.
  • Scrutiny level.
  • Penalties for abuse.

Monitoring delivers the feedback loop that tells you if:

  • Your campaigns are behaving within carrier expectations.
  • You’re about to trip a threshold.

7. Can monitoring fix bad content or consent?

No. It can only tell you:

  • How bad things are.
  • Where they’re bad.

You still need clean opt-in, clear messaging, and respect for local law.

8. How do I detect device-level spam filtering?

  • Seed devices across carriers and platforms (iOS/Android).
  • Correlate “delivered” receipts with real device receipts and behavior.

9. Where does privacy fit into all this?

A privacy‑first gateway should:

  • Minimize stored PII.
  • Offer clear data retention controls.
  • Still provide aggregated metrics without leaking sensitive content.

10. Do I need a dedicated deliverability engineer?

Not necessarily. But you do need:

  • Clear ownership (someone accountable).
  • Runbooks and dashboards that non‑experts can follow in an incident.

Conclusion: make deliverability observable before it becomes expensive

You can’t fix what you can’t see.

A basic deliverability dashboard and alerting setup can:

  • Catch carrier‑specific issues before they explode.
  • Prove the ROI of better infrastructure (carrier matching, private grids).
  • Turn SMS from a black box into an operationally managed system.

If SMS is tied to revenue, treat it like an SRE problem:

  • Instrument it.
  • Alert on it.
  • Build runbooks around it.

Once you have that in place, you’re in a perfect position to evaluate whether a private, carrier-matching gateway is worth it, because you’ll have hard data showing where your current provider is leaving money on the table.

Dach SMS Lab