SMS Deliverability Monitoring & Alerting: KPI Playbook with Dashboard Templates
Introduction: treat deliverability like uptime, not a vanity metric
Most teams look at SMS deliverability once a month as a single percentage.
“Looks good, we’re around 95%.”
Meanwhile:
- One US carrier silently starts filtering a new promo flow.
- A high-value OTP sequence begins failing at 2 a.m.
- A burner pool gets tired and error codes quietly climb.
By the time anyone notices, you’ve:
- Lost 5–6 figures in revenue from abandoned checkouts or deposits.
- Damaged trust (“I never got the code, your app is broken.”).
- Trained carriers to treat your brand as noisy or risky.
In our work triaging hundreds of deliverability incidents, the pattern is clear: teams that treat deliverability like site reliability (SRE) recover fast. Teams that treat it like a weekly vanity metric get blindsided.
This guide shows you how to:
- Pick the right KPIs (and ignore misleading ones).
- Slice data by carrier, sender pool, route, and campaign.
- Build a dashboard and alerting system that catches issues early.
- Use monitoring to improve deliverability, not just report it.
Section 1: The core SMS deliverability KPIs that actually matter
You don’t need 40 metrics. You need a small set of KPIs that map directly to incidents and recovery.
1. Delivered rate (by carrier, pool, campaign)
Definition:
- Delivered rate = messages with positive “delivered” receipts ÷ total send attempts
Best practice:
- Always slice by:
- Carrier (Verizon, AT&T, T-Mobile, international operators)
- Sender pool / grid
- Campaign / flow (OTP, promos, transactional)
- Country / region
What “good” looks like (US A2P, properly configured):
- Core transactional flows: 99%+
- High-volume promos: 98–99%+
- Anything consistently under 97–98% needs investigation.
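The delivered-rate KPI above is just a grouped ratio. A minimal sketch of computing it per slice from raw message records (the field names `carrier`, `pool`, `campaign`, and `status` are illustrative, not any specific provider's schema):

```python
from collections import defaultdict

def delivered_rates(messages, keys=("carrier", "pool", "campaign")):
    """Delivered rate per slice: delivered receipts / total send attempts.

    `messages` is a list of dicts carrying the slice keys plus a
    `status` field ("delivered", "hard_fail", "soft_fail", ...).
    """
    attempts = defaultdict(int)
    delivered = defaultdict(int)
    for msg in messages:
        slice_key = tuple(msg[k] for k in keys)
        attempts[slice_key] += 1
        if msg["status"] == "delivered":
            delivered[slice_key] += 1
    return {k: delivered[k] / attempts[k] for k in attempts}

msgs = [
    {"carrier": "verizon", "pool": "A", "campaign": "otp", "status": "delivered"},
    {"carrier": "verizon", "pool": "A", "campaign": "otp", "status": "hard_fail"},
    {"carrier": "att", "pool": "B", "campaign": "promo", "status": "delivered"},
]
rates = delivered_rates(msgs)
```

Because the slice keys are a parameter, the same function gives you the carrier, pool, campaign, or country cut by changing `keys`.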
2. Hard-fail / error rate
Definition:
- Percentage of messages with definitive failure codes:
- Invalid number
- Unknown subscriber
- Permanent carrier rejection
Why it matters:
- Rising hard-fails often mean:
- Poor list hygiene.
- Carrier-level blocking of specific senders or content.
- A tired or burned-out number pool.
Watch for:
- Sudden jumps on a single carrier.
- Specific routes or pools with >1–2% persistent hard-fail rate.
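Separating hard fails from soft fails is a classification problem over error codes. A sketch with hypothetical code names (real codes vary by gateway and carrier, so you'd map your provider's codes into these buckets):

```python
# Hypothetical buckets; populate from your gateway's actual error codes.
HARD_FAIL_CODES = {"invalid_number", "unknown_subscriber", "carrier_rejected"}
SOFT_FAIL_CODES = {"network_error", "congestion", "throttled"}

def classify_failure(code):
    """Map a raw error code to hard / soft / unknown."""
    if code in HARD_FAIL_CODES:
        return "hard"
    if code in SOFT_FAIL_CODES:
        return "soft"
    return "unknown"

def hard_fail_rate(results):
    """results: list of error codes, with None meaning delivered."""
    if not results:
        return 0.0
    fails = sum(1 for c in results
                if c is not None and classify_failure(c) == "hard")
    return fails / len(results)
```

Anything that lands in "unknown" deserves its own counter: a growing unknown bucket is often the first hint of filtering (see the "fake delivered" section below).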
3. Soft-fail / retry rate
Definition:
- Temporary failures:
- Network issues
- Congestion
- Rate limiting / throttling
Why it matters:
- High soft-fails = you’re pushing carriers too hard or hitting congested routes.
- Shows whether your retry strategy is working or just hammering.
4. Unknown / filtered / “fake delivered” indicators
Carriers don’t always give a “filtered” code. Some:
- Return generic errors.
- Claim “delivered” while devices get nothing (shadow filtering).
Proxies to monitor:
- Drops in downstream behavior (clicks, logins) despite “OK” receipts.
- Sampling tests: seed numbers on each carrier that you log separately.
- Sudden performance drops on new campaigns while others stay stable.
5. Pool and grid health
If you use:
- Burner Number Pools
- Private Pool Grids
- Or even simple dedicated numbers
…you should track, per pool/grid:
- Delivered rate
- Hard-fail rate
- Complaint / opt-out rate
- Daily messages per sender
Healthy patterns:
- Steady performance over time.
- No sender crossing:
- >1% hard-fail in a 24-hour window.
- >0.3–0.5% complaint / opt-out on promos.
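The per-sender thresholds above translate directly into a health check you can run hourly. A sketch using the guide's numbers (1% hard-fail, 0.5% complaints over a 24-hour window; the `stats` shape is illustrative):

```python
def flag_unhealthy_senders(stats, hard_fail_max=0.01, complaint_max=0.005):
    """Flag senders crossing the 24h health thresholds.

    stats: {sender_id: {"sent": int, "hard_fails": int, "complaints": int}}
    Returns a list of (sender_id, reason) tuples for triage.
    """
    flagged = []
    for sender, s in stats.items():
        if s["sent"] == 0:
            continue  # no meaningful volume, nothing to judge
        if s["hard_fails"] / s["sent"] > hard_fail_max:
            flagged.append((sender, "hard_fail"))
        elif s["complaints"] / s["sent"] > complaint_max:
            flagged.append((sender, "complaints"))
    return flagged
```

Flagged senders are candidates for cooldown or retirement before the carrier makes that decision for you.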
Section 2: The “deliverability cube,” or how to segment your data
A single global “delivery rate” hides everything.
You need a deliverability cube:
- Carrier (Verizon, AT&T, T-Mobile, etc.)
- Sender (pool, grid, individual number)
- Route / product (gateway, region)
- Campaign / flow (OTP, promos, transactional)
- Content risk level (mainstream, high-risk, SHAFT)
Example slice that catches real issues
Verizon × Promo × Grid A:
- Delivered rate drops from 99.1% → 94.4% over 48 hours.
- Hard-fails and soft-fails are slightly up.
- Other carriers are stable.
Action:
- Shift promos from Grid A to Grid B for Verizon.
- Inspect recent content changes and velocity patterns.
- Temporarily reduce volume to baseline + 20% while you test.
Without segmentation, you’d only see:
- Global delivered: 97.8% → 96.9% (shrug).
With segmentation, you see:
- One cell in the matrix is burning out while others are healthy.
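Finding the burning cell is a simple comparison of each cube slice against its own baseline. A sketch, assuming you keep per-cell delivered rates for the current window and a baseline window (cell keys here are illustrative tuples):

```python
def worst_cells(current, baseline, min_drop=0.02):
    """Cube cells whose delivered rate dropped vs their own baseline.

    current/baseline: {(carrier, campaign, pool): delivered_rate}
    Returns (cell, drop) pairs, largest drop first, ignoring cells
    with no baseline or a drop under `min_drop` (2 points here).
    """
    drops = []
    for cell, rate in current.items():
        base = baseline.get(cell)
        if base is not None and base - rate >= min_drop:
            drops.append((cell, base - rate))
    return sorted(drops, key=lambda d: -d[1])

cells = worst_cells(
    {("verizon", "promo", "A"): 0.944, ("att", "promo", "B"): 0.990},
    {("verizon", "promo", "A"): 0.991, ("att", "promo", "B"): 0.991},
)
```

Run against the example above, only the Verizon × Promo × Grid A cell surfaces; the global number would have hidden it.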
Section 3: Alert thresholds and what to do when they fire
1. Carrier-specific delivered rate alerts
Recommended thresholds (adjust per baseline):
- Alert if delivered rate on any major carrier:
- Drops >2 points from 7‑day median.
- Or falls below 97% for longer than 30–60 minutes on active traffic.
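Both trigger conditions fit in a few lines. A sketch of the check, with rates as fractions and the thresholds above as tunable defaults:

```python
from statistics import median

def carrier_alert(recent_rate, last_7d_rates, floor=0.97, max_drop=0.02):
    """Fire when a carrier's delivered rate drops >2 points below its
    7-day median, or falls under the 97% floor on active traffic.
    Returns the alert reason, or None when healthy.
    """
    baseline = median(last_7d_rates)
    if baseline - recent_rate > max_drop:
        return "drop_vs_median"
    if recent_rate < floor:
        return "below_floor"
    return None
```

Run it per carrier; the time-window gating (30–60 minutes of sustained breach) belongs in your alerting layer, not in this check.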
Runbook:
- Confirm it’s not a data glitch (dashboards, raw logs).
- Check:
- Recent deploys (content changes, routing changes).
- New campaign launches.
- Volume spikes.
- Mitigate:
- Temporarily reduce sending velocity on that carrier.
- Switch to alternative pool / grid if available.
- Pause new risky campaigns for that carrier.
2. Pool / grid health alerts
Alert when:
- Any pool or grid’s hard-fail rate exceeds 1–2% for >1 hour on meaningful volume.
- Complaint / opt-out rates exceed 0.3–0.5% on promos.
Runbook:
- Stop sending new campaigns on that pool / grid.
- Shift some traffic to healthier pools.
- Investigate:
- Did you mix higher-risk content onto a formerly clean pool?
- Did carrier policies change (e.g., new rule on SHAFT keywords)?
3. Shadow filtering & “fake delivery” alerts
Because you won’t always see clear error codes:
- Compare:
- Delivered messages → expected conversions (clicks, logins, OTP uses).
- Alert when:
- Deliverability stays “good” but downstream conversion falls sharply for one carrier or campaign.
This is where:
- Seed numbers per carrier are invaluable.
- Periodic live tests (manual + automated) catch reality vs receipts.
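The receipts-vs-conversions comparison can be automated per carrier or campaign. A sketch, where the 30% relative conversion drop is an illustrative starting threshold, not a carrier-published value:

```python
def shadow_filter_suspect(delivered_rate, conversion_rate,
                          baseline_conversion, min_delivered=0.97,
                          max_conv_drop=0.30):
    """Flag the 'fake delivered' pattern: receipts look healthy while
    downstream conversion (clicks, logins, OTP uses) cratered.

    Only fires when delivered rate is still above `min_delivered`;
    otherwise you have an ordinary delivery problem, not shadow filtering.
    """
    if baseline_conversion == 0:
        return False
    relative_drop = 1 - conversion_rate / baseline_conversion
    return delivered_rate >= min_delivered and relative_drop > max_conv_drop
```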
Section 4: Designing the SMS deliverability dashboard
Your dashboard doesn’t have to be fancy. It has to be useful under pressure.
Layout 1: Executive overview
Top-level tiles:
- Global delivered rate (last 24h, 7d)
- Per-carrier delivered rate (Verizon, AT&T, T-Mobile, top 3–5 internationals)
- % messages by:
- Transactional vs marketing
- Mainstream vs high-risk
Trends:
- Line charts:
- Delivered rate by carrier over time.
- Volume by carrier.
Use this to answer: “Are we on fire, yes or no?”
Layout 2: Ops / SRE view
Tables and charts by:
- Carrier × Pool × Campaign
- Pool health metrics (delivered, hard-fail, soft-fail, complaints)
Examples:
- Heatmap: delivered rate by carrier (columns) and pool/grid (rows).
- Table with sorting:
- “Show pools with highest hard-fail rate today.”
Use this when an alert fires.
Layout 3: Analytics / marketing view
Focus on:
- Campaign performance:
- Delivered rate vs CTR vs conversion.
- A/B tests:
- Content variants vs deliverability.
This view bridges deliverability and revenue, making it easier to justify infra decisions.
Section 5: Diagnosing common issues with your metrics
Scenario 1: One carrier tanks, others are stable
Likely causes:
- Carrier‑specific filtering on:
- Content pattern.
- URL domain.
- Sender pool reputation.
What to check:
- Any recent content or template changes?
- New URLs being used? (e.g., changed link shortener)
- Volume ramp: did you spike too fast on that carrier?
Scenario 2: All carriers degrade at once
Likely causes:
- Global content change (e.g., more aggressive promos).
- Aggressive volume ramp across the board.
- Platform-level change (routing, pool logic).
What to check:
- Last few deployments.
- New high-risk campaigns.
- Whether controls (burner logic, per-carrier caps) are actually enforced.
Scenario 3: Metrics look fine, but support inbox fills with “I didn’t get it”
Likely causes:
- Device-level filtering (spam folders).
- Shadow filtering at carrier level with misleading receipts.
- Regional pockets affected (e.g., specific area codes).
What to check:
- Seed device tests on each carrier.
- Region / area-code breakdowns.
- Presence of sensitive keywords or patterns.
Section 6: How deliverability monitoring changes your infrastructure choices
Once you see:
- Which pools degrade fastest
- Which carriers are most sensitive
- How content and volume affect outcomes
…it becomes obvious why infrastructure matters.
Teams that move to:
- Private Pool Grids (100+ multi‑carrier SIMs per grid)
- Carrier‑matching algorithms (Verizon→Verizon, AT&T→AT&T)
- Burner Number Pools with automated retirement
…can use their dashboards to:
- Proactively rotate and cool down senders.
- A/B test routing strategies, not just content.
- Create per‑carrier playbooks instead of generic fixes.
We regularly see:
- 40–60% fewer incidents after deploying proper monitoring and grid‑based routing.
- Faster RCA (root cause analysis) because logs and metrics line up.
- Better risk conversations with compliance and legal (“here’s exactly how we’re controlling abuse and monitoring complaints”).
FAQ: SMS deliverability metrics & dashboards
1. What’s a “good” global delivery rate?
For a healthy, well‑architected program:
- Transactional flows: 99%+
- High-volume marketing: 98–99%
Anything under 97–98% on core flows is a red flag.
2. How often should we check deliverability?
- Dashboards: daily (or more during launches).
- Alerts: real-time for significant drops.
- Deep reviews: weekly or monthly with trend analysis.
3. Do I really need per‑carrier data?
Yes. Most serious incidents are carrier-specific. Without per‑carrier slices, you’re flying blind.
4. What about small senders? Is this overkill?
If you:
- Send low volume.
- Operate in low‑risk verticals.
- Don’t drive mission‑critical revenue via SMS.
…you can get away with simpler monitoring. But the moment SMS is core revenue, you’ll wish you had this in place.
5. How do I start if my current provider doesn’t expose good metrics?
Options:
- Pull CDRs / logs and build your own aggregation.
- Use webhooks to log DLRs into your data warehouse.
- Consider a gateway that exposes carrier‑level data by design.
6. How does this relate to A2P 10DLC registration?
10DLC compliance affects:
- Allowed volume.
- Scrutiny level.
- Penalties for abuse.
Monitoring delivers the feedback loop that tells you if:
- Your campaigns are behaving within carrier expectations.
- You’re about to trip a threshold.
7. Can monitoring fix bad content or consent?
No. It can only tell you:
- How bad things are.
- Where they’re bad.
You still need clean opt-in, clear messaging, and respect for local law.
8. How do I detect device-level spam filtering?
- Seed devices across carriers and platforms (iOS/Android).
- Correlate “delivered” receipts with real device receipts and behavior.
9. Where does privacy fit into all this?
A privacy‑first gateway should:
- Minimize stored PII.
- Offer clear data retention controls.
- Still provide aggregated metrics without leaking sensitive content.
10. Do I need a dedicated deliverability engineer?
Not necessarily. But you do need:
- Clear ownership (someone accountable).
- Runbooks and dashboards that non‑experts can follow in an incident.
Conclusion: make deliverability observable before it becomes expensive
You can’t fix what you can’t see.
A basic deliverability dashboard and alerting setup can:
- Catch carrier‑specific issues before they explode.
- Prove the ROI of better infrastructure (carrier matching, private grids).
- Turn SMS from a black box into an operationally managed system.
If SMS is tied to revenue, treat it like an SRE problem:
- Instrument it.
- Alert on it.
- Build runbooks around it.
Once you have that in place, you’re in a perfect position to evaluate whether a private, carrier-matching gateway is worth it, because you’ll have hard data showing where your current provider is leaving money on the table.
Dach SMS Lab