Developer Deep-Dive: Carrier-Matching SMS Routing Architecture & Algorithms
Introduction: behind the scenes of carrier-matching SMS routing
Most SMS providers describe routing as a black box:
“We’ll use the best route based on quality and price.”
If you’re an engineer or architect responsible for uptime, that’s not enough. You need to know:
- Which path a message took.
- Which sender it used.
- Why the system chose that combination.
- How it will behave under failures and spikes.
Carrier-matching routing is one of the key reasons some gateways consistently hit 99.4%+ deliverability in tough verticals while others stagnate in the low‑ to mid‑90s. In this post we’ll deep‑dive the architecture:
- The intelligence layer (carrier and line‑type detection).
- The routing decision engine.
- Pool/grid selection and rotation logic.
- Fallback strategies.
- Observability and debugging.
This isn’t vendor marketing. It’s the practical architecture we’ve seen work across millions of messages per day.
Section 1: What “carrier matching” actually means
At a high level, carrier matching is:
For each destination number, choose a sender and route that best match the destination’s carrier and context.
Instead of:
- Sending everything via cheapest generic routes.
- Mixing all carriers and use-cases on the same senders.
Carrier-matching aims to:
- Use Verizon-vetted senders for Verizon subscribers.
- Use AT&T-vetted senders for AT&T subscribers.
- Keep per-carrier reputation isolated and predictable.
Benefits we see in practice:
- 3–12 percentage-point deliverability uplift on specific carriers compared to generic routing.
- Lower variance in performance over time.
- Cleaner root-cause analysis when issues arise (one carrier, one grid).
Section 2: High-level architecture
A carrier-matching SMS gateway typically has these components:
- Ingress API
  - Receives message requests (/messages).
  - Validates payload, auth, and basic schema.
- Normalization & enrichment
  - Normalizes phone numbers (E.164).
  - Enriches with:
    - Carrier info.
    - Country/region.
    - Line type (mobile, VoIP, landline when available).
    - Risk signals.
- Routing decision engine
  - Given the enriched context + app metadata:
    - Chooses route profile (e.g., OTP_US, Promo_EU).
    - Selects a pool/grid.
    - Picks a sender within that grid.
    - Applies per-carrier and per-grid rules.
- Queueing & dispatch
  - Places messages into per-route queues.
  - Applies:
    - Rate limiting.
    - Burst control.
    - Retry strategies.
- Delivery receipts & feedback
  - Ingests DLRs (delivery receipts).
  - Updates:
    - Pool/grid health.
    - Sender reputation metrics.
  - Feeds back into routing decisions.
- Observability plane
  - Metrics, logs, traces.
  - Queryable by:
    - Carrier.
    - Pool/grid.
    - Sender.
    - Campaign.
Section 3: The carrier intelligence layer
Before you can match carriers, you need to know them.
Inputs
- Phone number in E.164 format.
- Optionally:
  - Country code from app context.
  - Known user metadata (e.g., previously resolved carrier).
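Since everything downstream keys on the E.164 form, it helps to reject malformed numbers before any lookup is attempted. The following is a minimal sketch, not a full parser: normalizeE164 is a hypothetical helper name, accepting the "00" international prefix is our assumption, and converting national formats would require a numbering-plan library.

```typescript
// Hypothetical helper: strips common formatting characters and checks the
// E.164 shape ("+" followed by 2 to 15 digits, first digit non-zero).
// It does not convert national formats into international ones.
function normalizeE164(raw: string): string | null {
  const compact = raw.trim().replace(/[\s\-().]/g, "");
  // Assumption: treat a leading "00" as the international prefix.
  const candidate = compact.startsWith("00") ? "+" + compact.slice(2) : compact;
  return /^\+[1-9]\d{1,14}$/.test(candidate) ? candidate : null;
}
```

Returning null instead of throwing lets the ingress layer decide whether to reject the request or route it to a manual-review path.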
Sources
- HLR / carrier lookup providers.
- Phone number intelligence APIs.
- Internal caches (recently resolved numbers).
Outputs
For a given destination:
- carrier_id: e.g., verizon_us, att_us, tmobile_us, o2_uk, etc.
- country_code: US, GB, DE, etc.
- line_type: mobile, fixed, voip (when available).
- risk_flags: ported recently, suspicious ranges, etc. (optional).
Caching strategy
- Warm caches on:
  - High-traffic destinations.
  - Known frequent senders (e.g., OTP-heavy users).
- Respect:
  - Lookup provider rate limits.
  - Data freshness constraints.
Example (pseudo-code):
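type CarrierInfo = {
  carrierId: string;
  country: string;
  lineType?: string;
  lastUpdated: number; // epoch milliseconds of the last successful lookup
};

// carrierCache, externalLookup, and CACHE_TTL_MS are assumed to be provided
// elsewhere (cache client, lookup provider client, configuration).
async function resolveCarrier(msisdn: string): Promise<CarrierInfo> {
  // Serve from cache while the entry is still fresh.
  const cached = await carrierCache.get(msisdn);
  if (cached && Date.now() - cached.lastUpdated < CACHE_TTL_MS) {
    return cached;
  }

  // Cache miss or stale entry: ask the lookup provider.
  const lookup = await externalLookup(msisdn);
  const info: CarrierInfo = {
    carrierId: lookup.carrierId,
    country: lookup.countryCode,
    lineType: lookup.lineType,
    lastUpdated: Date.now(),
  };
  // Await the write so a cache failure surfaces here instead of being dropped.
  await carrierCache.set(msisdn, info);
  return info;
}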
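One way to respect lookup provider rate limits during bursts is to coalesce concurrent lookups for the same number, so that a wave of messages to one destination triggers a single upstream call. The sketch below is illustrative only: coalesceLookups is a hypothetical wrapper, and it would sit around whatever lookup function is actually in use.

```typescript
type CarrierInfo = {
  carrierId: string;
  country: string;
  lineType?: string;
  lastUpdated: number;
};

type Lookup = (msisdn: string) => Promise<CarrierInfo>;

// Hypothetical wrapper: while a lookup for a number is in flight, later callers
// for the same number share its promise instead of starting a new request.
function coalesceLookups(lookup: Lookup): Lookup {
  const inFlight = new Map<string, Promise<CarrierInfo>>();
  return (msisdn) => {
    const pending = inFlight.get(msisdn);
    if (pending) return pending;

    const request = lookup(msisdn);
    inFlight.set(msisdn, request);
    // Forget the entry once the request settles, whether it succeeded or failed.
    const cleanup = () => {
      inFlight.delete(msisdn);
    };
    request.then(cleanup, cleanup);
    return request;
  };
}
```

This complements the TTL cache rather than replacing it: the cache covers repeat traffic over hours, while coalescing covers the few hundred milliseconds in which the first lookup has not returned yet.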
Section 4: Routing decision engine design
Given:
- Enriched message context (CarrierInfo, country, app metadata).
- Message type (OTP, transactional, marketing).
- Customer/account configuration.
The routing engine must pick:
- Route profile
  - E.g., OTP_US, PROMO_US, ALERT_EU, etc.
  - Encapsulates:
    - Preferred carriers/routes.
    - Throughput caps.
    - Allowed sender types.
- Pool / grid
  - E.g., US_OTP_VERIZON_GRID_A, US_PROMO_ATT_GRID_B.
  - Each grid:
    - Represents a collection of SIMs/numbers.
    - Has per-carrier capacity and health metrics.
- Sender within the grid
  - Based on:
    - Rotation strategy.
    - Health.
    - Local constraints.
Decision flow (simplified)
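function routeMessage(msg: Message, carrier: CarrierInfo): RouteDecision {
  const profile = selectProfile(msg, carrier);
  const candidateGrids = findEligibleGrids(profile, carrier);
  if (candidateGrids.length === 0) {
    // No healthy grid for this carrier: let the caller fall back or defer.
    throw new Error(`No eligible grid for carrier ${carrier.carrierId}`);
  }
  const grid = selectBestGrid(candidateGrids);
  const sender = pickSenderFromGrid(grid);
  return { profileId: profile.id, gridId: grid.id, senderId: sender.id };
}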
Where:
- selectProfile uses:
  - Message type (OTP vs promo).
  - Country/region.
  - Risk/vertical (e.g., crypto/adult).
- findEligibleGrids filters by:
  - Country.
  - Carrier compatibility.
  - Health thresholds.
- selectBestGrid might:
  - Prefer grids with:
    - Healthy error/complaint rates.
    - Available capacity.
  - Avoid:
    - Grids approaching thresholds.
- pickSenderFromGrid:
  - Implements rotation:
    - Round-robin.
    - Weighted.
    - Health-aware (avoid bad senders).
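To make the filtering and ranking steps concrete, here is a minimal sketch of the two grid helpers. Everything in it is an assumption for illustration: the Grid fields, the two thresholds, the simplified signatures (plain country and carrier ID instead of a profile object), and the scoring formula, which simply favors grids that are both healthy and lightly loaded.

```typescript
type Grid = {
  id: string;
  country: string;
  carriers: string[]; // carrier IDs this grid is vetted for
  deliveredRate: number; // 0..1 over a recent window
  capacityUsed: number; // 0..1 share of the throughput cap currently in use
};

const MIN_DELIVERED_RATE = 0.95; // assumed health threshold
const MAX_CAPACITY_USED = 0.9; // assumed headroom limit

// Keep only grids that match the destination and are healthy with headroom.
function findEligibleGrids(grids: Grid[], country: string, carrierId: string): Grid[] {
  return grids.filter(
    (g) =>
      g.country === country &&
      g.carriers.includes(carrierId) &&
      g.deliveredRate >= MIN_DELIVERED_RATE &&
      g.capacityUsed < MAX_CAPACITY_USED,
  );
}

// Rank by delivered rate scaled by remaining headroom; the highest score wins.
function selectBestGrid(candidates: Grid[]): Grid | undefined {
  const score = (g: Grid) => g.deliveredRate * (1 - g.capacityUsed);
  return candidates.reduce<Grid | undefined>(
    (best, g) => (best === undefined || score(g) > score(best) ? g : best),
    undefined,
  );
}
```

Returning undefined for an empty candidate list pushes the "no grid available" decision up to the caller, which is where fallback logic usually lives.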
Section 5: Pool/grids and rotation logic
Grids as the main unit of isolation
A grid might be defined by:
- Region: US.
- Carrier mix: Verizon-only, AT&T-only, multi-carrier.
- Use-case: OTP, PROMO, ALERT.
- Priority level.
Each grid tracks:
- Total sends.
- Delivered/failed breakdown.
- Hard-fail codes.
- Complaint/unsub rates.
Rotation strategies
Simplest:
- Round-robin across active senders.
Better:
- Health-aware rotation:
  - Skip senders with:
    - High recent error rates.
    - High complaint ratios.
  - Weight in favor of:
    - Newer, healthy senders.
Example:
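function pickSenderFromGrid(grid: GridState): Sender {
  // Drop senders whose health score is at or below the cut-off.
  const healthy = grid.senders.filter((s) => s.healthScore > MIN_HEALTH);
  if (healthy.length === 0) {
    throw new Error("No healthy sender available in grid");
  }
  // buildWeightedList is assumed to favor senders with a larger weight.
  const weighted = buildWeightedList(healthy, (s) => s.weight);
  return randomChoice(weighted);
}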
With healthScore based on:
- Recent delivered rate.
- Hard-fail rate.
- Complaint rate.
- Time since last verification/warmup.
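As a rough illustration of how these inputs could combine, the sketch below computes a health score from three of the four signals and performs a weighted random pick without materializing a repeated list. The penalty multipliers (5 and 20) are assumptions chosen so that small hard-fail and complaint rates still move the score; time since last verification is left out for brevity, and weightedPick is a hypothetical name.

```typescript
type SenderStats = {
  id: string;
  deliveredRate: number; // 0..1
  hardFailRate: number; // 0..1
  complaintRate: number; // 0..1
};

// Assumed formula: start from the delivered rate, subtract scaled penalties,
// and clamp the result to the 0..1 range.
function healthScore(s: SenderStats): number {
  const raw = s.deliveredRate - 5 * s.hardFailRate - 20 * s.complaintRate;
  return Math.max(0, Math.min(1, raw));
}

// Weighted random pick: each item is chosen with probability proportional to
// its weight. The random source is injectable so the choice can be tested.
function weightedPick<T>(
  items: T[],
  weightOf: (item: T) => number,
  rng: () => number = Math.random,
): T {
  const total = items.reduce((sum, item) => sum + weightOf(item), 0);
  if (items.length === 0 || total <= 0) {
    throw new Error("no item with positive weight");
  }
  let threshold = rng() * total;
  for (const item of items) {
    threshold -= weightOf(item);
    if (threshold < 0) return item;
  }
  return items[items.length - 1]; // guards against floating-point drift
}
```

Using the health score itself as the weight, for example weightedPick(senders, healthScore), gives healthier senders proportionally more traffic while still exercising the rest.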
Retirement and cooldown
Implement rules like:
- Retire or cool a sender when:
  - Hard-fail rate > 1–2% over last N messages.
  - Complaint rate > 0.3–0.5% in a period.
  - Carrier-specific error codes spike.
Retired senders:
- Are taken out of active rotation.
- May be re‑tested later with small, safe traffic.
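The rules above can be expressed as a small pure function over a rolling window. This is a sketch under stated assumptions: mapping the low end of each range to a cooldown and the high end to retirement is our interpretation, and the minimum sample size of 200 sends is an invented guard against judging a sender on too little traffic.

```typescript
type WindowStats = { sent: number; hardFails: number; complaints: number };
type SenderAction = "keep" | "cooldown" | "retire";

// Assumed limits: low end of each range triggers a cooldown, high end retires.
const LIMITS = {
  minSample: 200, // assumption: ignore windows with too few sends
  hardFailCooldown: 0.01,
  hardFailRetire: 0.02,
  complaintCooldown: 0.003,
  complaintRetire: 0.005,
};

function evaluateSender(w: WindowStats): SenderAction {
  if (w.sent < LIMITS.minSample) return "keep";
  const hardFailRate = w.hardFails / w.sent;
  const complaintRate = w.complaints / w.sent;
  if (hardFailRate > LIMITS.hardFailRetire || complaintRate > LIMITS.complaintRetire) {
    return "retire";
  }
  if (hardFailRate > LIMITS.hardFailCooldown || complaintRate > LIMITS.complaintCooldown) {
    return "cooldown";
  }
  return "keep";
}
```

Keeping the evaluation pure (stats in, action out) makes it easy to replay historical windows against new thresholds before changing them in production.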
Section 6: Fallbacks, retries, and failure modes
Even with good routing, things break:
- Carriers have outages.
- Specific routes become degraded.
- A grid gets temporarily burned.
Fallback principles
- Prefer in‑family fallbacks first
  - Move from Grid A → Grid B within the same profile/country.
  - Keep OTP on OTP grids, promos on promo grids.
- Avoid instant, repeated retries on the same broken path
  - Back off aggressively:
    - Exponential or linear backoff.
    - Mark failing routes/grids as degraded.
- Graceful degradation
  - For OTP:
    - Try alternative sender within same carrier family.
    - Consider slower, but more reliable fallback.
  - For promos:
    - Reduce send rate.
    - Defer sends if carriers are clearly unstable.
Example retry logic (simplified)
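async function dispatchMessage(decision: RouteDecision, msg: Message) {
  let result;
  try {
    result = await sendToCarrier(decision, msg);
  } catch (err) {
    // Primary path failed: flag it so routing stops selecting it, then try one fallback.
    markRouteAsDegraded(decision, err);
    const fallbackDecision = findFallback(decision, msg);
    if (!fallbackDecision) throw err; // no alternative path: surface the original error
    const fallbackResult = await sendToCarrier(fallbackDecision, msg);
    updateMetrics(fallbackDecision, fallbackResult);
    return fallbackResult;
  }
  // Record metrics outside the try block so a metrics failure cannot trigger a second send.
  updateMetrics(decision, result);
  return result;
}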
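The "back off aggressively" and "mark as degraded" principles can be sketched as follows. The base delay of 500 ms, the 30-second cap, and the DegradedRoutes class are all assumptions for illustration; the jitter variant shown is the common "full jitter" approach, where the actual delay is drawn uniformly below the exponential ceiling so that many workers do not retry in lockstep.

```typescript
// Upper bound for the wait before retry number `attempt` (0-based):
// doubles on every attempt and is capped at maxMs.
function backoffCeilingMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Full jitter: pick a random delay between 0 and the ceiling.
function backoffDelayMs(attempt: number, rng: () => number = Math.random): number {
  return Math.floor(rng() * backoffCeilingMs(attempt));
}

// Hypothetical registry that keeps failing routes out of rotation for a
// period that grows with each consecutive failure.
class DegradedRoutes {
  private readonly until = new Map<string, number>();
  private readonly failures = new Map<string, number>();

  markFailure(routeId: string, now: number): void {
    const count = (this.failures.get(routeId) ?? 0) + 1;
    this.failures.set(routeId, count);
    this.until.set(routeId, now + backoffCeilingMs(count - 1));
  }

  markSuccess(routeId: string): void {
    this.failures.delete(routeId);
    this.until.delete(routeId);
  }

  isDegraded(routeId: string, now: number): boolean {
    return (this.until.get(routeId) ?? 0) > now;
  }
}
```

Passing the current time in explicitly, rather than calling Date.now() inside the class, keeps the registry deterministic under test.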
Section 7: Observability, logging, and debugging
Carrier-matching routing is only as good as its observability.
You want to be able to ask:
- “Show me all messages to Verizon in the last 24h routed via Grid A vs Grid B.”
- “Which senders in Grid C have the highest hard-fail rate?”
- “What changed around the time deliverability dropped?”
Minimal log fields
For each message:
- message_id
- timestamp
- customer_id (or project/app ID)
- destination_msisdn (hashed/pseudonymized if needed)
- carrier_id
- country_code
- profile_id
- grid_id
- sender_id
- route_id / upstream ID
- status (queued, sent, delivered, failed, unknown)
- error_code (if any)
- dlr_timestamp
- latency_ms
- campaign_id or flow_id (if applicable)
Dashboards
- Carrier × grid heatmaps:
  - Delivered rate.
  - Hard-fail rate.
- Sender leaderboards:
  - Sorted by health and throughput.
- Anomaly detection:
  - Alerts when:
    - Carrier X, Grid Y delivered rate falls below threshold.
    - Error codes spike.
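A carrier × grid heatmap and its threshold alert both reduce to the same aggregation over log records. The sketch below assumes a trimmed-down record with only the fields needed here; the function names and the default minimum of 100 messages per cell (to avoid alerting on noise) are illustrative choices.

```typescript
type DeliveryRecord = {
  carrierId: string;
  gridId: string;
  status: "delivered" | "failed" | "unknown";
};

type CellStats = {
  carrierId: string;
  gridId: string;
  total: number;
  delivered: number;
  deliveredRate: number;
};

// Group records by (carrier, grid) and compute the delivered rate per cell.
function deliveredRateByCell(records: DeliveryRecord[]): CellStats[] {
  const cells = new Map<string, CellStats>();
  for (const r of records) {
    const key = `${r.carrierId}|${r.gridId}`;
    let cell = cells.get(key);
    if (!cell) {
      cell = { carrierId: r.carrierId, gridId: r.gridId, total: 0, delivered: 0, deliveredRate: 0 };
      cells.set(key, cell);
    }
    cell.total += 1;
    if (r.status === "delivered") cell.delivered += 1;
    cell.deliveredRate = cell.delivered / cell.total;
  }
  return Array.from(cells.values());
}

// Cells with enough traffic whose delivered rate is below the alert threshold.
function cellsBelowThreshold(cells: CellStats[], minRate: number, minTotal = 100): CellStats[] {
  return cells.filter((c) => c.total >= minTotal && c.deliveredRate < minRate);
}
```

In a real deployment this aggregation would more likely run in the metrics store than in application code, but the shape of the query is the same.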
Example incident workflow
- Alert: “Verizon deliverability dropped >3 points on Grid US_PROMO_A.”
- Use logs:
  - Check error codes and volumes.
  - Compare with other grids.
- Mitigate:
  - Temporarily move Verizon promo traffic to Grid US_PROMO_B.
  - Reduce send rate.
- Investigate:
  - Recent content/template changes.
  - Changes to routing configuration.
FAQ: Carrier-matching routing for developers
1. Do we need HLR/lookup for every message?
Not necessarily.
Options:
- Cache results for a reasonable TTL.
- Resolve ahead of time for high-traffic users.
- Batch lookups when seeding grids.
2. How do we handle number portability?
Ported numbers can change carriers. Good practices:
- Periodically refresh carrier info for:
- High-frequency destinations.
- Numbers with repeated failures.
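A refresh policy along these lines can be a simple age check whose limit depends on how the number has been behaving. The sketch below is only one possible shape: the field names, the failure count of 3, the volume cut-off of 20 sends, and the 1/7/30-day limits are all assumptions to be tuned against lookup cost.

```typescript
type CachedCarrier = {
  lastUpdated: number; // epoch milliseconds of the last lookup
  consecutiveFailures: number; // delivery failures since the last success
  sendsLast30d: number; // recent traffic volume to this number
};

const DAY_MS = 86_400_000;

// Decide whether the cached carrier for a number should be looked up again.
function needsRefresh(c: CachedCarrier, now: number): boolean {
  const age = now - c.lastUpdated;
  // Repeated failures may indicate a port: re-check after a day.
  if (c.consecutiveFailures >= 3) return age > DAY_MS;
  // High-frequency destinations: refresh weekly.
  if (c.sendsLast30d >= 20) return age > 7 * DAY_MS;
  // Everything else: refresh monthly.
  return age > 30 * DAY_MS;
}
```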
3. Is carrier matching only relevant in the US?
No. It’s especially useful:
- Wherever multiple operators behave differently.
- Where sender IDs and templates are operator-specific (many EU/APAC markets).
4. How does this interact with A2P 10DLC and registered campaigns?
Carrier matching:
- Uses the registered campaigns and senders per carrier correctly.
- Helps you stay within throughput and content expectations per campaign.
5. What about privacy and PII?
A privacy-first implementation:
- Hashes MSISDNs in logs.
- Stores minimal data.
- Keeps carrier and routing metadata, not raw content.
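For the hashing step, a keyed hash is generally safer than a plain one, because the space of valid phone numbers is small enough to enumerate. The following sketch uses HMAC-SHA256 from Node's built-in crypto module; the function name is hypothetical, and how the secret key is stored and rotated is outside its scope.

```typescript
import { createHmac } from "node:crypto";

// Pseudonymize a phone number for logs: the same number and key always yield
// the same token, so records remain joinable, but the token cannot be
// reversed or brute-forced without the key.
function pseudonymizeMsisdn(msisdn: string, secretKey: string): string {
  return createHmac("sha256", secretKey).update(msisdn).digest("hex");
}
```

Rotating the key breaks joins across the rotation boundary, which may or may not be desirable depending on retention policy.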
6. Can we layer carrier matching on top of an existing CPaaS?
Sometimes. If the CPaaS exposes per-carrier controls and per-sender statistics, you can build a meta-routing layer on top.
The strongest implementations, however, run on owned infrastructure (SIMs, private grids).
Conclusion: from best-effort to engineered routing
Most SMS programs live on best-effort routing:
- Provider chooses cheap/available routes.
- You get 1–2 metrics.
- You hope for the best.
Carrier-matching routing turns SMS into an engineered system:
- Deterministic per-carrier path choices.
- Isolated grids and pools.
- Health-aware rotation and fallbacks.
- Rich observability for incidents.
If you care about:
- Hitting and sustaining 99.4%+ deliverability.
- Surviving promo spikes and high-risk use cases.
- Giving your SRE/infra team levers they can understand and trust.
…then implementing or choosing a gateway with serious carrier-matching architecture isn’t a nice‑to‑have; it’s the only sane long‑term strategy.
Dach SMS Lab