Introduction: behind the scenes of carrier-matching SMS routing

Most SMS providers describe routing as a black box:

“We’ll use the best route based on quality and price.”

If you’re an engineer or architect responsible for uptime, that’s not enough. You need to know:

Which path a message took.
Which sender it used.
Why the system chose that combination.
How it will behave under failures and spikes.

Carrier-matching routing is one of the key reasons some gateways consistently hit 99.4%+ deliverability in tough verticals while others stagnate in the low‑ to mid‑90s. In this post we’ll take a deep dive into the architecture:

The intelligence layer (carrier and line‑type detection).
The routing decision engine.
Pool/grid selection and rotation logic.
Fallback strategies.
Observability and debugging.

This isn’t vendor marketing. It’s the practical architecture we’ve seen work across millions of messages per day.

Section 1: What “carrier matching” actually means

At a high level, carrier matching is:

For each destination number, choose a sender and route that best match the destination’s carrier and context.

Instead of:

Sending everything via cheapest generic routes.
Mixing all carriers and use-cases on the same senders.

Carrier matching aims to:

Use Verizon-vetted senders for Verizon subscribers.
Use AT&T-vetted senders for AT&T subscribers.
Keep per-carrier reputation isolated and predictable.

Benefits we see in practice:

A 3–12 percentage-point deliverability uplift on specific carriers compared to generic routing.
Lower variance in performance over time.
Cleaner root-cause analysis when issues arise (one carrier, one grid).

Section 2: High-level architecture

A carrier-matching SMS gateway typically has these components:

Ingress API
- Receives message requests (/messages).
- Validates payload, auth, and basic schema.
Normalization & enrichment
- Normalizes phone numbers (E.164).
- Enriches with:
  - Carrier info.
  - Country/region.
  - Line type (mobile, VoIP, landline when available).
  - Risk signals.
Routing decision engine
- Given the enriched context + app metadata:
  - Chooses route profile (e.g., OTP_US, Promo_EU).
  - Selects a pool/grid.
  - Picks a sender within that grid.
- Applies per-carrier and per-grid rules.
Queueing & dispatch
- Places messages into per-route queues.
- Applies:
  - Rate limiting.
  - Burst control.
  - Retry strategies.
Delivery receipts & feedback
- Ingests DLRs (delivery receipts).
- Updates:
  - Pool/grid health.
  - Sender reputation metrics.
- Feeds back into routing decisions.
Observability plane
- Metrics, logs, traces.
- Queryable by:
  - Carrier.
  - Pool/grid.
  - Sender.
  - Campaign.

Section 3: The carrier intelligence layer

Before you can match carriers, you need to know them.

Inputs

Phone number in E.164 format.
Optionally:
- Country code from app context.
- Known user metadata (e.g., previously resolved carrier).
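
Since everything downstream keys on the E.164 form, it helps to reject malformed input before spending a lookup on it. The sketch below is a minimal shape check, not a full numbering-plan validator: `toE164` and `E164_PATTERN` are hypothetical names, and the function only strips common formatting and the international "00" prefix. It does not guess a missing country code or confirm that the number is assigned.

```typescript
// Hypothetical helper: cleans common formatting and checks the E.164 shape
// ("+", a non-zero leading digit, at most 15 digits in total).
const E164_PATTERN = /^\+[1-9]\d{1,14}$/;

function toE164(raw: string): string | null {
  // Drop spaces, dashes, dots, and parentheses that users commonly type.
  let cleaned = raw.trim().replace(/[\s\-.()]/g, "");
  // Treat the international "00" prefix as "+".
  if (cleaned.startsWith("00")) {
    cleaned = "+" + cleaned.slice(2);
  }
  // Anything that still does not match the shape is rejected, not guessed.
  return E164_PATTERN.test(cleaned) ? cleaned : null;
}
```

Returning `null` instead of guessing keeps bad input out of the carrier cache; a production system would more likely delegate this step to a dedicated phone-number parsing library.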

Sources

HLR / carrier lookup providers.
Phone number intelligence APIs.
Internal caches (recently resolved numbers).

Outputs

For a given destination:

carrier_id: e.g., verizon_us, att_us, tmobile_us, o2_uk, etc.
country_code: US, GB, DE, etc.
line_type: mobile, fixed, voip (when available).
risk_flags: ported recently, suspicious ranges, etc. (optional).

Caching strategy

Warm caches on:
- High-traffic destinations.
- Known frequent senders (e.g., OTP-heavy users).
Respect:
- Lookup provider rate limits.
- Data freshness constraints.

Example (pseudo-code):

type CarrierInfo = {
  carrierId: string;
  country: string;
  lineType?: string;
  lastUpdated: number;
};

async function resolveCarrier(msisdn: string): Promise<CarrierInfo> {
  const cached = await carrierCache.get(msisdn);
  if (cached && Date.now() - cached.lastUpdated < CACHE_TTL_MS) {
    return cached;
  }

  const lookup = await externalLookup(msisdn);

  const info: CarrierInfo = {
    carrierId: lookup.carrierId,
    country: lookup.countryCode,
    lineType: lookup.lineType,
    lastUpdated: Date.now(),
  };

  // Await the write so cache errors surface here instead of being dropped.
  await carrierCache.set(msisdn, info);
  return info;
}
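
The example above leaves `carrierCache` and `CACHE_TTL_MS` undefined. As a rough illustration of the freshness constraint, here is one possible in-memory stand-in. The class name `InMemoryCarrierCache` and the injectable clock are assumptions made for this sketch; a multi-node gateway would more likely use a shared store.

```typescript
type CarrierInfo = {
  carrierId: string;
  country: string;
  lineType?: string;
  lastUpdated: number;
};

// Hypothetical in-memory cache that refuses to return stale entries.
class InMemoryCarrierCache {
  private readonly entries = new Map<string, CarrierInfo>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so freshness logic can be tested without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the entry only while it is younger than the TTL.
  get(msisdn: string): CarrierInfo | undefined {
    const entry = this.entries.get(msisdn);
    if (!entry) return undefined;
    if (this.now() - entry.lastUpdated >= this.ttlMs) {
      this.entries.delete(msisdn); // evict stale data
      return undefined;
    }
    return entry;
  }

  set(msisdn: string, info: CarrierInfo): void {
    this.entries.set(msisdn, info);
  }
}
```

Because `get` and `set` are synchronous here, awaiting them (as `resolveCarrier` does) is harmless; swapping in an asynchronous store later would not change the calling code.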

Section 4: Routing decision engine design

Given:

Enriched message context (CarrierInfo, country, app metadata).
Message type (OTP, transactional, marketing).
Customer/account configuration.

The routing engine must pick:

Route profile
- E.g., OTP_US, PROMO_US, ALERT_EU, etc.
- Encapsulates:
  - Preferred carriers/routes.
  - Throughput caps.
  - Allowed sender types.
Pool / grid
- E.g., US_OTP_VERIZON_GRID_A, US_PROMO_ATT_GRID_B.
- Each grid:
  - Represents a collection of SIMs/numbers.
  - Has per-carrier capacity and health metrics.
Sender within the grid
- Based on:
  - Rotation strategy.
  - Health.
  - Local constraints.

Decision flow (simplified)

function routeMessage(msg: Message, carrier: CarrierInfo): RouteDecision {
  const profile = selectProfile(msg, carrier);

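  const candidateGrids = findEligibleGrids(profile, carrier);
  if (candidateGrids.length === 0) {
    // No grid passed the country/carrier/health filters.
    throw new Error("No eligible grid for this message");
  }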

  const grid = selectBestGrid(candidateGrids);

  const sender = pickSenderFromGrid(grid, msg);

  return { profileId: profile.id, gridId: grid.id, senderId: sender.id };
}

Where:

selectProfile uses:
- Message type (OTP vs promo).
- Country/region.
- Risk/vertical (e.g., crypto/adult).
findEligibleGrids filters by:
- Country.
- Carrier compatibility.
- Health thresholds.
selectBestGrid might:
- Prefer grids with:
  - Healthy error/complaint rates.
  - Available capacity.
- Avoid:
  - Grids approaching thresholds.
pickSenderFromGrid:
- Implements rotation:
  - Round-robin.
  - Weighted.
  - Health-aware (avoid bad senders).
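
To make the filtering and ranking steps concrete, here is a minimal sketch of `findEligibleGrids` and `selectBestGrid`. Every field name, the `Target` type, and the `MIN_GRID_HEALTH` threshold are assumptions for illustration, and the signatures are simplified compared with the decision flow above (they take a plain grid list and a target rather than a profile object).

```typescript
// All field names below are assumptions made for this sketch.
interface Grid {
  id: string;
  country: string;
  carriers: string[]; // carrier IDs this grid is vetted for
  useCase: "OTP" | "PROMO" | "ALERT";
  healthScore: number; // 0..1, higher is better
  capacityRemaining: number; // messages the grid can still absorb
}

interface Target {
  country: string;
  carrierId: string;
  useCase: Grid["useCase"];
}

const MIN_GRID_HEALTH = 0.8; // illustrative threshold

// Filter by country, use case, carrier compatibility, health, and capacity.
function findEligibleGrids(grids: Grid[], target: Target): Grid[] {
  return grids.filter(
    (g) =>
      g.country === target.country &&
      g.useCase === target.useCase &&
      g.carriers.includes(target.carrierId) &&
      g.healthScore >= MIN_GRID_HEALTH &&
      g.capacityRemaining > 0,
  );
}

// Prefer the healthiest grid; break ties by remaining capacity.
function selectBestGrid(candidates: Grid[]): Grid | undefined {
  return [...candidates].sort(
    (a, b) =>
      b.healthScore - a.healthScore ||
      b.capacityRemaining - a.capacityRemaining,
  )[0];
}
```

Sorting a copy keeps the caller's list untouched, and returning `undefined` for an empty candidate list forces the caller to handle the "no eligible grid" case explicitly.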

Section 5: Pool/grids and rotation logic

Grids as the main unit of isolation

A grid might be defined by:

Region: US.
Carrier mix: Verizon-only, AT&T-only, multi-carrier.
Use-case: OTP, PROMO, ALERT.
Priority level.

Each grid tracks:

Total sends.
Delivered/failed breakdown.
Hard-fail codes.
Complaint/unsub rates.
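
One way to hold these counters is a small per-grid stats object that is updated at dispatch time and again when receipts arrive. The sketch below is an assumption about shape, not a prescribed schema: `GridStats`, `Outcome`, and the helper names are hypothetical, and rates are computed against total sends for simplicity.

```typescript
interface GridStats {
  sent: number;
  delivered: number;
  hardFails: number;
  complaints: number;
  hardFailCodes: Map<string, number>; // error code -> count
}

type Outcome =
  | { kind: "delivered" }
  | { kind: "hard_fail"; code: string }
  | { kind: "complaint" };

function newGridStats(): GridStats {
  return { sent: 0, delivered: 0, hardFails: 0, complaints: 0, hardFailCodes: new Map<string, number>() };
}

// Called when a message is handed to the grid.
function recordSend(stats: GridStats): void {
  stats.sent += 1;
}

// Called when a delivery receipt or complaint arrives.
function recordOutcome(stats: GridStats, outcome: Outcome): void {
  switch (outcome.kind) {
    case "delivered":
      stats.delivered += 1;
      break;
    case "hard_fail":
      stats.hardFails += 1;
      stats.hardFailCodes.set(outcome.code, (stats.hardFailCodes.get(outcome.code) ?? 0) + 1);
      break;
    case "complaint":
      stats.complaints += 1;
      break;
  }
}

// Derived rates; a grid with no sends reports zero for everything.
function gridRates(stats: GridStats) {
  const denom = stats.sent || 1;
  return {
    deliveredRate: stats.delivered / denom,
    hardFailRate: stats.hardFails / denom,
    complaintRate: stats.complaints / denom,
  };
}
```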

Rotation strategies

Simplest:

Round-robin across active senders.

Better:

Health-aware rotation:
- Skip senders with:
  - High recent error rates.
  - High complaint ratios.
- Weight in favor of:
  - Newer, healthy senders.

Example:

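function pickSenderFromGrid(grid: GridState, msg: Message): Sender {
  // Keep only senders above the health floor.
  const healthy = grid.senders.filter((s) => s.healthScore > MIN_HEALTH);
  if (healthy.length === 0) {
    throw new Error(`No healthy senders in grid ${grid.id}`);
  }
  // msg is accepted to match the call in routeMessage; this sketch does not use it.
  const weighted = buildWeightedList(healthy, (s) => s.weight);
  return randomChoice(weighted);
}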

With:

healthScore based on:
- Recent delivered rate.
- Hard-fail rate.
- Complaint rate.
- Time since last verification/warmup.
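
The post does not define how `healthScore` is computed or how the weighted pick works, so the following is only one plausible shape. The weights, the 30-day staleness cap, and the names `SenderMetrics` and `weightedPick` are all assumptions; `weightedPick` folds the `buildWeightedList` and `randomChoice` steps into a single function and takes an injectable random source so the behavior can be tested deterministically.

```typescript
interface SenderMetrics {
  deliveredRate: number; // 0..1 over the recent window
  hardFailRate: number; // 0..1
  complaintRate: number; // 0..1
  daysSinceWarmup: number; // days since last verification/warmup
}

// Illustrative weights and scaling; tune against real traffic.
function healthScore(m: SenderMetrics): number {
  const staleness = Math.min(m.daysSinceWarmup / 30, 1); // caps at 30 days
  const score =
    m.deliveredRate -
    5 * m.hardFailRate -
    20 * m.complaintRate -
    0.1 * staleness;
  return Math.max(0, Math.min(1, score)); // clamp to [0, 1]
}

// Weighted random choice over items with non-negative weights.
function weightedPick<T>(
  items: T[],
  weightOf: (item: T) => number,
  rand: () => number = Math.random,
): T {
  const total = items.reduce((sum, it) => sum + weightOf(it), 0);
  if (items.length === 0 || total <= 0) {
    throw new Error("weightedPick needs at least one item with positive weight");
  }
  let threshold = rand() * total;
  for (const item of items) {
    threshold -= weightOf(item);
    if (threshold < 0) return item;
  }
  return items[items.length - 1]; // guards against floating-point drift
}
```

Penalizing complaints more heavily than hard fails reflects the much lower complaint thresholds used for retirement, but the exact multipliers here are placeholders.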

Retirement and cooldown

Implement rules like:

Retire or cool a sender when:
- Its hard-fail rate exceeds 1–2% over the last N messages.
- Its complaint rate exceeds 0.3–0.5% in a given period.
- Carrier-specific error codes spike.

Retired senders:

Are taken out of active rotation.
May be re‑tested later with small, safe traffic.
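
These rules can be expressed as a small state evaluation. In the sketch below we assume the lower end of each range triggers a cooldown and the upper end triggers retirement; that split, the `MIN_SAMPLE` guard, and all names are our assumptions rather than fixed rules.

```typescript
type SenderState = "active" | "cooling" | "retired";

interface SenderWindow {
  sent: number; // messages in the evaluation window (the "last N")
  hardFails: number;
  complaints: number;
}

// Illustrative thresholds taken from the ranges in the rules above.
const COOL_HARD_FAIL = 0.01;
const RETIRE_HARD_FAIL = 0.02;
const COOL_COMPLAINT = 0.003;
const RETIRE_COMPLAINT = 0.005;
const MIN_SAMPLE = 200; // assumption: do not judge a sender on too few messages

function evaluateSender(w: SenderWindow): SenderState {
  if (w.sent < MIN_SAMPLE) return "active";
  const hardFailRate = w.hardFails / w.sent;
  const complaintRate = w.complaints / w.sent;
  if (hardFailRate > RETIRE_HARD_FAIL || complaintRate > RETIRE_COMPLAINT) {
    return "retired";
  }
  if (hardFailRate > COOL_HARD_FAIL || complaintRate > COOL_COMPLAINT) {
    return "cooling";
  }
  return "active";
}
```

The minimum-sample guard matters in practice: without it, a single failure on a freshly added sender could push its rate over the threshold immediately.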

Section 6: Fallbacks, retries, and failure modes

Even with good routing, things break:

Carriers have outages.
Specific routes become degraded.
A grid gets temporarily burned.

Fallback principles

Prefer in‑family fallbacks first
- Move from Grid A → Grid B within the same profile/country.
- Keep OTP on OTP grids, promos on promo grids.
Avoid instant, repeated retries on the same broken path
- Back off aggressively:
  - Exponential or linear backoff.
- Mark failing routes/grids as degraded.
Graceful degradation
- For OTP:
  - Try alternative sender within same carrier family.
  - Consider a slower but more reliable fallback.
- For promos:
  - Reduce send rate.
  - Defer sends if carriers are clearly unstable.
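
For the backoff principle, a common pattern is exponential backoff with random jitter, so that many queued messages do not all retry at the same instant. The sketch below computes only the delay; the function name, the default base and cap, and the injectable random source are assumptions for illustration.

```typescript
// Exponential backoff with "full jitter": a random delay in [0, cap),
// where cap doubles with each attempt up to maxMs.
function backoffDelayMs(
  attempt: number, // 0 for the first retry
  baseMs = 500, // illustrative defaults
  maxMs = 30_000,
  rand: () => number = Math.random,
): number {
  const cap = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * cap);
}
```

For OTP traffic the cap would usually be far lower than for promos, since a code that arrives after its expiry window is no better than a failed send.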

Example retry logic (simplified)

async function dispatchMessage(decision: RouteDecision, msg: Message) {
  try {
    const result = await sendToCarrier(decision, msg);

    updateMetrics(decision, result);
    return result;
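  } catch (err) {
    // Flag the failing route so later routing decisions can avoid it.
    markRouteAsDegraded(decision, err);

    // One fallback attempt only; rethrow the original error if none exists.
    const fallbackDecision = findFallback(decision, msg);
    if (!fallbackDecision) throw err;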

    const fallbackResult = await sendToCarrier(fallbackDecision, msg);
    updateMetrics(fallbackDecision, fallbackResult);
    return fallbackResult;
  }
}
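
`findFallback` is not defined in this post. As a sketch of the "in-family first" principle, the version below looks for another non-degraded grid with the same profile and country. The `GridInfo` shape is hypothetical, and the signature takes a grid list instead of the message to keep the example self-contained.

```typescript
interface RouteDecision {
  profileId: string;
  gridId: string;
  senderId: string;
}

// Hypothetical view of grid state used only for fallback selection.
interface GridInfo {
  id: string;
  profileId: string;
  country: string;
  degraded: boolean;
  senderIds: string[];
}

function findFallback(current: RouteDecision, grids: GridInfo[]): RouteDecision | null {
  const failed = grids.find((g) => g.id === current.gridId);
  if (!failed) return null;

  // Stay in-family: same profile and country, different grid, not degraded.
  const candidate = grids.find(
    (g) =>
      g.id !== failed.id &&
      g.profileId === current.profileId &&
      g.country === failed.country &&
      !g.degraded &&
      g.senderIds.length > 0,
  );
  if (!candidate) return null;

  // Taking the first sender keeps the sketch short; a real system would
  // run the grid's normal rotation logic here.
  return {
    profileId: current.profileId,
    gridId: candidate.id,
    senderId: candidate.senderIds[0],
  };
}
```

Returning `null` when no in-family grid is available leaves the decision about cross-family fallback (for example, OTP onto a promo grid) to an explicit policy rather than an accident of the search.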

Section 7: Observability, logging, and debugging

Carrier-matching routing is only as good as its observability.

You want to be able to ask:

“Show me all messages to Verizon in the last 24h routed via Grid A vs Grid B.”
“Which senders in Grid C have the highest hard-fail rate?”
“What changed around the time deliverability dropped?”

Minimal log fields

For each message:

message_id
timestamp
customer_id (or project/app ID)
destination_msisdn (hashed/pseudonymized if needed)
carrier_id
country_code
profile_id
grid_id
sender_id
route_id / upstream ID
status (queued, sent, delivered, failed, unknown)
error_code (if any)
dlr_timestamp
latency_ms
campaign_id or flow_id (if applicable)
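
Typing the log record keeps these fields consistent across services. The interface below mirrors the list above; treating timestamps as ISO 8601 strings, making the receipt-related fields optional, and emitting one JSON object per line are our assumptions.

```typescript
type MessageStatus = "queued" | "sent" | "delivered" | "failed" | "unknown";

interface MessageLogRecord {
  message_id: string;
  timestamp: string; // ISO 8601 (assumption)
  customer_id: string;
  destination_msisdn: string; // hashed or pseudonymized if needed
  carrier_id: string;
  country_code: string;
  profile_id: string;
  grid_id: string;
  sender_id: string;
  route_id: string;
  status: MessageStatus;
  error_code?: string; // present only on failures
  dlr_timestamp?: string; // present once a receipt arrives
  latency_ms?: number;
  campaign_id?: string;
}

// One JSON object per line is easy to ship to most log pipelines.
function toLogLine(record: MessageLogRecord): string {
  return JSON.stringify(record);
}
```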

Dashboards

Carrier × grid heatmaps:
- Delivered rate.
- Hard-fail rate.
Sender leaderboards:
- Sorted by health and throughput.
Anomaly detection:
- Alerts when:
  - Carrier X, Grid Y delivered rate falls below threshold.
  - Error codes spike.
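
The carrier × grid threshold alert boils down to a group-by over delivery events. Here is a minimal in-memory sketch; in production this would normally be a query against a metrics store, and the names `DeliveryEvent`, `CellAlert`, and `findLowDeliveryCells` are hypothetical.

```typescript
interface DeliveryEvent {
  carrierId: string;
  gridId: string;
  delivered: boolean;
}

interface CellAlert {
  carrierId: string;
  gridId: string;
  deliveredRate: number;
  total: number;
}

// Groups events by (carrier, grid) and flags cells below `minRate`.
// Cells with fewer than `minSample` events are ignored to limit noise.
function findLowDeliveryCells(
  events: DeliveryEvent[],
  minRate: number,
  minSample: number,
): CellAlert[] {
  const cells = new Map<string, { carrierId: string; gridId: string; total: number; delivered: number }>();
  for (const e of events) {
    const key = `${e.carrierId}|${e.gridId}`;
    const cell = cells.get(key) ?? { carrierId: e.carrierId, gridId: e.gridId, total: 0, delivered: 0 };
    cell.total += 1;
    if (e.delivered) cell.delivered += 1;
    cells.set(key, cell);
  }

  const alerts: CellAlert[] = [];
  cells.forEach((c) => {
    const rate = c.delivered / c.total;
    if (c.total >= minSample && rate < minRate) {
      alerts.push({ carrierId: c.carrierId, gridId: c.gridId, deliveredRate: rate, total: c.total });
    }
  });
  return alerts;
}
```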

Example incident workflow

Alert: “Verizon deliverability dropped >3 points on Grid US_PROMO_A.”
Use logs:
- Check error codes and volumes.
- Compare with other grids.
Mitigate:
- Temporarily move Verizon promo traffic to Grid US_PROMO_B.
- Reduce send rate.
Investigate:
- Recent content/template changes.
- Changes to routing configuration.

FAQ: Carrier-matching routing for developers

1. Do we need HLR/lookup for every message?

Not necessarily.

Options:

Cache results for a reasonable TTL.
Resolve ahead of time for high-traffic users.
Batch lookups when seeding grids.

2. How do we handle number portability?

Ported numbers can change carriers. Good practices:

Periodically refresh carrier info for:
- High-frequency destinations.
- Numbers with repeated failures.
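
A refresh policy along these lines can be as simple as varying the maximum age of a cached carrier by how the destination behaves. All thresholds and names in this sketch are placeholders we chose for illustration.

```typescript
interface DestinationStats {
  lastResolvedAtMs: number; // when the carrier was last looked up
  sendsLast30d: number;
  consecutiveFailures: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical policy: refresh sooner for failing or high-frequency numbers.
function shouldRefreshCarrier(d: DestinationStats, nowMs: number): boolean {
  const ageDays = (nowMs - d.lastResolvedAtMs) / DAY_MS;
  if (d.consecutiveFailures >= 3) return ageDays >= 1; // repeated failures
  if (d.sendsLast30d >= 20) return ageDays >= 7; // high-frequency destination
  return ageDays >= 30; // everyone else
}
```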

3. Is carrier matching only relevant in the US?

No. It’s especially useful:

Wherever multiple operators behave differently.
Where sender IDs and templates are operator-specific (many EU/APAC markets).

4. How does this interact with A2P 10DLC and registered campaigns?

Carrier matching:

Routes each message through the campaign and sender registered for that carrier.
Helps you stay within throughput and content expectations per campaign.
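
Staying within a throughput limit is typically done with a token bucket keyed by campaign (or by campaign and carrier). The sketch below is generic: the capacity and refill rate are placeholders, not actual 10DLC limits, which depend on the campaign's registration.

```typescript
// Generic token bucket; time is passed in so behavior is deterministic.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // start full
    this.lastRefillMs = nowMs;
  }

  // Returns true and consumes one token if a send is allowed at `nowMs`.
  tryTake(nowMs: number): boolean {
    if (nowMs > this.lastRefillMs) {
      const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
      this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
      this.lastRefillMs = nowMs;
    }
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

A dispatcher would keep one bucket per campaign and requeue (not drop) any message for which `tryTake` returns false.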

5. What about privacy and PII?

A privacy-first implementation:

Hashes MSISDNs in logs.
Stores minimal data.
Keeps carrier and routing metadata, not raw content.
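
On the hashing point: a plain, unkeyed hash of a phone number is easy to reverse by brute force because the number space is small, so a keyed hash is the safer default. The sketch below uses HMAC-SHA-256 from Node's built-in `crypto` module; `pseudonymizeMsisdn` and the `secret` parameter are hypothetical, and the key should come from a secret manager, not from source code.

```typescript
import { createHmac } from "node:crypto";

// Keyed hash: the same number always maps to the same token, so logs stay
// joinable, but the token cannot be recomputed without the key.
function pseudonymizeMsisdn(msisdn: string, secret: string): string {
  return createHmac("sha256", secret).update(msisdn).digest("hex");
}
```

Rotating the key breaks joins across the rotation boundary, so the rotation schedule should be aligned with log retention.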

6. Can we layer carrier matching on top of an existing CPaaS?

Sometimes. You can build a meta-routing layer on top if the CPaaS exposes:

- Per-carrier controls.
- Per-sender statistics.

But the strongest forms are with owned infrastructure (SIMs, private grids).

Conclusion: from best-effort to engineered routing

Most SMS programs live on best-effort routing:

Provider chooses cheap/available routes.
You get 1–2 metrics.
You hope for the best.

Carrier-matching routing turns SMS into an engineered system:

Deterministic per-carrier path choices.
Isolated grids and pools.
Health-aware rotation and fallbacks.
Rich observability for incidents.

If you care about:

Hitting and sustaining 99.4%+ deliverability.
Surviving promo spikes and high-risk use cases.
Giving your SRE/infra team levers they can understand and trust.

…then implementing, or choosing a gateway with, a serious carrier-matching architecture isn’t a nice‑to‑have; it’s the only sane long‑term strategy.