Developer Deep-Dive: Carrier-Matching SMS Routing Architecture & Algorithms

Introduction: behind the scenes of carrier-matching SMS routing

Most SMS providers describe routing as a black box:

“We’ll use the best route based on quality and price.”

If you’re an engineer or architect responsible for uptime, that’s not enough. You need to know:

  • Which path a message took.
  • Which sender it used.
  • Why the system chose that combination.
  • How it will behave under failures and spikes.

Carrier-matching routing is one of the main reasons some gateways consistently report 99.4%+ deliverability in tough verticals while others stagnate in the low-to-mid 90s. This post takes a deep dive into the architecture:

  • The intelligence layer (carrier and line‑type detection).
  • The routing decision engine.
  • Pool/grid selection and rotation logic.
  • Fallback strategies.
  • Observability and debugging.

This isn’t vendor marketing. It’s the practical architecture we’ve seen work across millions of messages per day.


Section 1: What “carrier matching” actually means

At a high level, carrier matching is:

For each destination number, choose a sender and route that best match the destination’s carrier and context.

Instead of:

  • Sending everything via cheapest generic routes.
  • Mixing all carriers and use cases on the same senders.

Carrier matching aims to:

  • Use Verizon-vetted senders for Verizon subscribers.
  • Use AT&T-vetted senders for AT&T subscribers.
  • Keep per-carrier reputation isolated and predictable.

Benefits we see in practice:

  • A 3–12 percentage-point deliverability uplift on specific carriers compared to generic routing.
  • Lower variance in performance over time.
  • Cleaner root-cause analysis when issues arise (one carrier, one grid).

Section 2: High-level architecture

A carrier-matching SMS gateway typically has these components:

  1. Ingress API

    • Receives message requests (/messages).
    • Validates payload, auth, and basic schema.
  2. Normalization & enrichment

    • Normalizes phone numbers (E.164).
    • Enriches with:
      • Carrier info.
      • Country/region.
      • Line type (mobile, VoIP, landline when available).
      • Risk signals.
  3. Routing decision engine

    • Given the enriched context + app metadata:
      • Chooses route profile (e.g., OTP_US, Promo_EU).
      • Selects a pool/grid.
      • Picks a sender within that grid.
    • Applies per-carrier and per-grid rules.
  4. Queueing & dispatch

    • Places messages into per-route queues.
    • Applies:
      • Rate limiting.
      • Burst control.
      • Retry strategies.
  5. Delivery receipts & feedback

    • Ingests DLRs (delivery receipts).
    • Updates:
      • Pool/grid health.
      • Sender reputation metrics.
    • Feeds back into routing decisions.
  6. Observability plane

    • Metrics, logs, traces.
    • Queryable by:
      • Carrier.
      • Pool/grid.
      • Sender.
      • Campaign.
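
As an illustration of the normalization step (component 2), here is a minimal sketch. The type and function names are hypothetical, and the normalizer is deliberately naive: it only handles input that already carries an international prefix. A production gateway would more likely rely on a full phone-number parsing library.

```typescript
// Hypothetical shape of a destination after normalization & enrichment.
type EnrichedDestination = {
  msisdn: string; // E.164, e.g. "+14155550123"
  carrierId?: string; // filled in later by the carrier intelligence layer
  countryCode?: string;
  lineType?: "mobile" | "fixed" | "voip";
  riskFlags: string[];
};

// E.164: "+" followed by at most 15 digits, the first of which is non-zero.
const E164_PATTERN = /^\+[1-9]\d{1,14}$/;

// Naive normalizer: strips common separators and converts a leading "00"
// international prefix to "+". It does NOT resolve national formats.
// Returns null when the result is not valid E.164.
function normalizeToE164(raw: string): string | null {
  let cleaned = raw.trim().replace(/[\s\-().]/g, "");
  if (cleaned.startsWith("00")) {
    cleaned = "+" + cleaned.slice(2);
  }
  return E164_PATTERN.test(cleaned) ? cleaned : null;
}

// Builds the (not yet enriched) context object, or null for invalid input.
function toEnrichedDestination(raw: string): EnrichedDestination | null {
  const msisdn = normalizeToE164(raw);
  if (msisdn === null) return null;
  return { msisdn, riskFlags: [] };
}
```

Rejecting invalid numbers at this stage keeps malformed input out of the lookup cache and the routing engine.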

Section 3: The carrier intelligence layer

Before you can match carriers, you need to know them.

Inputs

  • Phone number in E.164 format.
  • Optionally:
    • Country code from app context.
    • Known user metadata (e.g., previously resolved carrier).

Sources

  • HLR / carrier lookup providers.
  • Phone number intelligence APIs.
  • Internal caches (recently resolved numbers).

Outputs

For a given destination:

  • carrier_id: e.g., verizon_us, att_us, tmobile_us, o2_uk, etc.
  • country_code: US, GB, DE, etc.
  • line_type: mobile, fixed, voip (when available).
  • risk_flags: ported recently, suspicious ranges, etc. (optional).

Caching strategy

  • Warm caches on:
    • High-traffic destinations.
    • Destinations of known high-volume senders (e.g., OTP-heavy customers).
  • Respect:
    • Lookup provider rate limits.
    • Data freshness constraints.

Example (pseudo-code):

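// Assumed cache TTL; tune it to your lookup provider's freshness guarantees.
const CACHE_TTL_MS = 24 * 60 * 60 * 1000;

type CarrierInfo = {
  carrierId: string;
  country: string;
  lineType?: string;
  lastUpdated: number; // epoch milliseconds
};

// Stand-ins for your own cache client and HLR/lookup client.
declare const carrierCache: {
  get(msisdn: string): Promise<CarrierInfo | undefined>;
  set(msisdn: string, info: CarrierInfo): Promise<void>;
};
declare function externalLookup(
  msisdn: string,
): Promise<{ carrierId: string; countryCode: string; lineType?: string }>;

async function resolveCarrier(msisdn: string): Promise<CarrierInfo> {
  // Serve from cache while the entry is still fresh.
  const cached = await carrierCache.get(msisdn);
  if (cached && Date.now() - cached.lastUpdated < CACHE_TTL_MS) {
    return cached;
  }

  // Cache miss or stale entry: ask the lookup provider.
  const lookup = await externalLookup(msisdn);

  const info: CarrierInfo = {
    carrierId: lookup.carrierId,
    country: lookup.countryCode,
    lineType: lookup.lineType,
    lastUpdated: Date.now(),
  };

  // Await the write so a cache failure surfaces here instead of as an
  // unhandled rejection.
  await carrierCache.set(msisdn, info);
  return info;
}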
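
One way to respect lookup provider rate limits during bursts is to collapse concurrent lookups for the same number into a single request. The sketch below shows that "single-flight" idea; `singleFlight` is a hypothetical helper, not part of any particular library, and it assumes the fetcher returns a promise.

```typescript
// Hypothetical single-flight wrapper: concurrent calls for the same key
// share one in-flight promise instead of each hitting the provider.
function singleFlight<T>(
  fetcher: (key: string) => Promise<T>,
): (key: string) => Promise<T> {
  const inFlight = new Map<string, Promise<T>>();

  return (key: string): Promise<T> => {
    const pending = inFlight.get(key);
    if (pending) return pending;

    const promise = fetcher(key).finally(() => {
      // Clear the entry whether the lookup succeeded or failed,
      // so a failed lookup can be retried later.
      inFlight.delete(key);
    });
    inFlight.set(key, promise);
    return promise;
  };
}
```

Wrapping the lookup call this way (for example, `const dedupedLookup = singleFlight(externalLookup)`) means an OTP burst to one number triggers one provider request rather than many.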

Section 4: Routing decision engine design

Given:

  • Enriched message context (CarrierInfo, country, app metadata).
  • Message type (OTP, transactional, marketing).
  • Customer/account configuration.

The routing engine must pick:

  1. Route profile

    • E.g., OTP_US, PROMO_US, ALERT_EU, etc.
    • Encapsulates:
      • Preferred carriers/routes.
      • Throughput caps.
      • Allowed sender types.
  2. Pool / grid

    • E.g., US_OTP_VERIZON_GRID_A, US_PROMO_ATT_GRID_B.
    • Each grid:
      • Represents a collection of SIMs/numbers.
      • Has per-carrier capacity and health metrics.
  3. Sender within the grid

    • Based on:
      • Rotation strategy.
      • Health.
      • Local constraints.

Decision flow (simplified)

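function routeMessage(msg: Message, carrier: CarrierInfo): RouteDecision {
  // 1. Pick the route profile from message type, country, and risk.
  const profile = selectProfile(msg, carrier);

  // 2. Narrow to grids that match the profile and the destination carrier.
  const candidateGrids = findEligibleGrids(profile, carrier);
  if (candidateGrids.length === 0) {
    // No grid matches: surface this instead of dereferencing undefined below.
    throw new Error(`No eligible grid for ${profile.id} / ${carrier.carrierId}`);
  }

  // 3. Choose the healthiest grid, then a sender inside it.
  const grid = selectBestGrid(candidateGrids);
  const sender = pickSenderFromGrid(grid, msg);

  return { profileId: profile.id, gridId: grid.id, senderId: sender.id };
}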

Where:

  • selectProfile uses:

    • Message type (OTP vs promo).
    • Country/region.
    • Risk/vertical (e.g., crypto/adult).
  • findEligibleGrids filters by:

    • Country.
    • Carrier compatibility.
    • Health thresholds.
  • selectBestGrid might:

    • Prefer grids with:
      • Healthy error/complaint rates.
      • Available capacity.
    • Avoid:
      • Grids approaching thresholds.
  • pickSenderFromGrid:

    • Implements rotation:
      • Round-robin.
      • Weighted.
      • Health-aware (avoid bad senders).

Section 5: Pool/grids and rotation logic

Grids as the main unit of isolation

A grid might be defined by:

  • Region: US.
  • Carrier mix: Verizon-only, AT&T-only, multi-carrier.
  • Use-case: OTP, PROMO, ALERT.
  • Priority level.

Each grid tracks:

  • Total sends.
  • Delivered/failed breakdown.
  • Hard-fail codes.
  • Complaint/unsub rates.

Rotation strategies

Simplest:

  • Round-robin across active senders.

Better:

  • Health-aware rotation:
    • Skip senders with:
      • High recent error rates.
      • High complaint ratios.
    • Weight in favor of:
      • Newer, healthy senders.

Example:

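function pickSenderFromGrid(grid: GridState, _msg?: Message): Sender {
  // Only senders above the health floor are eligible for rotation.
  const healthy = grid.senders.filter((s) => s.healthScore > MIN_HEALTH);
  if (healthy.length === 0) {
    throw new Error(`Grid ${grid.id} has no healthy senders`);
  }
  // Weighted random choice: higher-weight senders are picked more often.
  const weighted = buildWeightedList(healthy, (s) => s.weight);
  return randomChoice(weighted);
}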

With:

  • healthScore based on:
    • Recent delivered rate.
    • Hard-fail rate.
    • Complaint rate.
    • Time since last verification/warmup.

Retirement and cooldown

Implement rules like:

  • Retire or cool a sender when:
    • Hard-fail rate exceeds 1–2% over the last N messages.
    • Complaint rate exceeds 0.3–0.5% in a period.
    • Carrier-specific error codes spike.

Retired senders:

  • Are taken out of active rotation.
  • May be re‑tested later with small, safe traffic.
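
These rules can be expressed as a small state transition function. In the sketch below, the thresholds use the upper ends of the ranges above, and the mapping (complaints lead to retirement, hard fails to cooldown) plus the minimum sample size are assumptions chosen for illustration.

```typescript
type SenderState = "active" | "cooling" | "retired";

// Counters over a recent window (the "last N messages" or a time period).
type WindowStats = {
  sent: number;
  hardFails: number;
  complaints: number;
};

const HARD_FAIL_LIMIT = 0.02; // assumed: 2%
const COMPLAINT_LIMIT = 0.005; // assumed: 0.5%
const MIN_SAMPLE = 200; // assumed: don't judge a sender on too few messages

function nextSenderState(current: SenderState, w: WindowStats): SenderState {
  // Leaving "retired" happens through a separate re-test path, not here.
  if (current === "retired") return "retired";
  // Too little data in the window: keep the current state.
  if (w.sent < MIN_SAMPLE) return current;

  const hardFailRate = w.hardFails / w.sent;
  const complaintRate = w.complaints / w.sent;

  if (complaintRate > COMPLAINT_LIMIT) return "retired";
  if (hardFailRate > HARD_FAIL_LIMIT) return "cooling";
  return "active";
}
```

A cooling sender returns to "active" only once a sufficiently large window of re-test traffic comes back clean.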

Section 6: Fallbacks, retries, and failure modes

Even with good routing, things break:

  • Carriers have outages.
  • Specific routes become degraded.
  • A grid gets temporarily burned.

Fallback principles

  1. Prefer in‑family fallbacks first

    • Move from Grid A → Grid B within the same profile/country.
    • Keep OTP on OTP grids, promos on promo grids.
  2. Avoid instant, repeated retries on the same broken path

    • Back off aggressively:
      • Exponential or linear backoff.
    • Mark failing routes/grids as degraded.
  3. Graceful degradation

    • For OTP:
      • Try alternative sender within same carrier family.
      • Consider slower, but more reliable fallback.
    • For promos:
      • Reduce send rate.
      • Defer sends if carriers are clearly unstable.
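
Principle 2 can be sketched with two small pieces: a backoff delay calculation and a time-boxed "degraded" marker. The sketch uses exponential backoff with full jitter (a random delay below an exponentially growing ceiling); the default base, cap, and the `DegradedRoutes` class are assumptions for illustration.

```typescript
// Exponential backoff with full jitter: the delay is drawn uniformly from
// [0, min(capMs, baseMs * 2^attempt)). `rand` is injectable for tests.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(rand() * ceiling);
}

// Hypothetical registry of degraded routes/grids with an expiry time,
// so a degraded path is avoided for a while and then retried.
class DegradedRoutes {
  private until = new Map<string, number>();

  mark(routeId: string, nowMs: number, holdMs: number): void {
    this.until.set(routeId, nowMs + holdMs);
  }

  isDegraded(routeId: string, nowMs: number): boolean {
    const expiry = this.until.get(routeId);
    return expiry !== undefined && nowMs < expiry;
  }
}
```

Jitter spreads retries out so that many queued messages do not hit a recovering route at the same instant.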

Example retry logic (simplified)

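async function dispatchMessage(decision: RouteDecision, msg: Message) {
  try {
    const result = await sendToCarrier(decision, msg);

    updateMetrics(decision, result);
    return result;
  } catch (err) {
    // Flag the path first so concurrent sends stop choosing it.
    markRouteAsDegraded(decision, err);

    // Look for an in-family alternative; give up if none exists.
    const fallbackDecision = findFallback(decision, msg);
    if (!fallbackDecision) throw err;

    try {
      const fallbackResult = await sendToCarrier(fallbackDecision, msg);
      updateMetrics(fallbackDecision, fallbackResult);
      return fallbackResult;
    } catch (fallbackErr) {
      // The fallback failed too: mark it and let the caller decide
      // whether to requeue with backoff.
      markRouteAsDegraded(fallbackDecision, fallbackErr);
      throw fallbackErr;
    }
  }
}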
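
For the fallback lookup itself, an "in-family first" search might be sketched as follows. The `GridRef` shape and the explicit `grids` and `carrierId` parameters are assumptions made to keep the example self-contained, so the signature differs from the `findFallback(decision, msg)` call above.

```typescript
// Hypothetical minimal view of a grid for fallback purposes.
type GridRef = {
  id: string;
  profileId: string; // e.g. "OTP_US"
  country: string;
  carrierIds: string[];
  degraded: boolean;
};

type Decision = { profileId: string; gridId: string };

// Returns another grid in the same profile and country that serves the
// destination carrier and is not degraded, or null when none exists.
// It never crosses profiles, so OTP stays on OTP grids.
function findInFamilyFallback(
  failed: Decision,
  carrierId: string,
  grids: GridRef[],
): Decision | null {
  const failedGrid = grids.find((g) => g.id === failed.gridId);
  if (!failedGrid) return null;

  const candidate = grids.find(
    (g) =>
      g.id !== failed.gridId &&
      g.profileId === failed.profileId &&
      g.country === failedGrid.country &&
      g.carrierIds.includes(carrierId) &&
      !g.degraded,
  );
  return candidate
    ? { profileId: failed.profileId, gridId: candidate.id }
    : null;
}
```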

Section 7: Observability, logging, and debugging

Carrier-matching routing is only as good as its observability.

You want to be able to ask:

  • “Show me all messages to Verizon in the last 24h routed via Grid A vs Grid B.”
  • “Which senders in Grid C have the highest hard-fail rate?”
  • “What changed around the time deliverability dropped?”

Minimal log fields

For each message:

  • message_id
  • timestamp
  • customer_id (or project/app ID)
  • destination_msisdn (hashed/pseudonymized if needed)
  • carrier_id
  • country_code
  • profile_id
  • grid_id
  • sender_id
  • route_id / upstream ID
  • status (queued, sent, delivered, failed, unknown)
  • error_code (if any)
  • dlr_timestamp
  • latency_ms
  • campaign_id or flow_id (if applicable)
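
Expressed as a type, the field list above might look like the sketch below. Which fields are optional, the ISO 8601 timestamp format, and the `applyDlr` helper are assumptions; the field names follow the list.

```typescript
type MessageStatus = "queued" | "sent" | "delivered" | "failed" | "unknown";

type MessageLogRecord = {
  message_id: string;
  timestamp: string; // ISO 8601 submit time (assumed format)
  customer_id: string;
  destination_msisdn: string; // hashed/pseudonymized if needed
  carrier_id: string;
  country_code: string;
  profile_id: string;
  grid_id: string;
  sender_id: string;
  route_id: string;
  status: MessageStatus;
  error_code?: string;
  dlr_timestamp?: string;
  latency_ms?: number;
  campaign_id?: string;
};

// Hypothetical helper: folds a delivery receipt into an existing record.
// Latency is measured from submit time to DLR arrival.
function applyDlr(
  record: MessageLogRecord,
  dlr: { status: MessageStatus; errorCode?: string; receivedAt: Date },
): MessageLogRecord {
  return {
    ...record,
    status: dlr.status,
    dlr_timestamp: dlr.receivedAt.toISOString(),
    latency_ms: dlr.receivedAt.getTime() - Date.parse(record.timestamp),
    ...(dlr.errorCode !== undefined ? { error_code: dlr.errorCode } : {}),
  };
}
```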

Dashboards

  • Carrier × grid heatmaps:
    • Delivered rate.
    • Hard-fail rate.
  • Sender leaderboards:
    • Sorted by health and throughput.
  • Anomaly detection:
    • Alerts when:
      • Carrier X, Grid Y delivered rate falls below threshold.
      • Error codes spike.
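
The carrier × grid aggregation behind such heatmaps and alerts can be sketched in a few lines. The event shape, the threshold, and the minimum sample size are assumptions; a real deployment would more likely compute this in a metrics store than in application code.

```typescript
type DeliveryEvent = { carrierId: string; gridId: string; delivered: boolean };

type CellStats = {
  carrierId: string;
  gridId: string;
  total: number;
  delivered: number;
};

// Groups events into one cell per (carrier, grid) pair.
function aggregateByCarrierAndGrid(events: DeliveryEvent[]): CellStats[] {
  const cells = new Map<string, CellStats>();
  for (const e of events) {
    const key = `${e.carrierId}|${e.gridId}`;
    let cell = cells.get(key);
    if (!cell) {
      cell = { carrierId: e.carrierId, gridId: e.gridId, total: 0, delivered: 0 };
      cells.set(key, cell);
    }
    cell.total += 1;
    if (e.delivered) cell.delivered += 1;
  }
  return Array.from(cells.values());
}

// Returns cells whose delivered rate is below `minRate`, ignoring cells
// with fewer than `minSample` messages to avoid noisy alerts.
function findAlerts(
  cells: CellStats[],
  minRate: number,
  minSample: number,
): CellStats[] {
  return cells.filter(
    (c) => c.total >= minSample && c.delivered / c.total < minRate,
  );
}
```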

Example incident workflow

  1. Alert: “Verizon deliverability dropped >3 points on Grid US_PROMO_A.”
  2. Use logs:
    • Check error codes and volumes.
    • Compare with other grids.
  3. Mitigate:
    • Temporarily move Verizon promo traffic to Grid US_PROMO_B.
    • Reduce send rate.
  4. Investigate:
    • Recent content/template changes.
    • Changes to routing configuration.

FAQ: Carrier-matching routing for developers

1. Do we need HLR/lookup for every message?

Not necessarily.

Options:

  • Cache results for a reasonable TTL.
  • Resolve ahead of time for high-traffic users.
  • Batch lookups when seeding grids.

2. How do we handle number portability?

Ported numbers can change carriers. Good practices:

  • Periodically refresh carrier info for:
    • High-frequency destinations.
    • Numbers with repeated failures.
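
A refresh policy along these lines could be sketched as below. The TTLs, the failure threshold, and the definition of "high-frequency" are placeholder assumptions to tune against your own traffic and lookup costs.

```typescript
// Hypothetical cached entry with the counters the policy needs.
type CachedCarrier = {
  carrierId: string;
  lastUpdated: number; // epoch milliseconds
  recentFailures: number; // consecutive failed sends to this number
  sendsLast30d: number;
};

const DAY_MS = 24 * 60 * 60 * 1000;
const BASE_TTL_MS = 30 * DAY_MS; // assumed default freshness window
const HOT_TTL_MS = 7 * DAY_MS; // assumed shorter window for busy destinations
const FAILURE_REFRESH_THRESHOLD = 3; // assumed
const HIGH_FREQUENCY_SENDS = 20; // assumed

// Decides whether the carrier for a number should be looked up again.
function shouldRefreshCarrier(entry: CachedCarrier, nowMs: number): boolean {
  // Repeated failures may mean the number was ported: refresh immediately.
  if (entry.recentFailures >= FAILURE_REFRESH_THRESHOLD) return true;

  // High-frequency destinations get a shorter TTL.
  const ttl =
    entry.sendsLast30d >= HIGH_FREQUENCY_SENDS ? HOT_TTL_MS : BASE_TTL_MS;
  return nowMs - entry.lastUpdated >= ttl;
}
```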

3. Is carrier matching only relevant in the US?

No. It’s especially useful:

  • Wherever multiple operators behave differently.
  • Where sender IDs and templates are operator-specific (many EU/APAC markets).

4. How does this interact with A2P 10DLC and registered campaigns?

Carrier matching:

  • Routes each message through the sender and campaign registered for that traffic.
  • Helps you stay within throughput and content expectations per campaign.

5. What about privacy and PII?

A privacy-first implementation:

  • Hashes MSISDNs in logs.
  • Stores minimal data.
  • Keeps carrier and routing metadata, not raw content.
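
One common way to pseudonymize MSISDNs for logs is a keyed hash, since phone numbers are few enough that a plain unkeyed hash can be reversed by enumeration. The sketch below assumes a Node.js runtime and uses its built-in `createHmac`; the function name and key handling are illustrative only.

```typescript
import { createHmac } from "node:crypto";

// Returns a stable, keyed pseudonym for a phone number (64 hex characters).
// The same number and key always map to the same value, so logs can still
// be joined per destination without storing the raw MSISDN.
// `secretKey` should come from a secret store, not from source code.
function pseudonymizeMsisdn(msisdn: string, secretKey: string): string {
  return createHmac("sha256", secretKey).update(msisdn).digest("hex");
}
```

Rotating the key breaks linkability with older logs, which can be a feature or a drawback depending on retention needs.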

6. Can we layer carrier matching on top of an existing CPaaS?

Sometimes. If the CPaaS exposes:

  • Per-carrier controls.
  • Per-sender statistics.

…then you can build a meta-routing layer on top.

The strongest implementations, though, run on owned infrastructure (SIMs, private grids).


Conclusion: from best-effort to engineered routing

Most SMS programs live on best-effort routing:

  • Provider chooses cheap/available routes.
  • You get 1–2 metrics.
  • You hope for the best.

Carrier-matching routing turns SMS into an engineered system:

  • Deterministic per-carrier path choices.
  • Isolated grids and pools.
  • Health-aware rotation and fallbacks.
  • Rich observability for incidents.

If you care about:

  • Hitting and sustaining 99.4%+ deliverability.
  • Surviving promo spikes and high-risk use cases.
  • Giving your SRE/infra team levers they can understand and trust.

…then implementing or choosing a gateway with a serious carrier-matching architecture isn’t a nice-to-have; it’s the only sane long-term strategy.

Dach SMS Lab
