
Consistency Models in Distributed Databases

Selecting the right consistency guarantee is the most consequential architectural decision in any partitioned system — get it wrong and you face either silent data corruption or unnecessary write latency. This page sits within Database Partitioning Fundamentals & Architecture and focuses on runtime enforcement: quorum routing, conflict resolution, and drift detection. For cost and scaling implications of replica counts, see Scaling Limits and Cost Tradeoffs.

Problem Framing

Consider a payment service partitioned across three regional nodes, each holding a shard of the accounts table. A debit write lands on the primary. Two milliseconds later, a mobile client reads the balance from a secondary replica that has not yet applied the write. The user sees the old balance and retries, and the retry can charge the account a second time. This is not a theoretical edge case: it is the default outcome unless you explicitly configure quorum reads or sticky routing.

The inverse problem is equally real: enforcing linearizability on every read path in a telemetry service turns a sub-millisecond fan-out into a 50 ms synchronous checkpoint, burning throughput for a workload that tolerates seconds of staleness. Both failure modes trace back to a single root cause: nobody mapped workload characteristics to consistency guarantees before the routing rules were written.

Architecture Overview

The diagram below shows how a three-tier consistency stack connects application routing, the quorum layer, and the replica set. Reads and writes enter the router, which consults a quorum policy store before dispatching to nodes. The anti-entropy scanner runs asynchronously, comparing Merkle tree hashes across replicas and publishing divergence metrics.

[Figure: Consistency Routing Architecture. Three-tier diagram showing application routing through a quorum policy router to a primary and two replica nodes, with an asynchronous anti-entropy scanner comparing Merkle hashes across replicas and publishing lag metrics. Labels: Application (reads + writes); Quorum Router (W + R > N policy, session affinity, timeout budget); Primary (WAL, sync commits); Replica 1 and Replica 2 (async replication); Anti-Entropy Scanner (Merkle hash diff, lag metrics → Prometheus).]

Step 1 — Map Workloads to Consistency Tiers

Before writing a single routing rule, classify every service by latency budget and staleness tolerance. Over-provisioning strong consistency costs write throughput; under-provisioning it corrupts state silently.

| Service Tier | Guarantee | Latency Budget | Routing Strategy | Fallback on Partition |
|---|---|---|---|---|
| Financial ledger | Linearizable | < 50 ms | Write primary, quorum read | Reject writes, return 503 |
| Inventory / booking | Read-your-writes | < 100 ms | Sticky routing to last-write node | Serve stale + warn client |
| User sessions | Read-your-writes | < 100 ms | Session affinity token | Serve stale with short TTL |
| Search / feed ranking | Monotonic reads | < 200 ms | Any replica with lag < threshold | Serve older snapshot |
| Telemetry / logs | Eventual | < 20 ms | Async fan-out to nearest replica | Queue and retry on drop |

Encode this mapping in a routing policy file (see Step 2 below), not in application code, so it can be changed without a deploy.
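One way to keep the tier mapping out of application code is to hold it as data and look it up at dispatch time. The sketch below is a minimal illustration, not a prescribed format: the `TierPolicy` class, the `POLICIES` table, the tier keys, the fallback labels, and the `policy_for` helper are all hypothetical names, and a real deployment would load the table from a file or config service rather than define it inline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    guarantee: str          # consistency guarantee from the tier table
    latency_budget_ms: int  # upper bound on request latency
    on_partition: str       # fallback behaviour during a partition


# Hypothetical policy table mirroring the tier mapping above.
POLICIES = {
    "financial": TierPolicy("linearizable", 50, "reject"),
    "inventory": TierPolicy("read-your-writes", 100, "serve-stale-warn"),
    "sessions": TierPolicy("read-your-writes", 100, "serve-stale-short-ttl"),
    "search": TierPolicy("monotonic-reads", 200, "serve-older-snapshot"),
    "telemetry": TierPolicy("eventual", 20, "queue-and-retry"),
}


def policy_for(tier: str) -> TierPolicy:
    """Look up a tier's policy; unknown tiers fall back to the strictest one."""
    return POLICIES.get(tier, POLICIES["financial"])


print(policy_for("telemetry").guarantee)
```

Defaulting unknown tiers to the strictest policy is a design assumption made here: a misnamed service then pays extra latency instead of silently reading stale data.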

Step 2 — Configure Quorum Routing

The W + R > N inequality guarantees that every read quorum shares at least one node with the most recent write quorum, so the read can return the latest committed value. With N=3 replicas, setting W=2 and R=2 satisfies the inequality and tolerates one replica failure on either path.

// Dynamic quorum configuration for partitioned writes
const replicaCount = 3;

const quorumConfig = {
  consistency: "quorum",
  w: Math.floor(replicaCount / 2) + 1,  // 2 of 3: majority write quorum
  r: Math.floor(replicaCount / 2) + 1,  // 2 of 3: majority read quorum
  timeoutMs: 500,
  onPartition: "reject",               // halt writes; do not accept unacknowledged data
};

// W=2, R=2, N=3  →  W+R=4 > N=3  →  read-after-write is guaranteed
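The arithmetic behind the overlap condition is small enough to check directly. The Python sketch below is illustrative only; the helper names `majority`, `quorums_overlap`, and `tolerated_failures` are assumptions introduced here, not part of any library.

```python
def majority(n: int) -> int:
    """Smallest node count that is more than half of n."""
    return n // 2 + 1


def quorums_overlap(w: int, r: int, n: int) -> bool:
    """True when every write quorum shares a node with every read quorum."""
    return w + r > n


def tolerated_failures(quorum: int, n: int) -> int:
    """Replicas that can be down while the quorum is still reachable."""
    return n - quorum


n = 3
w = r = majority(n)
print(w, r, quorums_overlap(w, r, n), tolerated_failures(w, n))  # 2 2 True 1
print(quorums_overlap(1, 2, n))  # False: W=1, R=2 no longer overlaps
```

The second call shows why lowering either quorum is a trade-off: with W=1 and R=2 a read quorum can miss the only node that holds the latest write.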

Inject the consistency policy at connection initialisation time, before ORM sessions are created, so legacy code paths cannot silently bypass quorum checks.

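import psycopg2

# Hypothetical tier-to-DSN mapping (placeholder values, an assumption for
# illustration); load the real DSNs from configuration or a secrets store.
DSN_MAP = {
    "financial": "dbname=ledger host=primary.example.internal",
    "inventory": "dbname=inventory host=primary.example.internal",
    "sessions": "dbname=sessions host=primary.example.internal",
    "telemetry": "dbname=telemetry host=replica.example.internal",
}

def get_conn(tier: str) -> psycopg2.extensions.connection:
    """Return a connection with the correct consistency options for the service tier."""
    dsn = DSN_MAP[tier]  # raises KeyError for an unmapped tier
    conn = psycopg2.connect(dsn)
    if tier == "financial":
        # Strictest isolation, explicit transactions
        conn.set_session(isolation_level="SERIALIZABLE", autocommit=False)
    elif tier in ("sessions", "inventory"):
        conn.set_session(isolation_level="READ COMMITTED", autocommit=False)
    else:
        # Low-stakes tiers: each statement commits on its own
        conn.set_session(isolation_level="READ COMMITTED", autocommit=True)
    return conn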

Step 3 — Implement Session Affinity for Read-Your-Writes

Strong quorum reads are expensive; for most workloads the cheaper guarantee is that this user sees their own writes. Track the last-write node per session and pin reads to it until lag drops below your staleness threshold.

import redis

AFFINITY_TTL = 30  # seconds; release sticky route once replication catches up
STALENESS_THRESHOLD_MS = 200

# Assumes a Redis instance on localhost; adjust host and port for your deployment.
redis_client = redis.Redis(host="localhost", port=6379)

def get_read_node(session_id: str, lag_map: dict[str, float]) -> str:
    """Return the sticky read node for a session, or the lowest-lag replica."""
    sticky = redis_client.get(f"affinity:{session_id}")
    if sticky:
        node = sticky.decode()
        # A node missing from lag_map is treated as over budget (999 ms)
        if lag_map.get(node, 999) < STALENESS_THRESHOLD_MS:
            return node
    # Fall through: pick the lowest-lag replica (it may still exceed the budget)
    return min(lag_map, key=lag_map.get)

def record_write(session_id: str, node: str) -> None:
    """Pin this session to the node that accepted the write."""
    redis_client.setex(f"affinity:{session_id}", AFFINITY_TTL, node)

This avoids quorum overhead on read-heavy workloads while still preventing the most common consistency bug: a user refreshing a page immediately after a write and seeing stale data.

Step 4 — Deploy Anti-Entropy and Conflict Resolution

Eventually consistent partitions diverge silently without a background repair process. The two production-ready strategies are:

Last-Write-Wins (LWW): Simplest to implement. Each row carries a write timestamp; the higher timestamp wins on conflict. Requires hybrid logical clocks (HLC) or vector clocks rather than wall-clock time to handle NTP drift — bare NOW() timestamps produce incorrect orderings under concurrent writes from different nodes.

CRDTs (Conflict-Free Replicated Data Types): Required for counter, set, or collaborative workloads where all concurrent updates must survive. G-counters, PN-counters, and OR-sets resolve without central coordination. Use CRDTs when LWW would silently discard valid increments (for example, a view counter updated concurrently on two replicas).
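To make the contrast with LWW concrete, here is a minimal G-counter sketch. It is an illustration under simple assumptions (in-memory state, string replica IDs); the class and method names are chosen for this example and do not come from a specific library.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-counter only grows")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Taking the max per slot makes merge commutative and idempotent.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


a, b = GCounter("replica-1"), GCounter("replica-2")
a.increment(3)  # three views recorded on replica 1
b.increment(2)  # two concurrent views recorded on replica 2
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5
```

Both replicas converge on 5. Under LWW the same scenario would keep either 3 or 2, depending on which timestamp won.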

The reconciliation script below implements an LWW upsert over rows modified since the last sync. Restricting it to the key ranges flagged by a Merkle diff narrows the scan footprint further.

#!/usr/bin/env python3
"""Cron-based reconciliation with LWW conflict resolution and structured logging."""
import logging
from datetime import datetime

import psycopg2
from psycopg2 import sql

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def reconcile_partition(
    partition_id: int,
    source_conn: psycopg2.extensions.connection,
    target_conn: psycopg2.extensions.connection,
    last_sync: datetime,
) -> int:
    """
    Pull rows modified after last_sync from source and apply LWW upsert on target.
    Returns the number of rows actually written on the target.
    Assumes target_conn is not in autocommit mode (savepoints need a transaction).
    """
    # Compose the table name as a quoted identifier rather than passing it
    # as a query parameter.
    table = sql.Identifier(f"partition_{partition_id}")

    cursor_src = source_conn.cursor()
    cursor_src.execute(
        sql.SQL("SELECT id, updated_at, payload FROM {} WHERE updated_at > %s").format(table),
        (last_sync,),
    )
    rows = cursor_src.fetchall()

    upsert = sql.SQL(
        """
        INSERT INTO {table} (id, updated_at, payload)
        VALUES (%s, %s, %s)
        ON CONFLICT (id) DO UPDATE
          SET payload     = EXCLUDED.payload,
              updated_at  = EXCLUDED.updated_at
          WHERE EXCLUDED.updated_at > {table}.updated_at
        """
    ).format(table=table)

    cursor_tgt = target_conn.cursor()
    reconciled = 0
    for row_id, updated_at, payload in rows:
        # A savepoint keeps one failed row from aborting the whole transaction.
        cursor_tgt.execute("SAVEPOINT lww_row")
        try:
            cursor_tgt.execute(upsert, (row_id, updated_at, payload))
            # rowcount is 0 when the target already holds a newer version.
            reconciled += cursor_tgt.rowcount
            cursor_tgt.execute("RELEASE SAVEPOINT lww_row")
        except psycopg2.Error as exc:
            cursor_tgt.execute("ROLLBACK TO SAVEPOINT lww_row")
            logging.error(
                "Upsert failed on partition %s row %s: %s", partition_id, row_id, exc
            )
    target_conn.commit()
    logging.info("Partition %s: reconciled %d rows since %s", partition_id, reconciled, last_sync)
    return reconciled

For async replication patterns that extend this approach, see Implementing Eventual Consistency in Partitioned PostgreSQL.
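The hash comparison that an anti-entropy scanner performs can be sketched in a few lines. The example below is a simplified stand-in for a full Merkle tree: it uses only two levels (bucket hashes under a single root), keeps rows in plain dicts, and buckets by `id % buckets`. The function names and the bucketing rule are assumptions made for illustration.

```python
import hashlib


def bucket_hashes(rows: dict[int, str], buckets: int) -> list[str]:
    """Hash rows into fixed buckets by id; each bucket is a leaf of the tree."""
    digests = [hashlib.sha256() for _ in range(buckets)]
    for row_id in sorted(rows):  # sorted so both replicas hash in the same order
        digests[row_id % buckets].update(f"{row_id}:{rows[row_id]};".encode())
    return [d.hexdigest() for d in digests]


def root_hash(leaves: list[str]) -> str:
    """Combine the leaf hashes into a single root hash."""
    return hashlib.sha256("".join(leaves).encode()).hexdigest()


def divergent_buckets(a: dict[int, str], b: dict[int, str], buckets: int = 4) -> list[int]:
    """Return the bucket indexes whose contents differ between two replicas."""
    leaves_a, leaves_b = bucket_hashes(a, buckets), bucket_hashes(b, buckets)
    if root_hash(leaves_a) == root_hash(leaves_b):
        return []  # roots match: the replicas agree and no scan is needed
    return [i for i in range(buckets) if leaves_a[i] != leaves_b[i]]


primary = {1: "a", 2: "b", 3: "c", 4: "d"}
replica = {1: "a", 2: "b", 3: "stale", 4: "d"}
print(divergent_buckets(primary, replica))  # [3]
```

Only the rows in the divergent buckets need to be fetched and reconciled, which is what keeps the repair scan small when replicas are mostly in sync.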

Configuration Reference

| Parameter | Recommended Value | Rationale |
|---|---|---|
| W (write quorum) | ⌊N/2⌋ + 1 | Majority write ensures durability under single-node failure |
| R (read quorum) | ⌊N/2⌋ + 1 | W + R > N guarantees at least one node in common |
| quorum_timeout_ms | 500 ms | Prevents indefinite blocking; triggers circuit breaker above this |
| session_affinity_ttl | 30 s | Releases sticky pin once replication lag normalises |
| staleness_threshold_ms | 200 ms | Release pin threshold; tune per SLA tier |
| anti_entropy_interval | 60 s | Merkle scan cadence; lower for tighter eventual windows |
| hlc_max_drift_ms | 250 ms | Reject LWW writes if HLC offset exceeds this value |
| replication_lag_alert_ms | 500 ms | Alert threshold before staleness SLA is breached |

Operational Contrast

Unlike hash routing algorithms, which focus on even data distribution across shards, consistency routing is concerned with when a value is visible across nodes — distribution and visibility are orthogonal problems. A perfectly balanced hash-partitioned cluster still requires an explicit quorum policy to prevent stale reads.

Compared with proxy routing architectures, which centralise routing decisions at the infrastructure layer, application-level quorum enforcement keeps consistency logic co-located with the business tier. Proxy-level consistency is operationally simpler but adds a single-point-of-failure for the routing path; application-level consistency survives proxy restarts but requires careful library discipline to prevent bypasses.

Failure Modes

Quorum Timeout Cascade

Root cause: A slow replica holds up the quorum response, causing the router to wait until quorum_timeout_ms. Under load, timeout threads accumulate, exhausting the connection pool.

Detection:

rate(db_quorum_timeout_total[2m]) > 0.05

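Mitigation: Reduce W temporarily to 1 and set onPartition: "degraded" to serve reads from any available node. With N=3 and R=2 this gives W + R = 3, which no longer exceeds N, so reads may be stale until W is restored. Simultaneously isolate the slow replica and drain its connections.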

Clock Skew Corrupting LWW Order

Root cause: NTP drift between nodes causes a later physical write to carry an earlier timestamp, silently overwriting valid data in an LWW scheme.

Detection:

max(db_hlc_drift_ms) by (node) > 250

Mitigation: Replace NOW() timestamps with hybrid logical clock values. Reject any write whose HLC offset exceeds hlc_max_drift_ms. Keep the NTP offset between nodes below 10 ms and verify it with chronyc tracking.
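A hybrid logical clock is compact enough to sketch in full. The version below follows the usual send/receive update rules and adds a drift check; it is a minimal single-threaded illustration, and the class name, method names, and injectable `now_ms` parameter are assumptions made for this example.

```python
import time
from typing import Callable, Tuple

Timestamp = Tuple[int, int]  # (logical wall time in ms, counter)


class HybridLogicalClock:
    def __init__(
        self,
        now_ms: Callable[[], int] = lambda: int(time.time() * 1000),
        max_drift_ms: int = 250,
    ) -> None:
        self._now_ms = now_ms
        self._max_drift_ms = max_drift_ms
        self._l = 0  # highest wall time seen so far
        self._c = 0  # counter that orders events sharing the same _l

    def tick(self) -> Timestamp:
        """Timestamp a local write."""
        pt = self._now_ms()
        if pt > self._l:
            self._l, self._c = pt, 0
        else:
            self._c += 1  # wall clock stalled or stepped back
        return (self._l, self._c)

    def observe(self, remote: Timestamp) -> Timestamp:
        """Merge a timestamp received from another node."""
        pt = self._now_ms()
        r_l, r_c = remote
        if r_l - pt > self._max_drift_ms:
            raise ValueError(f"remote clock is {r_l - pt} ms ahead; rejecting")
        new_l = max(self._l, r_l, pt)
        if new_l == self._l and new_l == r_l:
            self._c = max(self._c, r_c) + 1
        elif new_l == self._l:
            self._c += 1
        elif new_l == r_l:
            self._c = r_c + 1
        else:
            self._c = 0
        self._l = new_l
        return (self._l, self._c)


wall = [1000]  # fake wall clock so the example is deterministic
clock = HybridLogicalClock(now_ms=lambda: wall[0])
first = clock.tick()               # (1000, 0)
wall[0] = 990                      # wall clock steps backwards by 10 ms
second = clock.tick()              # (1000, 1): still ordered after first
third = clock.observe((1100, 5))   # (1100, 6): adopts the remote time
print(first, second, third)
```

Timestamps compare as plain tuples, so the LWW rule becomes "the larger tuple wins". Two nodes can still produce identical tuples, so a real deployment would append a node ID as a final tie-breaker.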

Replica Divergence Without Anti-Entropy

Root cause: A replica falls behind during a network partition and never catches up because the reconciliation job was not running or was silently failing.

Detection:

histogram_quantile(0.95,
  sum(rate(db_replication_lag_seconds_bucket[5m])) by (le, partition)
) > 0.5

Mitigation: Restart the reconciliation cron, verify it has write access to the target, and force a full Merkle resync by setting last_sync = epoch. Monitor row counts on both sides before returning the replica to the read pool.

Session Affinity Token Leak

Root cause: A sticky token either never expires (AFFINITY_TTL was not set) or disappears too early (Redis evicted the key under memory pressure while lag remained high). In the first case reads stay pinned to a node that may no longer be healthy; in the second, reads fall through to a lagging replica, violating read-your-writes.

Detection: Application-layer read-after-write test failing more than 0.1% of assertions.

Mitigation: Set maxmemory-policy allkeys-lru in Redis so that recently written affinity keys are the last candidates for eviction, and monitor the evicted_keys counter. Add a read-after-write assertion to your smoke test suite that fires on every deploy.
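A read-after-write assertion can be written against any pair of write and read callables. The sketch below is one possible shape, assuming the caller supplies functions that go through the real routing path; `read_after_write_ok` and `failure_rate` are hypothetical helper names, and the in-memory dict at the end only stands in for a real client.

```python
import uuid
from typing import Callable, Optional


def read_after_write_ok(
    write: Callable[[str, str], None],
    read: Callable[[str], Optional[str]],
) -> bool:
    """Write a unique marker and check that the very next read returns it."""
    key = f"smoke:{uuid.uuid4()}"
    marker = uuid.uuid4().hex
    write(key, marker)
    return read(key) == marker


def failure_rate(
    write: Callable[[str, str], None],
    read: Callable[[str], Optional[str]],
    attempts: int = 1000,
) -> float:
    """Fraction of attempts where the read did not see the preceding write."""
    failures = sum(not read_after_write_ok(write, read) for _ in range(attempts))
    return failures / attempts


# In-memory stand-in for a client; a real test would call the routed client.
store = {}
healthy_rate = failure_rate(store.__setitem__, store.get, attempts=100)
stale_rate = failure_rate(store.__setitem__, lambda key: None, attempts=100)
print(healthy_rate, stale_rate)  # 0.0 1.0
```

Comparing the returned rate against the 0.1% threshold from the detection rule turns the check into a pass/fail gate for a deploy.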

Common Mistakes

Applying strong consistency globally. Forces synchronous coordination across every partition on every write, increasing median write latency and triggering cascading quorum timeouts under load. Segment by service tier and apply linearizability only where the business genuinely requires it.

Using wall-clock timestamps for LWW conflict resolution. NTP drift between nodes produces incorrect orderings: a write that happened later can carry an earlier timestamp and be silently discarded. Use hybrid logical clocks, which combine physical time with a logical counter so that timestamps never run backwards on a node and respect causality between nodes, without a coordination round-trip.

Never testing the partition-failure path. Most teams configure quorum correctly but never validate that onPartition: "reject" actually fires. Inject a controlled network partition in staging every quarter and verify that the circuit breaker halts writes and surfaces the correct error code to the application.
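The assertion for the partition-failure path has a simple shape, even though the real test needs an actual network partition in staging. The sketch below simulates reachability with a dict; `QuorumUnavailable`, `quorum_write`, and the node names are hypothetical and exist only to show what the test should check.

```python
class QuorumUnavailable(Exception):
    """Raised when fewer than W replicas acknowledge a write."""


def quorum_write(replicas: dict[str, bool], w: int, on_partition: str = "reject") -> int:
    """
    Simulate a write. Each dict value says whether that node is reachable.
    Returns the number of acknowledgements received.
    """
    acks = sum(1 for reachable in replicas.values() if reachable)
    if acks < w and on_partition == "reject":
        raise QuorumUnavailable(f"{acks} acks received, {w} required")
    return acks


def test_reject_fires_under_partition() -> None:
    healthy = {"primary": True, "replica-1": True, "replica-2": True}
    partitioned = {"primary": True, "replica-1": False, "replica-2": False}
    assert quorum_write(healthy, w=2) == 3
    try:
        quorum_write(partitioned, w=2)
    except QuorumUnavailable:
        return  # expected: the write was refused
    raise AssertionError("write was accepted without a quorum")


test_reject_fires_under_partition()
print("partition path rejects writes")
```

In staging, the simulated dict would be replaced by a real fault injection, and the test would also check the error code surfaced to the application.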

Setting AFFINITY_TTL to zero or infinity. Zero means the pin never takes effect, so every read re-evaluates node selection and read-your-writes is lost. Infinity means stale sticky tokens survive replica failures, routing reads to an unreachable node indefinitely. A 20–60 s TTL covers the replication lag of healthy replicas while remaining short enough to self-heal after a failover.

FAQ

When should I choose eventual over strong consistency?

Choose eventual consistency when write availability and low latency outweigh immediate read accuracy — caching layers, event logging, telemetry, and user-preference services where brief staleness is acceptable. Choose strong (linearizable) consistency for financial state, inventory counters, or any data where two concurrent readers must never observe conflicting values at the same instant.

How do I enforce read-your-writes consistency without full quorum reads?

Use session affinity: record the node that accepted each write and route the subsequent read to that same node using a short-lived sticky token (30 s TTL in Redis). The read hits a single replica rather than requiring a quorum acknowledgement from multiple nodes. Release the pin once measured replication lag on that node drops below your staleness threshold. This costs one Redis round-trip per read but avoids the multi-node coordination overhead of a full quorum read.

What metrics indicate consistency degradation before user-facing errors appear?

Watch four signals in order of increasing severity: (1) replication lag p95 rising above your staleness SLA, (2) lag variance between replicas widening (a sign of divergence rather than uniform slowness), (3) quorum timeout rate climbing above 5% of operations, and (4) Merkle mismatch count from your anti-entropy scanner increasing between runs. The first two appear minutes before the third and fourth — catching them early gives you time to isolate the slow replica before clients observe stale reads.
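The first two signals can be derived from raw lag samples. The sketch below is an illustration under stated assumptions: samples arrive as lists of milliseconds per replica, p95 uses the nearest-rank method, and the gap between the highest and lowest per-replica p95 stands in for "lag variance". The function names and sample values are invented for the example.

```python
import math


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def lag_signals(lag_ms_by_replica: dict[str, list[float]], sla_ms: float) -> dict[str, object]:
    """Summarise per-replica p95 lag, SLA breaches, and spread across replicas."""
    per_replica = {name: p95(s) for name, s in lag_ms_by_replica.items()}
    return {
        "p95_ms": per_replica,
        "breaching": sorted(n for n, v in per_replica.items() if v > sla_ms),
        "spread_ms": max(per_replica.values()) - min(per_replica.values()),
    }


samples = {
    "replica-1": [10, 12, 11, 13, 15],
    "replica-2": [40, 300, 350, 420, 600],
}
signals = lag_signals(samples, sla_ms=200)
print(signals["breaching"], signals["spread_ms"])  # ['replica-2'] 585
```

A large spread with only one breaching replica points at divergence on that node, while a uniform rise across all replicas suggests general slowness.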
