Skip to main content

Sharding vs Partitioning: Core Concepts

Partitioning and sharding both segment data to manage scale, but they operate at fundamentally different scopes β€” and conflating them leads to costly architectural mistakes. This page, part of Database Partitioning Fundamentals & Architecture, draws the exact boundary between the two approaches so you can reason clearly about routing complexity, transaction guarantees, and maintenance workflows. See also Consistency Models in Distributed Databases and Scaling Limits and Cost Tradeoffs for the broader architectural context.

Problem Framing

Consider a events table that has grown to 4 TB on a single PostgreSQL node. Queries filtered by created_at still hit 60-second timeouts despite a B-tree index on the column, because the planner cannot eliminate billions of rows with a single index scan. Two paths exist:

  1. Declarative partitioning β€” split the table into child segments (one per month) inside the same database process. The planner enforces partition pruning and eliminates non-matching children at plan time. No application changes needed.
  2. Horizontal sharding β€” distribute rows across independent database instances (e.g. two separate PostgreSQL servers). Each instance owns a disjoint slice of the keyspace. The application or a proxy must resolve which instance owns each row before executing.

Both paths solve query performance, but the operational profiles diverge sharply. Choosing the wrong one β€” typically reaching for sharding when partitioning is sufficient β€” multiplies infrastructure cost and breaks ACID guarantees without a commensurate scaling benefit.

Architecture Overview

The diagram below maps the two data paths side-by-side: a single-instance partitioned table on the left and a sharded cluster on the right.

Partitioning vs Sharding Architecture Left side shows a single application connecting to one database, whose query planner routes to three partition child tables. Right side shows the same application connecting to a shard router, which distributes to three independent database instances. SINGLE-INSTANCE PARTITIONING MULTI-INSTANCE SHARDING Application no routing logic DB Engine query planner + pruning events_2023 child table events_2024 child table events_2025 child table shared WAL Β· shared connection pool Β· local ACID Application resolves shard key first Shard Router consistent-hash map Shard 0 own WAL + pool Shard 1 own WAL + pool Shard 2 own WAL + pool independent WALs Β· cross-shard 2PC needed Β· network latency

The key architectural difference: in partitioning, the query planner is the only routing actor, and it operates entirely within one process. In sharding, a separate routing layer (application code or a dedicated proxy) must run before any database connection is acquired.

Implementation Walkthrough

Step 1 β€” Set up declarative range partitioning (single instance)

PostgreSQL’s declarative partitioning requires no changes to the application connection string. The engine handles all routing internally.

-- Create the parent (partitioned) table
CREATE TABLE events (
  id          BIGINT GENERATED ALWAYS AS IDENTITY,
  created_at  TIMESTAMPTZ NOT NULL,
  tenant_id   INT NOT NULL,
  payload     JSONB
) PARTITION BY RANGE (created_at);

-- Attach annual child partitions
CREATE TABLE events_2024 PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

CREATE TABLE events_2025 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- Local index per child (covers all queries in range)
CREATE INDEX ON events_2024 (tenant_id, created_at);
CREATE INDEX ON events_2025 (tenant_id, created_at);

Verify that the planner uses partition pruning β€” it should show Seq Scan on events_2025 with no mention of events_2024 when the filter falls in 2025:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events
WHERE created_at >= '2025-06-01'
  AND tenant_id = 42;

Step 2 β€” Configure a shard router (multi-instance)

Sharding requires a router configuration that maps shard keys to node endpoints. The example below uses a YAML configuration for a consistent-hash router:

# shard_router.yaml
router:
  algorithm: consistent_hash
  virtual_nodes: 150        # higher = more even distribution; 100-200 is typical
  replication_factor: 2     # synchronous replicas per shard
  shards:
    - id: shard-0
      primary: "pg-shard-0.internal:5432"
      replica: "pg-shard-0-replica.internal:5432"
    - id: shard-1
      primary: "pg-shard-1.internal:5432"
      replica: "pg-shard-1-replica.internal:5432"
    - id: shard-2
      primary: "pg-shard-2.internal:5432"
      replica: "pg-shard-2-replica.internal:5432"
  shard_key_column: "tenant_id"
  fallback: "local_cache"   # serve stale reads if shard is temporarily unreachable

Application code must resolve the shard before acquiring a connection:

import hashlib
import yaml

with open("shard_router.yaml") as f:
    config = yaml.safe_load(f)["router"]

def get_shard(tenant_id: int) -> dict:
    """Map a tenant_id to its shard using consistent hashing."""
    key = hashlib.md5(str(tenant_id).encode()).hexdigest()
    ring_position = int(key, 16) % (len(config["shards"]) * config["virtual_nodes"])
    shard_index = ring_position // config["virtual_nodes"]
    return config["shards"][shard_index % len(config["shards"])]

# Usage: resolve shard before opening a connection
shard = get_shard(tenant_id=42)
conn = psycopg2.connect(host=shard["primary"], dbname="events")

Consistent hash routing distributes keys more evenly than modulo arithmetic and moves fewer rows when nodes are added or removed.

Step 3 β€” Add a new partition (partitioning path)

Attach a pre-built child table with no downtime in PostgreSQL 12+:

-- Build the child table independently (no lock on parent)
CREATE TABLE events_2026 (LIKE events INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
ALTER TABLE events_2026 ADD CONSTRAINT events_2026_check
  CHECK (created_at >= '2026-01-01' AND created_at < '2027-01-01');

CREATE INDEX ON events_2026 (tenant_id, created_at);

-- Attach β€” acquires only a brief ShareUpdateExclusiveLock
ALTER TABLE events ATTACH PARTITION events_2026
  FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');

The lock duration is proportional to the CHECK CONSTRAINT validation scan on the child, which is why building and indexing the child first is mandatory for zero-downtime operations.

Step 4 β€” Add a shard (sharding path)

Expanding a sharded cluster requires a dual-write migration. The approach below minimises risk by incrementally shifting traffic:

#!/usr/bin/env bash
# dual_write_migration.sh β€” add shard-3 to the cluster

# 1. Provision the new instance and replica
psql -h pg-shard-3.internal -c "CREATE DATABASE events;"

# 2. Start dual-write: the router sends writes to both old and new shards
#    for tenant IDs in the range being migrated
export DUAL_WRITE_RANGE_START=7500
export DUAL_WRITE_RANGE_END=9999
./router_control.sh enable_dual_write --range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END" \
    --target shard-3

# 3. Backfill historical rows for that tenant range
psql -h pg-shard-2.internal \
  -c "COPY (SELECT * FROM events WHERE tenant_id BETWEEN $DUAL_WRITE_RANGE_START AND $DUAL_WRITE_RANGE_END) TO STDOUT" \
  | psql -h pg-shard-3.internal -c "COPY events FROM STDIN"

# 4. Validate checksum parity before cutting over reads
./checksum_validator.sh --source shard-2 --target shard-3 \
    --tenant-range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END"

# 5. Shift read traffic; remove dual-write for migrated range
./router_control.sh cutover_reads --range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END" \
    --target shard-3
./router_control.sh disable_dual_write --range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END"

Configuration Reference

Parameter Partitioning default Sharding recommendation Rationale
virtual_nodes N/A 100–200 Higher values improve key distribution; diminishing returns above 300
replication_factor PostgreSQL streaming replication 2 (sync) per shard Synchronous replication prevents shard data loss on failover
Partition count 12–120 (monthly/weekly) 3–20 shards Partitions are cheap; shards multiply operational surface area
Shard key cardinality N/A >10 000 unique values Low cardinality (e.g. country code) causes hot shards
Connection pool size max_connections per instance Per-shard pool, 20–100 Each shard is an independent server; pool exhaustion on one does not block others
lock_timeout 5 s (partition attach) 30 s (migration operations) Shard migrations hold longer locks; aggressive timeouts cause partial migrations

Operational Contrast

Partitioning fits within the same operational model as a regular table. pg_dump, pg_restore, point-in-time recovery, and EXPLAIN ANALYZE all work without modification. The automated partition creation workflows available at the implementation layer handle routine housekeeping with a single scheduled job.

Sharding changes every operational workflow. Backups must be coordinated across all nodes to capture a consistent global snapshot. Restore procedures require resurrecting multiple independent instances and re-validating the shard map. Cross-shard aggregation queries β€” analytics that span the whole keyspace β€” must scatter requests to every shard and merge results in the application or a proxy routing layer, adding latency proportional to the slowest shard.

The operationally safe default is: start with partitioning; migrate to sharding only when a single node’s hardware ceiling cannot be raised further.

Failure Modes

Failure Root cause Detection Mitigation
Partition pruning disabled Partition key cast mismatch (e.g. implicit TEXT vs TIMESTAMPTZ) or function wrapper around the key in WHERE EXPLAIN shows sequential scan on parent with all children listed Rewrite predicates to match the column type exactly; never wrap the partition key in a function call
Hot shard under consistent hashing Low-cardinality shard key (e.g. country_code) concentrates traffic on one virtual-node range Per-shard CPU / QPS diverges >3Γ— from median in monitoring Switch to a composite shard key or add a synthetic suffix (e.g. `tenant_id
Cross-shard write partially applied Network partition between application and shard-N during a multi-shard write with no 2PC Inconsistent row counts between shards; orphaned rows with no matching parent row Implement idempotent writes with a correlation ID; use saga compensating transactions or XA 2PC
Partition attach deadlock Concurrent INSERT stream holds a RowExclusiveLock when ATTACH PARTITION waits for a brief ShareUpdateExclusiveLock pg_stat_activity shows ATTACH PARTITION waiting; pg_locks shows contention Pre-validate the CHECK CONSTRAINT on the child before attaching; pause the write stream during the attach window if writes are heavy

Common Mistakes

  • Using sharding as a direct replacement for partitioning. Sharding requires application-level routing code and introduces distributed-transaction complexity. Native partitioning is transparent to the application layer and should be exhausted first. The use-case mapping guide makes this decision explicit.
  • Choosing a shard key with low or monotonically increasing values. Auto-increment IDs as shard keys route all new writes to the same shard; UUIDs distribute writes evenly but make range scans expensive. Select a key that balances write distribution against your read access pattern.
  • Ignoring cross-segment query latency budgets. Scatter-gather execution fans out to every shard and waits for the slowest response. Without explicit timeout budgets and circuit breakers, one degraded shard stalls every global aggregation.
  • Attaching new partitions without pre-validating the check constraint. Skipping the CHECK CONSTRAINT pre-build causes ATTACH PARTITION to scan the entire child table while holding a lock, blocking writes for the duration of the scan on large tables.

FAQ

Can I migrate from partitioning to sharding without downtime?

Yes, but it requires a multi-stage dual-write process: (1) provision the shard instances and write the routing layer, (2) enable dual-writes to both old and new stores, (3) backfill historical data into the shards, (4) validate checksum parity, and (5) cut reads over via feature flag. The migration window is proportional to data volume; at terabyte scale, plan for days of dual-write before the final cutover.

How do I handle transactions that span two shards?

Two options: two-phase commit (2PC, via PREPARE TRANSACTION / COMMIT PREPARED) for strong consistency, or the Saga pattern for eventual consistency. 2PC adds a round-trip per write and requires the coordinator to survive crashes; Sagas require compensating transactions for each step. Both options increase write latency, so the preferred engineering approach is to design the shard key so most writes touch only one shard.

When does single-node partitioning stop being sufficient?

When the node’s physical ceiling β€” CPU, memory, or storage IOPS β€” is reached and cannot be raised further with vertical scaling. In practice, a well-tuned PostgreSQL or MySQL instance with NVMe storage and 96+ cores can handle several terabytes of hot data and tens of thousands of QPS before hitting a genuine hardware limit. Evaluate scaling limits and cost tradeoffs carefully before adding the operational complexity of sharding.

Articles in This Section