Sharding vs Partitioning: Core Concepts
Partitioning and sharding both segment data to manage scale, but they operate at fundamentally different scopes β and conflating them leads to costly architectural mistakes. This page, part of Database Partitioning Fundamentals & Architecture, draws the exact boundary between the two approaches so you can reason clearly about routing complexity, transaction guarantees, and maintenance workflows. See also Consistency Models in Distributed Databases and Scaling Limits and Cost Tradeoffs for the broader architectural context.
Problem Framing
Consider a events table that has grown to 4 TB on a single PostgreSQL node. Queries filtered by created_at still hit 60-second timeouts despite a B-tree index on the column, because the planner cannot eliminate billions of rows with a single index scan. Two paths exist:
- Declarative partitioning β split the table into child segments (one per month) inside the same database process. The planner enforces partition pruning and eliminates non-matching children at plan time. No application changes needed.
- Horizontal sharding β distribute rows across independent database instances (e.g. two separate PostgreSQL servers). Each instance owns a disjoint slice of the keyspace. The application or a proxy must resolve which instance owns each row before executing.
Both paths solve query performance, but the operational profiles diverge sharply. Choosing the wrong one β typically reaching for sharding when partitioning is sufficient β multiplies infrastructure cost and breaks ACID guarantees without a commensurate scaling benefit.
Architecture Overview
The diagram below maps the two data paths side-by-side: a single-instance partitioned table on the left and a sharded cluster on the right.
The key architectural difference: in partitioning, the query planner is the only routing actor, and it operates entirely within one process. In sharding, a separate routing layer (application code or a dedicated proxy) must run before any database connection is acquired.
Implementation Walkthrough
Step 1 β Set up declarative range partitioning (single instance)
PostgreSQLβs declarative partitioning requires no changes to the application connection string. The engine handles all routing internally.
-- Create the parent (partitioned) table
CREATE TABLE events (
id BIGINT GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ NOT NULL,
tenant_id INT NOT NULL,
payload JSONB
) PARTITION BY RANGE (created_at);
-- Attach annual child partitions
CREATE TABLE events_2024 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE events_2025 PARTITION OF events
FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- Local index per child (covers all queries in range)
CREATE INDEX ON events_2024 (tenant_id, created_at);
CREATE INDEX ON events_2025 (tenant_id, created_at);
Verify that the planner uses partition pruning β it should show Seq Scan on events_2025 with no mention of events_2024 when the filter falls in 2025:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events
WHERE created_at >= '2025-06-01'
AND tenant_id = 42;
Step 2 β Configure a shard router (multi-instance)
Sharding requires a router configuration that maps shard keys to node endpoints. The example below uses a YAML configuration for a consistent-hash router:
# shard_router.yaml
router:
algorithm: consistent_hash
virtual_nodes: 150 # higher = more even distribution; 100-200 is typical
replication_factor: 2 # synchronous replicas per shard
shards:
- id: shard-0
primary: "pg-shard-0.internal:5432"
replica: "pg-shard-0-replica.internal:5432"
- id: shard-1
primary: "pg-shard-1.internal:5432"
replica: "pg-shard-1-replica.internal:5432"
- id: shard-2
primary: "pg-shard-2.internal:5432"
replica: "pg-shard-2-replica.internal:5432"
shard_key_column: "tenant_id"
fallback: "local_cache" # serve stale reads if shard is temporarily unreachable
Application code must resolve the shard before acquiring a connection:
import hashlib
import yaml
with open("shard_router.yaml") as f:
config = yaml.safe_load(f)["router"]
def get_shard(tenant_id: int) -> dict:
"""Map a tenant_id to its shard using consistent hashing."""
key = hashlib.md5(str(tenant_id).encode()).hexdigest()
ring_position = int(key, 16) % (len(config["shards"]) * config["virtual_nodes"])
shard_index = ring_position // config["virtual_nodes"]
return config["shards"][shard_index % len(config["shards"])]
# Usage: resolve shard before opening a connection
shard = get_shard(tenant_id=42)
conn = psycopg2.connect(host=shard["primary"], dbname="events")
Consistent hash routing distributes keys more evenly than modulo arithmetic and moves fewer rows when nodes are added or removed.
Step 3 β Add a new partition (partitioning path)
Attach a pre-built child table with no downtime in PostgreSQL 12+:
-- Build the child table independently (no lock on parent)
CREATE TABLE events_2026 (LIKE events INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
ALTER TABLE events_2026 ADD CONSTRAINT events_2026_check
CHECK (created_at >= '2026-01-01' AND created_at < '2027-01-01');
CREATE INDEX ON events_2026 (tenant_id, created_at);
-- Attach β acquires only a brief ShareUpdateExclusiveLock
ALTER TABLE events ATTACH PARTITION events_2026
FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');
The lock duration is proportional to the CHECK CONSTRAINT validation scan on the child, which is why building and indexing the child first is mandatory for zero-downtime operations.
Step 4 β Add a shard (sharding path)
Expanding a sharded cluster requires a dual-write migration. The approach below minimises risk by incrementally shifting traffic:
#!/usr/bin/env bash
# dual_write_migration.sh β add shard-3 to the cluster
# 1. Provision the new instance and replica
psql -h pg-shard-3.internal -c "CREATE DATABASE events;"
# 2. Start dual-write: the router sends writes to both old and new shards
# for tenant IDs in the range being migrated
export DUAL_WRITE_RANGE_START=7500
export DUAL_WRITE_RANGE_END=9999
./router_control.sh enable_dual_write --range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END" \
--target shard-3
# 3. Backfill historical rows for that tenant range
psql -h pg-shard-2.internal \
-c "COPY (SELECT * FROM events WHERE tenant_id BETWEEN $DUAL_WRITE_RANGE_START AND $DUAL_WRITE_RANGE_END) TO STDOUT" \
| psql -h pg-shard-3.internal -c "COPY events FROM STDIN"
# 4. Validate checksum parity before cutting over reads
./checksum_validator.sh --source shard-2 --target shard-3 \
--tenant-range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END"
# 5. Shift read traffic; remove dual-write for migrated range
./router_control.sh cutover_reads --range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END" \
--target shard-3
./router_control.sh disable_dual_write --range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END"
Configuration Reference
| Parameter | Partitioning default | Sharding recommendation | Rationale |
|---|---|---|---|
virtual_nodes |
N/A | 100β200 | Higher values improve key distribution; diminishing returns above 300 |
replication_factor |
PostgreSQL streaming replication | 2 (sync) per shard | Synchronous replication prevents shard data loss on failover |
| Partition count | 12β120 (monthly/weekly) | 3β20 shards | Partitions are cheap; shards multiply operational surface area |
| Shard key cardinality | N/A | >10 000 unique values | Low cardinality (e.g. country code) causes hot shards |
| Connection pool size | max_connections per instance |
Per-shard pool, 20β100 | Each shard is an independent server; pool exhaustion on one does not block others |
lock_timeout |
5 s (partition attach) | 30 s (migration operations) | Shard migrations hold longer locks; aggressive timeouts cause partial migrations |
Operational Contrast
Partitioning fits within the same operational model as a regular table. pg_dump, pg_restore, point-in-time recovery, and EXPLAIN ANALYZE all work without modification. The automated partition creation workflows available at the implementation layer handle routine housekeeping with a single scheduled job.
Sharding changes every operational workflow. Backups must be coordinated across all nodes to capture a consistent global snapshot. Restore procedures require resurrecting multiple independent instances and re-validating the shard map. Cross-shard aggregation queries β analytics that span the whole keyspace β must scatter requests to every shard and merge results in the application or a proxy routing layer, adding latency proportional to the slowest shard.
The operationally safe default is: start with partitioning; migrate to sharding only when a single nodeβs hardware ceiling cannot be raised further.
Failure Modes
| Failure | Root cause | Detection | Mitigation |
|---|---|---|---|
| Partition pruning disabled | Partition key cast mismatch (e.g. implicit TEXT vs TIMESTAMPTZ) or function wrapper around the key in WHERE |
EXPLAIN shows sequential scan on parent with all children listed |
Rewrite predicates to match the column type exactly; never wrap the partition key in a function call |
| Hot shard under consistent hashing | Low-cardinality shard key (e.g. country_code) concentrates traffic on one virtual-node range |
Per-shard CPU / QPS diverges >3Γ from median in monitoring | Switch to a composite shard key or add a synthetic suffix (e.g. `tenant_id |
| Cross-shard write partially applied | Network partition between application and shard-N during a multi-shard write with no 2PC | Inconsistent row counts between shards; orphaned rows with no matching parent row | Implement idempotent writes with a correlation ID; use saga compensating transactions or XA 2PC |
| Partition attach deadlock | Concurrent INSERT stream holds a RowExclusiveLock when ATTACH PARTITION waits for a brief ShareUpdateExclusiveLock |
pg_stat_activity shows ATTACH PARTITION waiting; pg_locks shows contention |
Pre-validate the CHECK CONSTRAINT on the child before attaching; pause the write stream during the attach window if writes are heavy |
Common Mistakes
- Using sharding as a direct replacement for partitioning. Sharding requires application-level routing code and introduces distributed-transaction complexity. Native partitioning is transparent to the application layer and should be exhausted first. The use-case mapping guide makes this decision explicit.
- Choosing a shard key with low or monotonically increasing values. Auto-increment IDs as shard keys route all new writes to the same shard; UUIDs distribute writes evenly but make range scans expensive. Select a key that balances write distribution against your read access pattern.
- Ignoring cross-segment query latency budgets. Scatter-gather execution fans out to every shard and waits for the slowest response. Without explicit timeout budgets and circuit breakers, one degraded shard stalls every global aggregation.
- Attaching new partitions without pre-validating the check constraint. Skipping the
CHECK CONSTRAINTpre-build causesATTACH PARTITIONto scan the entire child table while holding a lock, blocking writes for the duration of the scan on large tables.
FAQ
Can I migrate from partitioning to sharding without downtime?
Yes, but it requires a multi-stage dual-write process: (1) provision the shard instances and write the routing layer, (2) enable dual-writes to both old and new stores, (3) backfill historical data into the shards, (4) validate checksum parity, and (5) cut reads over via feature flag. The migration window is proportional to data volume; at terabyte scale, plan for days of dual-write before the final cutover.
How do I handle transactions that span two shards?
Two options: two-phase commit (2PC, via PREPARE TRANSACTION / COMMIT PREPARED) for strong consistency, or the Saga pattern for eventual consistency. 2PC adds a round-trip per write and requires the coordinator to survive crashes; Sagas require compensating transactions for each step. Both options increase write latency, so the preferred engineering approach is to design the shard key so most writes touch only one shard.
When does single-node partitioning stop being sufficient?
When the nodeβs physical ceiling β CPU, memory, or storage IOPS β is reached and cannot be raised further with vertical scaling. In practice, a well-tuned PostgreSQL or MySQL instance with NVMe storage and 96+ cores can handle several terabytes of hot data and tens of thousands of QPS before hitting a genuine hardware limit. Evaluate scaling limits and cost tradeoffs carefully before adding the operational complexity of sharding.
Related
- Database Partitioning Fundamentals & Architecture β parent section covering the full decision landscape
- Consistency Models in Distributed Databases β how ACID vs eventual consistency maps to shard topologies
- Scaling Limits and Cost Tradeoffs β capacity planning before choosing between partitioning and sharding
- How to Choose Between Sharding and Partitioning for High-Traffic Apps β decision guide for specific workload profiles
- Hash Routing Algorithms β consistent hashing implementation detail for sharded architectures