Can I convert a partitioned table to a sharded architecture without downtime?

Yes, but it requires a dual-write migration phase, data synchronization between old and new schemas, and traffic cutover via feature flags to prevent data loss. The new shard must reach checksum parity before reads are switched over.

When does single-node partitioning become insufficient?

When a single node's CPU, memory, or storage I/O ceiling is reached despite vertical scaling and query/index optimization. At that point, distributing data across independent nodes is the only remaining lever.

Sharding vs Partitioning: Core Concepts

Q: How do I handle transactions that span two shards?

Use two-phase commit (2PC) for strong consistency or the Saga pattern for eventual consistency. Both add latency, so design writes to be idempotent and retry-safe from the outset.

Partitioning and sharding both segment data to manage scale, but they operate at fundamentally different scopes — and conflating them leads to costly architectural mistakes. This page, part of Database Partitioning Fundamentals & Architecture, draws the exact boundary between the two approaches so you can reason clearly about routing complexity, transaction guarantees, and maintenance workflows. See also Consistency Models in Distributed Databases and Scaling Limits and Cost Tradeoffs for the broader architectural context.

Problem Framing

Consider a events table that has grown to 4 TB on a single PostgreSQL node. Queries filtered by created_at still hit 60-second timeouts despite a B-tree index on the column, because the planner cannot eliminate billions of rows with a single index scan. Two paths exist:

Declarative partitioning — split the table into child segments (one per month) inside the same database process. The planner enforces partition pruning and eliminates non-matching children at plan time. No application changes needed.
Horizontal sharding — distribute rows across independent database instances (e.g. two separate PostgreSQL servers). Each instance owns a disjoint slice of the keyspace. The application or a proxy must resolve which instance owns each row before executing.

Both paths solve query performance, but the operational profiles diverge sharply. Choosing the wrong one — typically reaching for sharding when partitioning is sufficient — multiplies infrastructure cost and breaks ACID guarantees without a commensurate scaling benefit.

Architecture Overview

The diagram below maps the two data paths side-by-side: a single-instance partitioned table on the left and a sharded cluster on the right.

The key architectural difference: in partitioning, the query planner is the only routing actor, and it operates entirely within one process. In sharding, a separate routing layer (application code or a dedicated proxy) must run before any database connection is acquired.

Implementation Walkthrough

Step 1 — Set up declarative range partitioning (single instance)

PostgreSQL’s declarative partitioning requires no changes to the application connection string. The engine handles all routing internally.

-- Create the parent (partitioned) table
CREATE TABLE events (
  id          BIGINT GENERATED ALWAYS AS IDENTITY,
  created_at  TIMESTAMPTZ NOT NULL,
  tenant_id   INT NOT NULL,
  payload     JSONB
) PARTITION BY RANGE (created_at);

-- Attach annual child partitions
CREATE TABLE events_2024 PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

CREATE TABLE events_2025 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- Local index per child (covers all queries in range)
CREATE INDEX ON events_2024 (tenant_id, created_at);
CREATE INDEX ON events_2025 (tenant_id, created_at);

Verify that the planner uses partition pruning — it should show Seq Scan on events_2025 with no mention of events_2024 when the filter falls in 2025:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events
WHERE created_at >= '2025-06-01'
  AND tenant_id = 42;

Step 2 — Configure a shard router (multi-instance)

Sharding requires a router configuration that maps shard keys to node endpoints. The example below uses a YAML configuration for a consistent-hash router:

# shard_router.yaml
router:
  algorithm: consistent_hash
  virtual_nodes: 150        # higher = more even distribution; 100-200 is typical
  replication_factor: 2     # synchronous replicas per shard
  shards:
    - id: shard-0
      primary: "pg-shard-0.internal:5432"
      replica: "pg-shard-0-replica.internal:5432"
    - id: shard-1
      primary: "pg-shard-1.internal:5432"
      replica: "pg-shard-1-replica.internal:5432"
    - id: shard-2
      primary: "pg-shard-2.internal:5432"
      replica: "pg-shard-2-replica.internal:5432"
  shard_key_column: "tenant_id"
  fallback: "local_cache"   # serve stale reads if shard is temporarily unreachable

Application code must resolve the shard before acquiring a connection:

import hashlib
import yaml

with open("shard_router.yaml") as f:
    config = yaml.safe_load(f)["router"]

def get_shard(tenant_id: int) -> dict:
    """Map a tenant_id to its shard using consistent hashing."""
    key = hashlib.md5(str(tenant_id).encode()).hexdigest()
    ring_position = int(key, 16) % (len(config["shards"]) * config["virtual_nodes"])
    shard_index = ring_position // config["virtual_nodes"]
    return config["shards"][shard_index % len(config["shards"])]

# Usage: resolve shard before opening a connection
shard = get_shard(tenant_id=42)
conn = psycopg2.connect(host=shard["primary"], dbname="events")

Consistent hash routing distributes keys more evenly than modulo arithmetic and moves fewer rows when nodes are added or removed.

Step 3 — Add a new partition (partitioning path)

Attach a pre-built child table with no downtime in PostgreSQL 12+:

-- Build the child table independently (no lock on parent)
CREATE TABLE events_2026 (LIKE events INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
ALTER TABLE events_2026 ADD CONSTRAINT events_2026_check
  CHECK (created_at >= '2026-01-01' AND created_at < '2027-01-01');

CREATE INDEX ON events_2026 (tenant_id, created_at);

-- Attach — acquires only a brief ShareUpdateExclusiveLock
ALTER TABLE events ATTACH PARTITION events_2026
  FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');

The lock duration is proportional to the CHECK CONSTRAINT validation scan on the child, which is why building and indexing the child first is mandatory for zero-downtime operations.

Step 4 — Add a shard (sharding path)

Expanding a sharded cluster requires a dual-write migration. The approach below minimises risk by incrementally shifting traffic:

#!/usr/bin/env bash
# dual_write_migration.sh — add shard-3 to the cluster

# 1. Provision the new instance and replica
psql -h pg-shard-3.internal -c "CREATE DATABASE events;"

# 2. Start dual-write: the router sends writes to both old and new shards
#    for tenant IDs in the range being migrated
export DUAL_WRITE_RANGE_START=7500
export DUAL_WRITE_RANGE_END=9999
./router_control.sh enable_dual_write --range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END" \
    --target shard-3

# 3. Backfill historical rows for that tenant range
psql -h pg-shard-2.internal \
  -c "COPY (SELECT * FROM events WHERE tenant_id BETWEEN $DUAL_WRITE_RANGE_START AND $DUAL_WRITE_RANGE_END) TO STDOUT" \
  | psql -h pg-shard-3.internal -c "COPY events FROM STDIN"

# 4. Validate checksum parity before cutting over reads
./checksum_validator.sh --source shard-2 --target shard-3 \
    --tenant-range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END"

# 5. Shift read traffic; remove dual-write for migrated range
./router_control.sh cutover_reads --range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END" \
    --target shard-3
./router_control.sh disable_dual_write --range "$DUAL_WRITE_RANGE_START-$DUAL_WRITE_RANGE_END"

Configuration Reference

Parameter	Partitioning default	Sharding recommendation	Rationale
`virtual_nodes`	N/A	100–200	Higher values improve key distribution; diminishing returns above 300
`replication_factor`	PostgreSQL streaming replication	2 (sync) per shard	Synchronous replication prevents shard data loss on failover
Partition count	12–120 (monthly/weekly)	3–20 shards	Partitions are cheap; shards multiply operational surface area
Shard key cardinality	N/A	>10 000 unique values	Low cardinality (e.g. country code) causes hot shards
Connection pool size	`max_connections` per instance	Per-shard pool, 20–100	Each shard is an independent server; pool exhaustion on one does not block others
`lock_timeout`	5 s (partition attach)	30 s (migration operations)	Shard migrations hold longer locks; aggressive timeouts cause partial migrations

Operational Contrast

Partitioning fits within the same operational model as a regular table. pg_dump, pg_restore, point-in-time recovery, and EXPLAIN ANALYZE all work without modification. The automated partition creation workflows available at the implementation layer handle routine housekeeping with a single scheduled job.

Sharding changes every operational workflow. Backups must be coordinated across all nodes to capture a consistent global snapshot. Restore procedures require resurrecting multiple independent instances and re-validating the shard map. Cross-shard aggregation queries — analytics that span the whole keyspace — must scatter requests to every shard and merge results in the application or a proxy routing layer, adding latency proportional to the slowest shard.

The operationally safe default is: start with partitioning; migrate to sharding only when a single node’s hardware ceiling cannot be raised further.

Failure Modes

Failure	Root cause	Detection	Mitigation
Partition pruning disabled	Partition key cast mismatch (e.g. implicit `TEXT` vs `TIMESTAMPTZ`) or function wrapper around the key in `WHERE`	`EXPLAIN` shows sequential scan on parent with all children listed	Rewrite predicates to match the column type exactly; never wrap the partition key in a function call
Hot shard under consistent hashing	Low-cardinality shard key (e.g. `country_code`) concentrates traffic on one virtual-node range	Per-shard CPU / QPS diverges >3× from median in monitoring	Switch to a composite shard key or add a synthetic suffix (e.g. `tenant_id
Cross-shard write partially applied	Network partition between application and shard-N during a multi-shard write with no 2PC	Inconsistent row counts between shards; orphaned rows with no matching parent row	Implement idempotent writes with a correlation ID; use saga compensating transactions or XA 2PC
Partition attach deadlock	Concurrent `INSERT` stream holds a `RowExclusiveLock` when `ATTACH PARTITION` waits for a brief `ShareUpdateExclusiveLock`	`pg_stat_activity` shows `ATTACH PARTITION` waiting; `pg_locks` shows contention	Pre-validate the `CHECK CONSTRAINT` on the child before attaching; pause the write stream during the attach window if writes are heavy

Common Mistakes

Using sharding as a direct replacement for partitioning. Sharding requires application-level routing code and introduces distributed-transaction complexity. Native partitioning is transparent to the application layer and should be exhausted first. The use-case mapping guide makes this decision explicit.
Choosing a shard key with low or monotonically increasing values. Auto-increment IDs as shard keys route all new writes to the same shard; UUIDs distribute writes evenly but make range scans expensive. Select a key that balances write distribution against your read access pattern.
Ignoring cross-segment query latency budgets. Scatter-gather execution fans out to every shard and waits for the slowest response. Without explicit timeout budgets and circuit breakers, one degraded shard stalls every global aggregation.
Attaching new partitions without pre-validating the check constraint. Skipping the CHECK CONSTRAINT pre-build causes ATTACH PARTITION to scan the entire child table while holding a lock, blocking writes for the duration of the scan on large tables.

FAQ

Can I migrate from partitioning to sharding without downtime?

Yes, but it requires a multi-stage dual-write process: (1) provision the shard instances and write the routing layer, (2) enable dual-writes to both old and new stores, (3) backfill historical data into the shards, (4) validate checksum parity, and (5) cut reads over via feature flag. The migration window is proportional to data volume; at terabyte scale, plan for days of dual-write before the final cutover.

How do I handle transactions that span two shards?

Two options: two-phase commit (2PC, via PREPARE TRANSACTION / COMMIT PREPARED) for strong consistency, or the Saga pattern for eventual consistency. 2PC adds a round-trip per write and requires the coordinator to survive crashes; Sagas require compensating transactions for each step. Both options increase write latency, so the preferred engineering approach is to design the shard key so most writes touch only one shard.

When does single-node partitioning stop being sufficient?

When the node’s physical ceiling — CPU, memory, or storage IOPS — is reached and cannot be raised further with vertical scaling. In practice, a well-tuned PostgreSQL or MySQL instance with NVMe storage and 96+ cores can handle several terabytes of hot data and tens of thousands of QPS before hitting a genuine hardware limit. Evaluate scaling limits and cost tradeoffs carefully before adding the operational complexity of sharding.

Database Partitioning Fundamentals & Architecture — parent section covering the full decision landscape
Consistency Models in Distributed Databases — how ACID vs eventual consistency maps to shard topologies
Scaling Limits and Cost Tradeoffs — capacity planning before choosing between partitioning and sharding
How to Choose Between Sharding and Partitioning for High-Traffic Apps — decision guide for specific workload profiles
Hash Routing Algorithms — consistent hashing implementation detail for sharded architectures

Articles in This Section