How to Choose Between Sharding and Partitioning for High-Traffic Apps
This guide explains how to decide between single-node declarative partitioning and distributed sharding for high-traffic workloads, as part of Database Partitioning Fundamentals & Architecture. The decision directly impacts P95 latency, incident blast radius, and your ability to scale without downtime.
Prerequisites
Before working through the steps below, confirm the following:
Step 1: Analyze Query Patterns and Data Access Locality
Determine whether your workload can be satisfied by a single-node partition key or whether queries inherently cross node boundaries.
-- Extract the top 20 queries by total execution time (PostgreSQL)
SELECT
query,
calls,
total_exec_time / 1000 AS total_sec,
mean_exec_time AS mean_ms,
stddev_exec_time AS stddev_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
Operational note: Export these queries and parse their WHERE and JOIN clauses. Map every filter column to a candidate partition or shard key. If β₯80% of your top-20 queries filter on a single column (e.g. created_at, tenant_id, region_code), that column is your partition key candidate, and single-node range partitioning may be sufficient.
SRE tip: Run EXPLAIN (ANALYZE, BUFFERS) on your three highest-load queries using the candidate key. The plan must show Seq Scan on <child_table> for just one or two child tables β never a full scan of the parent. If the planner cannot prune >90% of disk I/O, the key is misaligned. Common culprits: implicit type casts, function wrappers around the partition column in WHERE, or composite conditions where only the secondary column is selective.
Step 2: Evaluate Consistency and Transactional Boundaries
Match your consistency model to the distribution strategy before committing. Distributed architectures introduce network latency and partial failure risks that a single-node setup avoids.
-- Identify tables involved in multi-table transactions (PostgreSQL)
-- Look for transactions that touch >1 table in the same BEGIN/COMMIT block
SELECT
pid,
query_start,
state,
query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND query_start < NOW() - INTERVAL '5 seconds'
ORDER BY query_start;
Operational note: If your business logic requires ACID compliance across multiple related entities (e.g. order creation + inventory decrement + payment record in one transaction), distributing those tables across independent nodes forces two-phase commit (2PC). Each 2PC round-trip adds 1β2 network RTTs per write. For sub-50 ms SLAs, that overhead is often unacceptable.
SRE tip: Audit the scope of your largest transactions. If every cross-entity write stays within a single tenant_id or region_code, you can shard on that key and keep each transaction entirely within one shard, preserving ACID without 2PC overhead. See consistency models in distributed databases for the full tradeoff matrix.
The diagram below maps your consistency requirement to the viable distribution strategy:
Step 3: Calculate Scaling Limits and Cost Tradeoffs
Quantify infrastructure ceilings and operational expenses before provisioning new nodes. Premature horizontal scaling inflates total cost of ownership without measurable throughput gains.
-- Measure current table size and index bloat (PostgreSQL)
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size,
pg_size_pretty(
pg_total_relation_size(schemaname || '.' || tablename)
- pg_relation_size(schemaname || '.' || tablename)
) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 10;
Operational note: Use the table below to map your current metrics to the right distribution strategy. Measure sustained CPU utilization, disk I/O wait, and connection saturation over a 7-day window β not just peak spikes.
| Metric | Single-node partitioning | Distributed sharding |
|---|---|---|
| Max practical hot rows | ~50 Mβ200 M (index bloat dependent) | Billions+ (linear with shard count) |
| Write throughput ceiling | Single-node WAL and I/O | Parallelized across shards |
| Rebalancing cost | Near-zero (ALTER TABLE ATTACH) |
High (data migration + router updates) |
| Cross-node join cost | Zero (local planner) | 3β10Γ latency via scatter-gather |
| Backup and restore | Single snapshot, fast RTO | Coordinated multi-node, longer RTO |
| Operational overhead | Standard DBA tooling | Custom routing, monitoring, failover |
| ACID across entities | Full, no coordination | Requires 2PC or Saga patterns |
SRE tip: If sustained CPU is below 75% and disk I/O wait is below 20%, optimize indexes and connection pool sizing before considering a topology change. Review scaling limits and cost tradeoffs for a detailed capacity planning methodology and cost model.
Step 4: Execute the Architecture Decision
Apply the quantified metrics from Steps 1β3 to select a concrete implementation path.
# Quick decision heuristic β run against your metrics
# Replace values with your measured figures
ROW_COUNT=75000000 # rows in the hot table
CPU_SUSTAINED_PCT=68 # 7-day sustained CPU %
IO_WAIT_PCT=12 # 7-day sustained I/O wait %
CROSS_TENANT_JOIN_PCT=8 # % of queries that join across tenant boundaries
echo "=== Architecture Decision ==="
if [ "$ROW_COUNT" -lt 10000000 ] && [ "$CPU_SUSTAINED_PCT" -lt 50 ]; then
echo "RECOMMENDATION: Declarative partitioning only."
elif [ "$ROW_COUNT" -lt 100000000 ] && [ "$CROSS_TENANT_JOIN_PCT" -lt 20 ]; then
echo "RECOMMENDATION: Partitioning + read replicas."
else
echo "RECOMMENDATION: Evaluate sharding β verify 2PC tolerance first."
fi
Operational note: Apply these concrete thresholds to your measured values:
- < 10 M rows, localized traffic, strict ACID: Use declarative range or list partitioning. No distributed complexity needed.
- 10 Mβ100 M rows, moderate cross-tenant joins < 20%: Partitioning with read replicas. Add automated partition creation workflows for zero-touch housekeeping.
- > 100 M rows, geo-distribution, or multi-tenant isolation mandated: Implement sharding, starting with consistent hash routing to prevent hot nodes.
- Regulatory data residency requirements: Sharding by region or tenant is non-negotiable; partition pruning alone cannot enforce physical data placement.
SRE tip β rollback procedures for failed migrations:
- Maintain a shadow topology with identical schema during migration.
- Use feature flags in the application routing layer to toggle between old and new topologies without a deployment.
- If P95 latency degrades more than 30% post-cutover, revert routing flags immediately and drain traffic to the original topology.
- Preserve all migration WAL or binlog files for point-in-time recovery before decommissioning old nodes.
Verification
After selecting and implementing a strategy, confirm the architecture is behaving as expected.
-- Verify partition pruning is active (PostgreSQL)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE created_at >= '2025-01-01'
AND created_at < '2026-01-01'
AND tenant_id = 42;
-- Expected: plan touches only orders_2025 child table.
-- Red flag: plan shows "Append" scanning ALL child tables.
-- Check partition row distribution for skew
SELECT
child.relname AS partition_name,
pg_stat_get_live_tuples(child.oid) AS live_rows,
pg_size_pretty(pg_relation_size(child.oid)) AS size
FROM pg_inherits
JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
JOIN pg_class child ON pg_inherits.inhrelid = child.oid
WHERE parent.relname = 'orders'
ORDER BY live_rows DESC;
-- Expected: roughly uniform row counts across partitions.
-- Red flag: one partition holds >3x the rows of peers (hot partition).
For a sharded setup, verify per-shard write distribution using your routerβs metrics endpoint or by running the row-count query above against each shard node independently. Divergence greater than 30% between the most- and least-loaded shard indicates a shard key skew problem that will worsen under load.
Failure Mode Table
| Failure mode | Root cause | SRE mitigation |
|---|---|---|
| Partition pruning silently disabled | Implicit type cast on the partition key column (e.g. WHERE created_at::date = '2025-01-01') or a function wrapper prevents the planner from matching the partition constraint. |
Rewrite predicates to match the declared column type exactly. Run EXPLAIN on every new query template before deploying; include a CI gate that rejects full-parent-scan plans on partitioned tables. |
| Hot shard under consistent hashing | Low-cardinality shard key (e.g. country_code with 30 distinct values) concentrates writes on a small number of virtual-node ranges, overwhelming one physical node. |
Switch to a composite shard key (region_code, user_id) or add a synthetic hash suffix to high-traffic keys. Monitor per-shard QPS and alert when any shard exceeds 2Γ the fleet-wide median QPS. |
| Cross-shard write partially applied | Network partition between the application and one shard node during a multi-shard write without 2PC or a compensating Saga leaves rows committed on one shard and absent on another. | Implement idempotent writes with a correlation ID on every mutation. Use the Saga pattern with explicit compensating transactions, or adopt XA 2PC if your driver and database version support it reliably. |
FAQ
When should I switch from partitioning to sharding?
Switch when the single nodeβs physical ceiling β CPU, memory, or storage IOPS β cannot be raised with vertical scaling and you have already optimized indexes, query plans, and connection pooling. Data residency regulations that mandate physical separation of data by region or tenant also force the sharding path, regardless of row count. Review the scaling limits and cost tradeoffs section for specific capacity thresholds and cost models.
Can partitioning and sharding be combined in one architecture?
Yes. Hybrid architectures are common at scale. Each shard holds a disjoint slice of the keyspace (e.g. tenant ID ranges), and within each shard the local tables are declaratively partitioned by time range for efficient pruning. This lets you apply time-range partition pruning to reduce I/O on each shard while still distributing write load horizontally. The operational cost increases β you now manage both partition housekeeping and a shard router β so the hybrid is only justified when a single shard would still hit storage or I/O ceilings on its own.
How do I prevent hot partitions in high-traffic apps?
Avoid shard or partition keys that are monotonically increasing (auto-increment IDs, timestamps alone) or that have very low cardinality (boolean flags, country codes with fewer than 100 distinct values). Use hash routing algorithms to distribute writes across virtual nodes, and add a salt prefix to sequential identifiers to break up write hotspots. After any key change, validate distribution by querying per-partition row counts and comparing them against the expected uniform distribution; more than a 3Γ variance between any two partitions is a signal to revisit the key choice.
Related
- Sharding vs Partitioning: Core Concepts β parent page covering architectural boundaries, routing models, and transaction guarantees
- Scaling Limits and Cost Tradeoffs β capacity planning methodology and cost models before committing to a topology
- Hash Routing Algorithms β consistent hashing implementation for sharded architectures, including virtual-node allocation and ring topology setup