Skip to main content

How to Choose Between Sharding and Partitioning for High-Traffic Apps

This guide explains how to decide between single-node declarative partitioning and distributed sharding for high-traffic workloads, as part of Database Partitioning Fundamentals & Architecture. The decision directly impacts P95 latency, incident blast radius, and your ability to scale without downtime.

Prerequisites

Before working through the steps below, confirm the following:


Step 1: Analyze Query Patterns and Data Access Locality

Determine whether your workload can be satisfied by a single-node partition key or whether queries inherently cross node boundaries.

-- Extract the top 20 queries by total execution time (PostgreSQL)
SELECT
  query,
  calls,
  total_exec_time / 1000        AS total_sec,
  mean_exec_time                AS mean_ms,
  stddev_exec_time              AS stddev_ms,
  rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Operational note: Export these queries and parse their WHERE and JOIN clauses. Map every filter column to a candidate partition or shard key. If β‰₯80% of your top-20 queries filter on a single column (e.g. created_at, tenant_id, region_code), that column is your partition key candidate, and single-node range partitioning may be sufficient.

SRE tip: Run EXPLAIN (ANALYZE, BUFFERS) on your three highest-load queries using the candidate key. The plan must show Seq Scan on <child_table> for just one or two child tables β€” never a full scan of the parent. If the planner cannot prune >90% of disk I/O, the key is misaligned. Common culprits: implicit type casts, function wrappers around the partition column in WHERE, or composite conditions where only the secondary column is selective.


Step 2: Evaluate Consistency and Transactional Boundaries

Match your consistency model to the distribution strategy before committing. Distributed architectures introduce network latency and partial failure risks that a single-node setup avoids.

-- Identify tables involved in multi-table transactions (PostgreSQL)
-- Look for transactions that touch >1 table in the same BEGIN/COMMIT block
SELECT
  pid,
  query_start,
  state,
  query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND query_start < NOW() - INTERVAL '5 seconds'
ORDER BY query_start;

Operational note: If your business logic requires ACID compliance across multiple related entities (e.g. order creation + inventory decrement + payment record in one transaction), distributing those tables across independent nodes forces two-phase commit (2PC). Each 2PC round-trip adds 1–2 network RTTs per write. For sub-50 ms SLAs, that overhead is often unacceptable.

SRE tip: Audit the scope of your largest transactions. If every cross-entity write stays within a single tenant_id or region_code, you can shard on that key and keep each transaction entirely within one shard, preserving ACID without 2PC overhead. See consistency models in distributed databases for the full tradeoff matrix.

The diagram below maps your consistency requirement to the viable distribution strategy:

Consistency vs Distribution Decision Tree A decision tree showing that strong ACID requirements favour single-node partitioning, eventual consistency enables sharding with async replication, and cross-entity ACID forces 2PC or Saga patterns when sharding is unavoidable. Consistency requirement? scope of atomic writes Strong ACID Eventual / relaxed Fits single node? CPU/IO/storage ceiling Shard key isolates all writes per tenant/region? Yes Partition declarative, local ACID No Shard + 2PC/Saga unavoidable cross-node writes Yes Shard + async idempotent writes required No Redesign data model denormalize or co-locate Transitioning consistency models requires dual-write phases and reconciliation jobs. Never flip consistency models during peak traffic windows.

Step 3: Calculate Scaling Limits and Cost Tradeoffs

Quantify infrastructure ceilings and operational expenses before provisioning new nodes. Premature horizontal scaling inflates total cost of ownership without measurable throughput gains.

-- Measure current table size and index bloat (PostgreSQL)
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
  pg_size_pretty(pg_relation_size(schemaname || '.' || tablename))       AS table_size,
  pg_size_pretty(
    pg_total_relation_size(schemaname || '.' || tablename)
    - pg_relation_size(schemaname || '.' || tablename)
  ) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 10;

Operational note: Use the table below to map your current metrics to the right distribution strategy. Measure sustained CPU utilization, disk I/O wait, and connection saturation over a 7-day window β€” not just peak spikes.

Metric Single-node partitioning Distributed sharding
Max practical hot rows ~50 M–200 M (index bloat dependent) Billions+ (linear with shard count)
Write throughput ceiling Single-node WAL and I/O Parallelized across shards
Rebalancing cost Near-zero (ALTER TABLE ATTACH) High (data migration + router updates)
Cross-node join cost Zero (local planner) 3–10Γ— latency via scatter-gather
Backup and restore Single snapshot, fast RTO Coordinated multi-node, longer RTO
Operational overhead Standard DBA tooling Custom routing, monitoring, failover
ACID across entities Full, no coordination Requires 2PC or Saga patterns

SRE tip: If sustained CPU is below 75% and disk I/O wait is below 20%, optimize indexes and connection pool sizing before considering a topology change. Review scaling limits and cost tradeoffs for a detailed capacity planning methodology and cost model.


Step 4: Execute the Architecture Decision

Apply the quantified metrics from Steps 1–3 to select a concrete implementation path.

# Quick decision heuristic β€” run against your metrics
# Replace values with your measured figures
ROW_COUNT=75000000        # rows in the hot table
CPU_SUSTAINED_PCT=68      # 7-day sustained CPU %
IO_WAIT_PCT=12            # 7-day sustained I/O wait %
CROSS_TENANT_JOIN_PCT=8   # % of queries that join across tenant boundaries

echo "=== Architecture Decision ==="
if [ "$ROW_COUNT" -lt 10000000 ] && [ "$CPU_SUSTAINED_PCT" -lt 50 ]; then
  echo "RECOMMENDATION: Declarative partitioning only."
elif [ "$ROW_COUNT" -lt 100000000 ] && [ "$CROSS_TENANT_JOIN_PCT" -lt 20 ]; then
  echo "RECOMMENDATION: Partitioning + read replicas."
else
  echo "RECOMMENDATION: Evaluate sharding β€” verify 2PC tolerance first."
fi

Operational note: Apply these concrete thresholds to your measured values:

  • < 10 M rows, localized traffic, strict ACID: Use declarative range or list partitioning. No distributed complexity needed.
  • 10 M–100 M rows, moderate cross-tenant joins < 20%: Partitioning with read replicas. Add automated partition creation workflows for zero-touch housekeeping.
  • > 100 M rows, geo-distribution, or multi-tenant isolation mandated: Implement sharding, starting with consistent hash routing to prevent hot nodes.
  • Regulatory data residency requirements: Sharding by region or tenant is non-negotiable; partition pruning alone cannot enforce physical data placement.

SRE tip β€” rollback procedures for failed migrations:

  1. Maintain a shadow topology with identical schema during migration.
  2. Use feature flags in the application routing layer to toggle between old and new topologies without a deployment.
  3. If P95 latency degrades more than 30% post-cutover, revert routing flags immediately and drain traffic to the original topology.
  4. Preserve all migration WAL or binlog files for point-in-time recovery before decommissioning old nodes.

Verification

After selecting and implementing a strategy, confirm the architecture is behaving as expected.

-- Verify partition pruning is active (PostgreSQL)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE created_at >= '2025-01-01'
  AND created_at <  '2026-01-01'
  AND tenant_id = 42;
-- Expected: plan touches only orders_2025 child table.
-- Red flag: plan shows "Append" scanning ALL child tables.

-- Check partition row distribution for skew
SELECT
  child.relname                                  AS partition_name,
  pg_stat_get_live_tuples(child.oid)             AS live_rows,
  pg_size_pretty(pg_relation_size(child.oid))    AS size
FROM pg_inherits
JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
JOIN pg_class child  ON pg_inherits.inhrelid  = child.oid
WHERE parent.relname = 'orders'
ORDER BY live_rows DESC;
-- Expected: roughly uniform row counts across partitions.
-- Red flag: one partition holds >3x the rows of peers (hot partition).

For a sharded setup, verify per-shard write distribution using your router’s metrics endpoint or by running the row-count query above against each shard node independently. Divergence greater than 30% between the most- and least-loaded shard indicates a shard key skew problem that will worsen under load.


Failure Mode Table

Failure mode Root cause SRE mitigation
Partition pruning silently disabled Implicit type cast on the partition key column (e.g. WHERE created_at::date = '2025-01-01') or a function wrapper prevents the planner from matching the partition constraint. Rewrite predicates to match the declared column type exactly. Run EXPLAIN on every new query template before deploying; include a CI gate that rejects full-parent-scan plans on partitioned tables.
Hot shard under consistent hashing Low-cardinality shard key (e.g. country_code with 30 distinct values) concentrates writes on a small number of virtual-node ranges, overwhelming one physical node. Switch to a composite shard key (region_code, user_id) or add a synthetic hash suffix to high-traffic keys. Monitor per-shard QPS and alert when any shard exceeds 2Γ— the fleet-wide median QPS.
Cross-shard write partially applied Network partition between the application and one shard node during a multi-shard write without 2PC or a compensating Saga leaves rows committed on one shard and absent on another. Implement idempotent writes with a correlation ID on every mutation. Use the Saga pattern with explicit compensating transactions, or adopt XA 2PC if your driver and database version support it reliably.

FAQ

When should I switch from partitioning to sharding?

Switch when the single node’s physical ceiling β€” CPU, memory, or storage IOPS β€” cannot be raised with vertical scaling and you have already optimized indexes, query plans, and connection pooling. Data residency regulations that mandate physical separation of data by region or tenant also force the sharding path, regardless of row count. Review the scaling limits and cost tradeoffs section for specific capacity thresholds and cost models.

Can partitioning and sharding be combined in one architecture?

Yes. Hybrid architectures are common at scale. Each shard holds a disjoint slice of the keyspace (e.g. tenant ID ranges), and within each shard the local tables are declaratively partitioned by time range for efficient pruning. This lets you apply time-range partition pruning to reduce I/O on each shard while still distributing write load horizontally. The operational cost increases β€” you now manage both partition housekeeping and a shard router β€” so the hybrid is only justified when a single shard would still hit storage or I/O ceilings on its own.

How do I prevent hot partitions in high-traffic apps?

Avoid shard or partition keys that are monotonically increasing (auto-increment IDs, timestamps alone) or that have very low cardinality (boolean flags, country codes with fewer than 100 distinct values). Use hash routing algorithms to distribute writes across virtual nodes, and add a salt prefix to sequential identifiers to break up write hotspots. After any key change, validate distribution by querying per-partition row counts and comparing them against the expected uniform distribution; more than a 3Γ— variance between any two partitions is a signal to revisit the key choice.