Skip to main content

Handling Hot Keys in List Partitioned Tables

This guide covers eliminating write hotspots caused by uneven value distribution in list-partitioned tables, which is a common operational problem in the broader partitioning implementation and routing discipline.

Prerequisites


Hot key sub-partitioning: before and after Left side shows a list-partitioned events table where the CLICK partition is overloaded. Right side shows the same table after the CLICK partition is sub-partitioned into four hash buckets, distributing writes evenly. Before: skewed After: sub-partitioned events (LIST) events (LIST) purchase CLICK πŸ”₯ ~78% of writes lock contention refund purchase refund CLICK (HASH) sub-partitioned rem 0 rem 1 rem 2 rem 3 ~25% writes each β€” no contention

Step 1: Quantify Partition Skew

Before changing any schema, establish a numeric baseline so you can confirm improvement after the fix. Static list routing often masks I/O bottlenecks until transactional latency spikes.

SELECT
  schemaname,
  relname          AS partition_name,
  n_live_tup,
  n_tup_ins,
  n_tup_upd,
  heap_blks_read,
  heap_blks_hit,
  ROUND(
    heap_blks_hit::numeric /
    NULLIF(heap_blks_hit + heap_blks_read, 0) * 100,
    1
  )                AS cache_hit_pct
FROM pg_stat_user_tables
WHERE relname LIKE 'events_%'
ORDER BY n_tup_ins DESC
LIMIT 10;

Expected output: One partition with n_tup_ins that dwarfs siblings, and cache_hit_pct below 90% on that same partition β€” both are confirmed hot-key signals.

SRE tip: Correlate with pg_stat_activity filtered to wait_event_type = 'Lock' and wait_event = 'relation' to confirm that the I/O spike is accompanied by row-level lock waits on the hot partition, not an unrelated query spike elsewhere.

Operational note: A partition holding more than 30% of total write volume is a confirmed candidate. Cardinality skew alone is not sufficient β€” also verify lock contention via pg_locks before committing to a structural change.


Step 2: Add Hash Sub-Partitioning to the Hot List Value

Hash routing algorithms distribute writes by computing hash(key) % N, and the same principle applies here at the sub-partition level. PostgreSQL 11+ lets a list child partition itself be partitioned by hash, requiring no extension.

The sub-partition column must be included in the table’s primary key β€” otherwise PostgreSQL cannot enforce uniqueness across hash buckets.

-- 1. Parent table partitioned by LIST on event_type
CREATE TABLE events (
  event_id   UUID         NOT NULL DEFAULT gen_random_uuid(),
  event_type VARCHAR(50)  NOT NULL,
  payload    JSONB,
  created_at TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
  PRIMARY KEY (event_id, event_type)         -- event_type in PK for list partitioning
) PARTITION BY LIST (event_type);

-- 2. Hot category: sub-partition by HASH on event_id to spread writes
CREATE TABLE events_click PARTITION OF events
  FOR VALUES IN ('click')
  PARTITION BY HASH (event_id);

CREATE TABLE events_click_0 PARTITION OF events_click
  FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_click_1 PARTITION OF events_click
  FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_click_2 PARTITION OF events_click
  FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_click_3 PARTITION OF events_click
  FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- 3. Low-volume categories remain as simple list partitions
CREATE TABLE events_purchase PARTITION OF events FOR VALUES IN ('purchase', 'refund');
CREATE TABLE events_default  PARTITION OF events DEFAULT;

Operational note: Queries filtering only on event_type = 'click' will touch all four sub-partitions. That is an acceptable read cost for eliminating write contention. If read-side performance degrades beyond tolerance, add a partial index on each sub-partition or include event_id in query predicates.

DBA tip: Start with MODULUS 4 (matching your CPU core count or I/O parallelism limit). If contention re-emerges under sustained load, doubling to 8 requires only adding four new child tables and a batched backfill β€” no structural changes to the parent.


Step 3: Zero-Downtime Migration to the Sub-Partitioned Structure

Restructuring a live table without downtime requires a side-by-side migration, not an in-place ALTER TABLE.

# Recommended migration sequence (bash pseudocode for documentation clarity)
# 1. Create the new sub-partitioned table structure alongside the live table
# 2. Enable dual-write at the application layer via a feature flag
# 3. Backfill historical data in bounded batches (avoid long-held locks)
# 4. Validate row counts and data integrity
# 5. Cut over reads to the new table; then drop the old table

For the backfill step, use bounded batches to avoid lock escalation:

-- Backfill in chunks of 10 000 rows; run until no rows remain
INSERT INTO events_new
SELECT *
FROM   events_old
WHERE  event_id > $last_seen_id
ORDER  BY event_id
LIMIT  10000;

Operational note: Run dual-write behind a feature flag in a distributed configuration store (etcd, Consul) with a versioned routing manifest. Atomic hot-reload prevents the split-brain scenario where some application instances write to the old table while others write to the new one.

SRE tip: Validate sub-partition row counts immediately after cutover. Run SELECT tableoid::regclass, count(*) FROM events GROUP BY 1; and confirm that events_click_0 through events_click_3 each hold roughly 25% of historical click rows.


Step 4: Application-Level Overflow Routing (Alternative to DDL Changes)

When DDL changes are blocked by operational constraints or the hot key is transient, redirect writes in the application tier using consistent hash routing as a fallback:

function resolvePartition(key: string, hotKeys: Set<string>): string {
  if (!hotKeys.has(key)) {
    return `partition_${key}`;
  }
  // Deterministic FNV-1a hash keeps the same key landing on the same overflow shard
  let hash = 2166136261;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = (hash * 16777619) >>> 0;
  }
  return `overflow_${hash % 8}`;
}

Operational note: The hotKeys set must be managed by a distributed configuration store with versioned updates β€” not an environment variable or in-memory map. Stale routing tables during a rolling deployment cause split-brain writes that are difficult to reconcile without a full partition scan.

DBA tip: Pair this with Prometheus alerting on per-partition IOPS (pg_stat_user_tables exported via postgres_exporter). Alert at 70% of the overflow threshold to give the on-call engineer time to provision a new overflow shard before the current one saturates.


Verification

After the migration (or after enabling overflow routing), confirm that write load is balanced:

-- Per-partition write rates since the last stats reset
SELECT
  relname          AS partition,
  n_tup_ins        AS inserts,
  n_tup_upd        AS updates,
  ROUND(
    n_tup_ins::numeric /
    NULLIF(SUM(n_tup_ins) OVER (), 0) * 100,
    1
  )                AS insert_pct
FROM pg_stat_user_tables
WHERE relname LIKE 'events_%'
ORDER BY n_tup_ins DESC;

Expected output after fix:

partition        | inserts | updates | insert_pct
-----------------+---------+---------+-----------
events_click_0   |  248310 |    1201 |       24.9
events_click_1   |  249887 |    1198 |       25.1
events_click_2   |  250102 |    1202 |       25.1
events_click_3   |  247900 |    1199 |       24.9
events_purchase  |   75440 |     302 |        7.6
events_default   |   28360 |      88 |        2.8

No single sub-partition should hold more than 30% of total inserts once hash distribution is active. Also re-run the pg_stat_activity lock-wait check from Step 1 to confirm that Lock wait events on the click partition have dropped to zero or near-zero.


Failure Mode Table

Failure mode Root cause SRE mitigation
Queries on event_type = 'click' scan all four sub-partitions Hash sub-partitioning breaks partition pruning for predicates that do not include the hash column (event_id) Rewrite hot read queries to include event_id in the WHERE clause, or create a partial index on each sub-partition keyed to the most common read predicates
Routing table desynchronisation causes split-brain writes during deployment Application instances running old and new routing configs simultaneously; dual-write disabled too early Use an atomic config-store update with a monotonically increasing version number; gate cutover on all instances acknowledging the new version before disabling dual-write
VACUUM/ANALYZE overhead grows after sub-partitioning Autovacuum now must maintain N child tables instead of one, increasing catalog churn Tune autovacuum_vacuum_cost_delay and autovacuum_vacuum_scale_factor per sub-partition; schedule explicit ANALYZE events_click after large backfills rather than relying on autovacuum to catch up

FAQ

How do I detect hot keys in an existing list-partitioned table?

Query pg_stat_user_tables for n_tup_ins and heap_blks_read per partition, then join to pg_locks filtered on the hot partition’s OID to confirm lock contention is present. Use EXPLAIN (ANALYZE, BUFFERS) on a representative INSERT to see buffer usage. A partition with heap_blks_read consistently above siblings and a low cache_hit_pct (below 90%) is almost always a hot-key target.

Does hash sub-partitioning affect read performance for hot partitions?

Yes. Queries filtering only on event_type = 'click' reach all four hash buckets because the planner has no way to prune by hash remainder without also knowing the event_id. If those reads are latency-sensitive, include event_id in the predicate, or add a covering partial index per sub-partition. For analytical workloads that must scan all click events, the added overhead is typically negligible compared with the contention eliminated on the write path.

When should I abandon list partitioning entirely for a hot key?

When a category consistently exceeds 30% of total write volume AND the discrete category label has no archival, routing, or compliance value β€” for example, you never need to drop all data for event_type = 'click' as a unit β€” switch to hash partitioning on the primary key. Hash routing algorithms guarantee uniform distribution by design, at the cost of losing category-based partition pruning and targeted partition detachment. Compare the trade-offs against range partitioning strategies if the write load is also time-correlated.