Handling Hot Keys in List Partitioned Tables
This guide covers eliminating write hotspots caused by uneven value distribution in list-partitioned tables, which is a common operational problem in the broader partitioning implementation and routing discipline.
Prerequisites
Step 1: Quantify Partition Skew
Before changing any schema, establish a numeric baseline so you can confirm improvement after the fix. Static list routing often masks I/O bottlenecks until transactional latency spikes.
SELECT
schemaname,
relname AS partition_name,
n_live_tup,
n_tup_ins,
n_tup_upd,
heap_blks_read,
heap_blks_hit,
ROUND(
heap_blks_hit::numeric /
NULLIF(heap_blks_hit + heap_blks_read, 0) * 100,
1
) AS cache_hit_pct
FROM pg_stat_user_tables
WHERE relname LIKE 'events_%'
ORDER BY n_tup_ins DESC
LIMIT 10;
Expected output: One partition with n_tup_ins that dwarfs siblings, and cache_hit_pct below 90% on that same partition β both are confirmed hot-key signals.
SRE tip: Correlate with
pg_stat_activityfiltered towait_event_type = 'Lock'andwait_event = 'relation'to confirm that the I/O spike is accompanied by row-level lock waits on the hot partition, not an unrelated query spike elsewhere.
Operational note: A partition holding more than 30% of total write volume is a confirmed candidate. Cardinality skew alone is not sufficient β also verify lock contention via pg_locks before committing to a structural change.
Step 2: Add Hash Sub-Partitioning to the Hot List Value
Hash routing algorithms distribute writes by computing hash(key) % N, and the same principle applies here at the sub-partition level. PostgreSQL 11+ lets a list child partition itself be partitioned by hash, requiring no extension.
The sub-partition column must be included in the tableβs primary key β otherwise PostgreSQL cannot enforce uniqueness across hash buckets.
-- 1. Parent table partitioned by LIST on event_type
CREATE TABLE events (
event_id UUID NOT NULL DEFAULT gen_random_uuid(),
event_type VARCHAR(50) NOT NULL,
payload JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (event_id, event_type) -- event_type in PK for list partitioning
) PARTITION BY LIST (event_type);
-- 2. Hot category: sub-partition by HASH on event_id to spread writes
CREATE TABLE events_click PARTITION OF events
FOR VALUES IN ('click')
PARTITION BY HASH (event_id);
CREATE TABLE events_click_0 PARTITION OF events_click
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_click_1 PARTITION OF events_click
FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_click_2 PARTITION OF events_click
FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_click_3 PARTITION OF events_click
FOR VALUES WITH (MODULUS 4, REMAINDER 3);
-- 3. Low-volume categories remain as simple list partitions
CREATE TABLE events_purchase PARTITION OF events FOR VALUES IN ('purchase', 'refund');
CREATE TABLE events_default PARTITION OF events DEFAULT;
Operational note: Queries filtering only on event_type = 'click' will touch all four sub-partitions. That is an acceptable read cost for eliminating write contention. If read-side performance degrades beyond tolerance, add a partial index on each sub-partition or include event_id in query predicates.
DBA tip: Start with
MODULUS 4(matching your CPU core count or I/O parallelism limit). If contention re-emerges under sustained load, doubling to 8 requires only adding four new child tables and a batched backfill β no structural changes to the parent.
Step 3: Zero-Downtime Migration to the Sub-Partitioned Structure
Restructuring a live table without downtime requires a side-by-side migration, not an in-place ALTER TABLE.
# Recommended migration sequence (bash pseudocode for documentation clarity)
# 1. Create the new sub-partitioned table structure alongside the live table
# 2. Enable dual-write at the application layer via a feature flag
# 3. Backfill historical data in bounded batches (avoid long-held locks)
# 4. Validate row counts and data integrity
# 5. Cut over reads to the new table; then drop the old table
For the backfill step, use bounded batches to avoid lock escalation:
-- Backfill in chunks of 10 000 rows; run until no rows remain
INSERT INTO events_new
SELECT *
FROM events_old
WHERE event_id > $last_seen_id
ORDER BY event_id
LIMIT 10000;
Operational note: Run dual-write behind a feature flag in a distributed configuration store (etcd, Consul) with a versioned routing manifest. Atomic hot-reload prevents the split-brain scenario where some application instances write to the old table while others write to the new one.
SRE tip: Validate sub-partition row counts immediately after cutover. Run
SELECT tableoid::regclass, count(*) FROM events GROUP BY 1;and confirm thatevents_click_0throughevents_click_3each hold roughly 25% of historical click rows.
Step 4: Application-Level Overflow Routing (Alternative to DDL Changes)
When DDL changes are blocked by operational constraints or the hot key is transient, redirect writes in the application tier using consistent hash routing as a fallback:
function resolvePartition(key: string, hotKeys: Set<string>): string {
if (!hotKeys.has(key)) {
return `partition_${key}`;
}
// Deterministic FNV-1a hash keeps the same key landing on the same overflow shard
let hash = 2166136261;
for (let i = 0; i < key.length; i++) {
hash ^= key.charCodeAt(i);
hash = (hash * 16777619) >>> 0;
}
return `overflow_${hash % 8}`;
}
Operational note: The hotKeys set must be managed by a distributed configuration store with versioned updates β not an environment variable or in-memory map. Stale routing tables during a rolling deployment cause split-brain writes that are difficult to reconcile without a full partition scan.
DBA tip: Pair this with Prometheus alerting on per-partition IOPS (
pg_stat_user_tablesexported viapostgres_exporter). Alert at 70% of the overflow threshold to give the on-call engineer time to provision a new overflow shard before the current one saturates.
Verification
After the migration (or after enabling overflow routing), confirm that write load is balanced:
-- Per-partition write rates since the last stats reset
SELECT
relname AS partition,
n_tup_ins AS inserts,
n_tup_upd AS updates,
ROUND(
n_tup_ins::numeric /
NULLIF(SUM(n_tup_ins) OVER (), 0) * 100,
1
) AS insert_pct
FROM pg_stat_user_tables
WHERE relname LIKE 'events_%'
ORDER BY n_tup_ins DESC;
Expected output after fix:
partition | inserts | updates | insert_pct
-----------------+---------+---------+-----------
events_click_0 | 248310 | 1201 | 24.9
events_click_1 | 249887 | 1198 | 25.1
events_click_2 | 250102 | 1202 | 25.1
events_click_3 | 247900 | 1199 | 24.9
events_purchase | 75440 | 302 | 7.6
events_default | 28360 | 88 | 2.8
No single sub-partition should hold more than 30% of total inserts once hash distribution is active. Also re-run the pg_stat_activity lock-wait check from Step 1 to confirm that Lock wait events on the click partition have dropped to zero or near-zero.
Failure Mode Table
| Failure mode | Root cause | SRE mitigation |
|---|---|---|
Queries on event_type = 'click' scan all four sub-partitions |
Hash sub-partitioning breaks partition pruning for predicates that do not include the hash column (event_id) |
Rewrite hot read queries to include event_id in the WHERE clause, or create a partial index on each sub-partition keyed to the most common read predicates |
| Routing table desynchronisation causes split-brain writes during deployment | Application instances running old and new routing configs simultaneously; dual-write disabled too early | Use an atomic config-store update with a monotonically increasing version number; gate cutover on all instances acknowledging the new version before disabling dual-write |
VACUUM/ANALYZE overhead grows after sub-partitioning |
Autovacuum now must maintain N child tables instead of one, increasing catalog churn | Tune autovacuum_vacuum_cost_delay and autovacuum_vacuum_scale_factor per sub-partition; schedule explicit ANALYZE events_click after large backfills rather than relying on autovacuum to catch up |
FAQ
How do I detect hot keys in an existing list-partitioned table?
Query pg_stat_user_tables for n_tup_ins and heap_blks_read per partition, then join to pg_locks filtered on the hot partitionβs OID to confirm lock contention is present. Use EXPLAIN (ANALYZE, BUFFERS) on a representative INSERT to see buffer usage. A partition with heap_blks_read consistently above siblings and a low cache_hit_pct (below 90%) is almost always a hot-key target.
Does hash sub-partitioning affect read performance for hot partitions?
Yes. Queries filtering only on event_type = 'click' reach all four hash buckets because the planner has no way to prune by hash remainder without also knowing the event_id. If those reads are latency-sensitive, include event_id in the predicate, or add a covering partial index per sub-partition. For analytical workloads that must scan all click events, the added overhead is typically negligible compared with the contention eliminated on the write path.
When should I abandon list partitioning entirely for a hot key?
When a category consistently exceeds 30% of total write volume AND the discrete category label has no archival, routing, or compliance value β for example, you never need to drop all data for event_type = 'click' as a unit β switch to hash partitioning on the primary key. Hash routing algorithms guarantee uniform distribution by design, at the cost of losing category-based partition pruning and targeted partition detachment. Compare the trade-offs against range partitioning strategies if the write load is also time-correlated.
Related
- List Partitioning Techniques β parent page covering DDL setup, pruning, and routing for list-partitioned schemas
- Hash Routing Algorithms β consistent hashing and modulus routing for uniform write distribution
- Step-by-Step Guide to Implementing Consistent Hash Routing β deep-dive covering application-layer hash routing implementation