Optimizing Cross-Partition Aggregations with Materialized Views
This guide shows how to deploy per-partition materialized views in PostgreSQL to eliminate the cross-shard fan-out latency that plagues distributed SUM, COUNT, and AVG queries — part of the broader Cross-Partition Querying & Aggregation Strategies approach.
Prerequisites
Step 1 — Diagnose the fan-out bottleneck
Before building views, confirm that cross-partition fan-out is the actual bottleneck rather than a missing index or a suboptimal query shape.
-- Capture the full distributed execution plan
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT
DATE_TRUNC('day', created_at) AS day,
SUM(amount) AS total_amount,
COUNT(*) AS order_count
FROM orders
WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY 1
ORDER BY 1;
What to look for: Append nodes that fan out to every child partition, Hash Aggregate nodes on the coordinator with large Batches counts, and high shared hit / shared read buffer ratios on partitions that are outside the query’s date range (indicating missing partition pruning).
DBA tip: If the
EXPLAINoutput showsParallel Seq Scanon every child partition, your partition key is not aligning with theWHEREclause predicate. Fix partition pruning first — materialized views will not help if the query ignores the partition boundary.
Step 2 — Create one materialized view per child partition
Standard PostgreSQL does not allow CREATE MATERIALIZED VIEW ... PARTITION BY .... Instead, create one view per child partition. This mirrors the partition boundary exactly, enabling index-only scans and eliminating any cross-node data movement at refresh time.
-- Q1 partition
CREATE MATERIALIZED VIEW mv_orders_2024_q1_daily AS
SELECT
DATE_TRUNC('day', created_at) AS day,
SUM(amount) AS total_amount,
COUNT(*) AS order_count
FROM orders_2024_q1
GROUP BY 1
WITH DATA;
CREATE UNIQUE INDEX idx_mv_q1_day ON mv_orders_2024_q1_daily (day);
-- Q2 partition
CREATE MATERIALIZED VIEW mv_orders_2024_q2_daily AS
SELECT
DATE_TRUNC('day', created_at) AS day,
SUM(amount) AS total_amount,
COUNT(*) AS order_count
FROM orders_2024_q2
GROUP BY 1
WITH DATA;
CREATE UNIQUE INDEX idx_mv_q2_day ON mv_orders_2024_q2_daily (day);
Operational note: The UNIQUE INDEX on day is required for two reasons: it enables REFRESH MATERIALIZED VIEW CONCURRENTLY (which needs a unique index to identify changed rows), and it supports the ON CONFLICT (day) DO UPDATE merge pattern in Step 3.
SRE tip: Automate this DDL. When your automated partition creation workflow provisions a new child partition, a post-creation hook should immediately create the corresponding MV and its unique index. Gaps here are a common source of silent query degradation after partition rollovers.
The diagram below shows how this architecture separates the write path (raw inserts into child partitions) from the read path (index-only scans on per-partition MVs, unioned by the coordinator):
Step 3 — Implement incremental delta refresh
Full REFRESH MATERIALIZED VIEW rebuilds the entire view from scratch, acquiring an exclusive lock (or a slower non-blocking scan with CONCURRENTLY). For high-write tables, use an append-only staging table and atomic delta merges instead.
-- Apply incremental delta from staging into the Q1 MV
BEGIN;
INSERT INTO mv_orders_2024_q1_daily (day, total_amount, order_count)
SELECT
DATE_TRUNC('day', created_at) AS day,
SUM(amount) AS total_amount,
COUNT(*) AS order_count
FROM orders_staging
WHERE created_at >= '2024-01-01'
AND created_at < '2024-04-01'
GROUP BY 1
ON CONFLICT (day) DO UPDATE
SET total_amount = mv_orders_2024_q1_daily.total_amount + EXCLUDED.total_amount,
order_count = mv_orders_2024_q1_daily.order_count + EXCLUDED.order_count;
-- Purge processed staging rows
DELETE FROM orders_staging
WHERE created_at >= '2024-01-01'
AND created_at < '2024-04-01';
COMMIT;
Operational note: The ON CONFLICT ... DO UPDATE makes each delta application idempotent. If the merge transaction is retried after a network error (common in CDC-driven pipelines), the accumulated totals remain correct rather than double-counting.
SRE tip: Trigger this merge from a Kafka consumer, a CDC hook (Debezium, pglogical), or a scheduled pg_cron job — not a wall-clock cron that ignores write volume. A fixed 1-minute cron fires at the same rate during off-peak hours as during peak, wasting resources. An event-driven trigger fires proportionally to write throughput.
Step 4 — Route reads through a hybrid coordinator query
The coordinator unions per-partition MV results for historical ranges and bypasses the MV for rows within the freshness window. This is the same query-time composability that federated query execution relies on, applied here at the coordinator level rather than at a separate query engine.
-- Hybrid aggregation: MV for history, live table for the freshness window
SELECT day, SUM(total_amount) AS global_total, SUM(order_count) AS global_count
FROM (
-- Historical segments from per-partition MVs
SELECT day, total_amount, order_count FROM mv_orders_2024_q1_daily
UNION ALL
SELECT day, total_amount, order_count FROM mv_orders_2024_q2_daily
UNION ALL
-- Sub-second-fresh rows from the live partitioned table
SELECT
DATE_TRUNC('day', created_at) AS day,
amount AS total_amount,
1 AS order_count
FROM orders
WHERE created_at > NOW() - INTERVAL '5 minutes'
) sub
GROUP BY day
ORDER BY day;
Operational note: Parameterise the INTERVAL '5 minutes' as a configuration value in your application layer. When MV refresh lag increases (detectable via the verification query in Step 5), widen the freshness window rather than letting stale MV rows contaminate results.
SRE tip: Wrap the coordinator query in a function that accepts
p_freshness_interval INTERVALand checkspg_stat_user_tablesfor the last MV vacuum timestamp before deciding whether to widen the fallback window. This avoids manual intervention during refresh lag events.
Step 5 — Verify correctness and performance
Run these queries after initial deployment and after each schema change to confirm the MV is serving the expected data and that the query planner is using index scans rather than sequential scans.
-- 1. Check the planner uses an index-only scan on the MV
EXPLAIN (ANALYZE, BUFFERS)
SELECT day, total_amount, order_count
FROM mv_orders_2024_q1_daily
WHERE day BETWEEN '2024-01-01' AND '2024-01-31';
-- Expected: "Index Only Scan using idx_mv_q1_day on mv_orders_2024_q1_daily"
-- Red flag: "Seq Scan" — rebuild the unique index if it was created before WITH DATA
-- 2. Cross-check MV totals against the live partition for a complete date range
SELECT
'mv' AS source, SUM(total_amount) AS total, SUM(order_count) AS cnt
FROM mv_orders_2024_q1_daily
UNION ALL
SELECT
'live' AS source, SUM(amount), COUNT(*)
FROM orders_2024_q1;
-- Expected: both rows return identical totals (within delta-merge lag tolerance)
-- 3. Confirm refresh recency via pg_stat_user_tables
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname LIKE 'mv_orders_%'
ORDER BY relname;
Expected output for the cross-check query:
source | total | cnt
--------+--------------+---------
mv | 4821903.22 | 198442
live | 4821903.22 | 198442
(2 rows)
Any divergence between mv and live rows indicates either a missed delta merge or an incomplete WITH DATA refresh. Re-run Step 3’s delta merge or issue a full REFRESH MATERIALIZED VIEW CONCURRENTLY mv_orders_2024_q1_daily to resync.
Failure mode table
| Failure mode | Root cause | SRE mitigation |
|---|---|---|
| Coordinator returns stale totals during write bursts | Delta merge lag exceeds the freshness window; MV rows are older than NOW() - INTERVAL '5 minutes' |
Widen the freshness window in the hybrid query, or trigger an emergency delta merge; alert on pg_stat_user_tables.last_analyze age exceeding SLA |
REFRESH MATERIALIZED VIEW CONCURRENTLY fails with “index required” |
The unique index on day was dropped or never created, so PostgreSQL cannot identify changed rows |
Re-create the unique index (CREATE UNIQUE INDEX idx_mv_q1_day ON mv_orders_2024_q1_daily (day)) before retrying the refresh |
| Cross-check totals diverge after a retry storm | ON CONFLICT DO UPDATE accumulates double-counted deltas when staging rows are not purged atomically |
Wrap the delta merge and staging DELETE in a single transaction; ensure the staging table has a unique constraint on (created_at, order_id) to prevent duplicate inserts |
FAQ
How do I handle materialized view staleness during high-throughput write bursts?
Implement delta-based refresh triggers with a maximum staleness threshold. Route queries exceeding the threshold directly to live partitions via a fallback in the coordinator query. The hybrid UNION ALL pattern in Step 4 handles this automatically when you parameterise the freshness interval. For burst scenarios, reduce the freshness window to 1–2 minutes and increase the delta merge frequency rather than issuing full refreshes.
Can a single materialized view span multiple child partitions?
Yes — an MV can SELECT from multiple child partition tables in a single query, because child partitions are regular tables from the MV’s perspective. However, refresh cost grows linearly with partition count, and a failure in any partition’s scan aborts the entire refresh. Per-partition MVs unioned at the coordinator give better refresh granularity, lower per-refresh cost, and let you stagger refresh windows across partitions to smooth I/O.
What refresh interval is appropriate for real-time dashboards?
Use event-driven delta merges triggered by CDC events for sub-second freshness. If your pipeline does not yet have CDC, schedule REFRESH MATERIALIZED VIEW CONCURRENTLY at 1–5 minute intervals and set the hybrid query’s freshness window to match. Dashboards backed by proxy routing architectures can also direct dashboard traffic specifically to read replicas where MVs are refreshed on a tighter schedule without competing with write workloads.
Related
- Cross-Shard Aggregation Patterns — parent page covering scatter-gather, push-down aggregation, and the trade-offs between live and pre-aggregated query strategies
- Federated Query Execution — coordinating queries across multiple independent PostgreSQL instances, including result merging and connection fanout management
- Proxy Routing Architectures — configuring HAProxy and PgBouncer to route read traffic to MV-backed replicas without application changes