InnoDB disk spill during FSE global styles compilation
Debugging 60s XHR timeout in wp_global_styles
The 60-Second Cutoff Anomaly
During an automated CI/CD pipeline run validating Full Site Editing (FSE) capabilities, a consistent 60.01-second timeout occurred when saving changes to the site editor. The environment was an isolated Debian 12 LXC container allocated 8 vCPUs and 16GB RAM, running Nginx 1.25.3, PHP-FPM 8.3.0, and MariaDB 10.11.4. The node was configured to evaluate the Gwyneth – Accessible Full Site Editing WordPress Block Theme against a strict accessibility compliance test suite.
The failure presented as a 504 Gateway Timeout in the browser network tab. No PHP fatal errors were logged. The Nginx error.log captured the upstream termination:
2023/11/21 14:02:11 [error] 41021#41021: *1452 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.0.1.55, server: staging.local, request: "PUT /wp-json/wp/v2/global-styles/184 HTTP/2.0", upstream: "fastcgi://unix:/run/php/php8.3-fpm.sock", host: "staging.local"
The PHP-FPM pool configuration enforced a request_terminate_timeout of 60s. The process was being actively killed by the FPM master process before the application could return a payload. I checked the FPM slow log, which was configured to trigger at 5 seconds:
[21-Nov-2023 14:01:16] [pool www] pid 41105
script_filename = /var/www/html/index.php
[0x00007fc1e2a1b120] mysqli_query() /var/www/html/wp-includes/class-wpdb.php:2252
[0x00007fc1e2a1b050] _do_query() /var/www/html/wp-includes/class-wpdb.php:2145
[0x00007fc1e2a1af40] query() /var/www/html/wp-includes/class-wpdb.php:2988
[0x00007fc1e2a1ae50] get_results() /var/www/html/wp-includes/post.php:7500
[0x00007fc1e2a1ad40] wp_get_post_revisions() /var/www/html/wp-includes/rest-api/endpoints/class-wp-rest-revisions-controller.php:240
The stack trace indicated the process was blocked waiting for a MySQL query to return within wp_get_post_revisions().
Socket Instrumentation with Sysdig
To observe the interaction between PHP-FPM and MariaDB without attaching a debugger that might alter timing, I used sysdig to capture system calls on the Unix domain socket and local TCP loopback.
I initiated a capture targeting the specific PHP worker PID (41105) and the MySQL port (3306):
sysdig -w fse_timeout.scap "proc.pid=41105 or fd.port=3306"
I replayed the XHR request with curl, sending a payload matching the site editor's PUT request. After the 60-second termination, I analyzed the capture file:
sysdig -r fse_timeout.scap -c echo_fds fd.port=3306
Output excerpt:
------ Write 342B to 127.0.0.1:3306->127.0.0.1:48212 (php-fpm8.3)
0.000000000: SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts WHERE 1=1 AND wp_posts.post_parent = 184 AND wp_posts.post_type = 'revision' AND wp_posts.post_status = 'inherit' ORDER BY wp_posts.post_date DESC LIMIT 0, 10
------ Read 85B from 127.0.0.1:48212->127.0.0.1:3306 (mariadbd)
59.981240122: [mariadbd responds with result set headers]
The sendto system call from PHP to MariaDB executed immediately. The matching recvfrom from MariaDB back to PHP occurred 59.98 seconds later: the database was taking nearly the full 60 seconds to execute a SELECT with a LIMIT 10 clause. The FPM master then terminated the PHP process only milliseconds after MariaDB began transmitting the response.
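Extracting that gap from the capture is mechanical; a minimal sketch that parses the relative timestamps `echo_fds` prefixes to each event line (the two sample lines are copied from the excerpt above):

```python
# Each echo_fds event line starts with a relative timestamp ("<secs>: ...").
# The two sample lines are copied from the capture excerpt above.
events = [
    "0.000000000: SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts ...",
    "59.981240122: [mariadbd responds with result set headers]",
]

def timestamp(line: str) -> float:
    """Pull the leading relative timestamp off an event line."""
    return float(line.split(":", 1)[0])

latency = timestamp(events[1]) - timestamp(events[0])
print(f"DB round-trip: {latency:.3f}s")  # 59.981s of the 60s FPM budget
```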
MariaDB Slow Query Profiling
I enabled the MySQL slow query log to capture the specific execution metrics of this query. I modified /etc/mysql/mariadb.conf.d/50-server.cnf:
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mariadb-slow.log
long_query_time = 2.0
log_queries_not_using_indexes = 1
I restarted the database daemon and re-triggered the test suite. I then processed the slow log using pt-query-digest:
pt-query-digest /var/log/mysql/mariadb-slow.log
The output identified the problematic query:
# Query 1: 0.02 QPS, 0.99x concurrency, ID 0x8F9A1B2C3D4E5F60 at byte 4012
# This item is included in the report because it matches --limit.
# Scores: V/M = 59.80
# Time: range 2023-11-21T14:15:01 to 2023-11-21T14:16:01
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 100 1
# Exec time 100 60s 60s 60s 60s 60s 0s 60s
# Lock time 100 124us 124us 124us 124us 124us 0s 124us
# Rows sent 100 10 10 10 10 10 0s 10
# Rows examine 100 2.45M 2.45M 2.45M 2.45M 2.45M 0s 2.45M
# Query size 100 214 214 214 214 214 0s 214
# String:
# Databases wordpress
# Hosts localhost
# Users wp_user
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms
# 100ms
# 1s
# 10s+ ################################################################
# Tables
# SHOW TABLE STATUS LIKE 'wp_posts'\G
# SHOW CREATE TABLE `wp_posts`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts WHERE 1=1 AND wp_posts.post_parent = 184 AND wp_posts.post_type = 'revision' AND wp_posts.post_status = 'inherit' ORDER BY wp_posts.post_date DESC LIMIT 0, 10\G
The query was scanning 2.45 million rows (Rows examine) to return 10 rows (Rows sent).
This node was part of an integration cluster that regularly ingests large datasets to parse and catalog theme metadata. The automated ingestion scripts had populated the wp_posts table with 2.45 million rows of dummy content, custom post types, and block patterns to simulate an aged, high-density environment.
Query Execution Plan Analysis
I connected to MariaDB via the CLI to execute an EXPLAIN FORMAT=JSON on the specific query to evaluate the optimizer's choices.
EXPLAIN FORMAT=JSON SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts WHERE 1=1 AND wp_posts.post_parent = 184 AND wp_posts.post_type = 'revision' AND wp_posts.post_status = 'inherit' ORDER BY wp_posts.post_date DESC LIMIT 0, 10\G
The output provided the internal cost calculations:
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "1045230.15"
},
"ordering_operation": {
"using_filesort": true,
"table": {
"table_name": "wp_posts",
"access_type": "ref",
"possible_keys": [
"type_status_date",
"post_parent"
],
"key": "post_parent",
"used_key_parts": [
"post_parent"
],
"key_length": "8",
"ref": [
"const"
],
"rows_examined_per_scan": 140520,
"rows_produced_per_join": 4215,
"filtered": "3.00",
"cost_info": {
"read_cost": "142050.00",
"eval_cost": "843.12",
"prefix_cost": "142893.12",
"data_read_per_join": "18M"
},
"used_columns": [
"ID",
"post_date",
"post_status",
"post_parent",
"post_type"
],
"attached_condition": "((`wordpress`.`wp_posts`.`post_type` = 'revision') and (`wordpress`.`wp_posts`.`post_status` = 'inherit'))"
}
}
}
}
The optimizer chose the post_parent index. It estimated finding 140,520 rows with post_parent = 184. It then applied the attached_condition for post_type and post_status in memory, filtering down to 3.00%. Finally, it executed a filesort ("using_filesort": true) to satisfy the ORDER BY wp_posts.post_date DESC clause.
However, the slow query log showed it examined 2.45 million rows, not 140,520. The optimizer's statistics were inaccurate. Furthermore, SQL_CALC_FOUND_ROWS forces MariaDB to keep scanning past the LIMIT 0, 10 boundary and walk the entire matching result set, so that FOUND_ROWS() can later report the total match count for pagination.
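For pagination, the common replacement is to drop SQL_CALC_FOUND_ROWS and issue a separate COUNT(*) that the engine can satisfy cheaply. A minimal sketch using SQLite as a stand-in (illustrative only; the table and column names here are simplified, not the WordPress schema):

```python
import sqlite3

# Stand-in schema: 50 matching "revision" rows plus a few unrelated posts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, parent INTEGER, type TEXT)")
conn.executemany("INSERT INTO posts (parent, type) VALUES (?, ?)",
                 [(184, "revision")] * 50 + [(99, "post")] * 5)

where, params = "parent = ? AND type = ?", (184, "revision")

# The page query stops at the LIMIT; the count query carries no ORDER BY
# or row payload and can typically be served from an index.
page = conn.execute(f"SELECT id FROM posts WHERE {where} "
                    "ORDER BY id DESC LIMIT 10", params).fetchall()
total = conn.execute(f"SELECT COUNT(*) FROM posts WHERE {where}", params).fetchone()[0]
print(len(page), total)  # 10 50
```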
InnoDB Buffer Pool and Disk Spill
To understand why a filesort of 140,520 rows (or even 2.45 million rows) was taking exactly 60 seconds, I needed to inspect the InnoDB storage engine status during the query execution. Sorting integer IDs should be a sub-second operation in RAM.
I ran the query in one terminal session and executed SHOW ENGINE INNODB STATUS\G in another.
Excerpt from the InnoDB status dump:
=====================================
2023-11-21 14:22:05 0x7fc1e3b2c700 INNODB MONITOR OUTPUT
=====================================
...
---TRANSACTION 482109, ACTIVE 45 sec fetching rows
mysql tables in use 1, locked 1
10034 lock struct(s), heap size 1180232, 450123 row lock(s)
MySQL thread id 45, OS thread handle 140470654056192, query id 40291 localhost wp_user Copying to tmp table
SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts WHERE 1=1 AND wp_posts.post_parent = 184 AND wp_posts.post_type = 'revision' AND wp_posts.post_status = 'inherit' ORDER BY wp_posts.post_date DESC LIMIT 0, 10
Trx read view will not see trx with id >= 482110, sees < 482108
...
Buffer pool size 8192
Buffer pool size, bytes 134217728
Free buffers 102
Database pages 7980
Old database pages 2931
Modified db pages 410
...
The thread state was Copying to tmp table.
I checked the MariaDB configuration variables related to temporary tables and buffer pools:
SHOW VARIABLES LIKE '%buffer_pool%';
SHOW VARIABLES LIKE '%table_size%';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
| max_heap_table_size | 16777216 |
| tmp_table_size | 16777216 |
+-------------------------+-----------+
The innodb_buffer_pool_size was set to 128MB. The max_heap_table_size and tmp_table_size were set to 16MB.
When MariaDB executes SQL_CALC_FOUND_ROWS combined with a filesort, it materializes the intermediate result in an in-memory temporary table. Once that data exceeds tmp_table_size (here 16MB), the table is converted to an on-disk temporary table (using the Aria storage engine by default in MariaDB).
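Whether the sort spills is simple arithmetic; a sketch using the row count and average row size measured later in this post (the premise, argued below, is that full rows including post_content reach the temporary table):

```python
# Will the sort's temporary table fit in memory? Figures are taken from the
# measurements later in this post; the premise is that full rows (including
# post_content) are carried into the temporary table.
TMP_TABLE_SIZE = 16 * 1024 * 1024   # tmp_table_size / max_heap_table_size (16MB)
AVG_ROW_BYTES  = 458_210            # avg post_content length per revision
ROWS           = 140_520            # revisions with post_parent = 184

temp_table_bytes = ROWS * AVG_ROW_BYTES
spills_to_disk = temp_table_bytes > TMP_TABLE_SIZE
print(f"{temp_table_bytes / 1e9:.1f} GB needed; spills to disk: {spills_to_disk}")
```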
The FSE wp_global_styles post type stores the entire site's configuration (colors, typography, spacing, layout) as a serialized JSON string in the post_content column. Although the SELECT query only requested the ID column, the underlying architecture of WordPress's WP_Query when retrieving revisions sometimes instantiates the full post object into the PHP object cache or requires row reads that fetch the post_content.
I needed to measure the average size of the post_content for these specific revisions.
SELECT AVG(LENGTH(post_content)) FROM wp_posts WHERE post_type = 'revision' AND post_parent = 184;
+---------------------------+
| AVG(LENGTH(post_content)) |
+---------------------------+
| 458210.14 |
+---------------------------+
The average length of the JSON string in post_content for a wp_global_styles revision was 458KB.
The ingestion script had generated 140,520 revisions for post ID 184 over the course of the testing cycle.
140,520 rows × 458KB = ~64GB of data.
While the query was ostensibly selecting only the ID column, InnoDB stores row data in the clustered index. To evaluate the WHERE clauses and build the temporary table, InnoDB loaded 16KB pages from disk into the 128MB buffer pool, and the eviction rate was extremely high as pages cycled in and out from the NVMe storage. The disk I/O saturation and the subsequent on-disk temporary table sort accounted for the roughly 60-second delay before the FPM master killed the PHP process.
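The churn can be quantified against the status dump above ("Buffer pool size 8192" pages of 16KB = 128MB); a rough sketch:

```python
# Data touched by one query vs. buffer pool capacity, per the status dump
# above ("Buffer pool size 8192" pages of 16KB = 128MB).
WORKING_SET = 140_520 * 458_210     # ~64.4GB touched by the revision scan
BUFFER_POOL = 134_217_728           # innodb_buffer_pool_size (128MB)
PAGE_SIZE   = 16 * 1024             # default InnoDB page size

pool_pages   = BUFFER_POOL // PAGE_SIZE
pages_needed = WORKING_SET // PAGE_SIZE
print(f"pool holds {pool_pages} pages; query touches ~{pages_needed:,} pages")
```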
Deconstructing Full Site Editing Revisions
To verify why a single post had 140,520 revisions, I inspected the WordPress database structure and the testing pipeline's behavior.
The wp_global_styles post type is unique in the Gutenberg architecture. Every time a user modifies a setting in the Site Editor (e.g., changing the primary button background color), an asynchronous XHR request is fired to the REST API /wp-json/wp/v2/global-styles/. This endpoint saves the entire JSON object as a new revision.
I queried the database to group the revisions by date.
SELECT DATE(post_date) as rev_date, COUNT(ID) as rev_count FROM wp_posts WHERE post_parent = 184 AND post_type = 'revision' GROUP BY DATE(post_date) ORDER BY rev_date DESC LIMIT 5;
+------------+-----------+
| rev_date | rev_count |
+------------+-----------+
| 2023-11-21 | 12401 |
| 2023-11-20 | 45102 |
| 2023-11-19 | 41005 |
| 2023-11-18 | 42012 |
+------------+-----------+
The automated CI/CD pipeline was running an end-to-end (E2E) testing framework (Playwright) that programmatically altered site editor settings to capture screenshots across 500 different viewport and color contrast permutations. Each permutation triggered a REST API PUT request. WordPress natively does not limit the number of revisions for the wp_global_styles post type unless explicitly defined via filters.
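A rough growth model is consistent with the per-day counts above; the run cadence here is a hypothetical figure chosen to match the observed ~45k revisions/day, not a measurement:

```python
# Hypothetical cadence chosen to match the observed ~45k revisions/day:
# each pipeline run saves one global-styles revision per permutation.
PERMUTATIONS_PER_RUN = 500   # viewport/contrast permutations in the suite
RUNS_PER_DAY = 90            # assumed CI frequency, not a measured value

revisions_per_day = PERMUTATIONS_PER_RUN * RUNS_PER_DAY
print(revisions_per_day)  # 45000
```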
Resolving Optimizer Inefficiency
The immediate database layer issue was the optimizer choosing the post_parent index and ignoring the post_type and post_status cardinality.
I examined the existing indexes on wp_posts:
SHOW INDEX FROM wp_posts;
Table Key_name Column_name
wp_posts PRIMARY ID
wp_posts post_name post_name
wp_posts type_status_date post_type, post_status, post_date, ID
wp_posts post_parent post_parent
wp_posts post_author post_author
The type_status_date index is a composite index. The query used WHERE post_parent = 184 AND post_type = 'revision' AND post_status = 'inherit' ORDER BY post_date DESC.
Because post_parent is not part of the type_status_date index, the optimizer decided that scanning post_parent (which had 140,520 rows) was cheaper than scanning type_status_date (which had 2.45 million rows of type 'revision').
To force a highly selective index scan that satisfies all conditions and the ORDER BY clause, I constructed a new composite index.
ALTER TABLE wp_posts ADD INDEX `parent_type_status_date` (`post_parent`, `post_type`, `post_status`, `post_date`);
I executed the EXPLAIN FORMAT=JSON again.
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "2.14"
},
"ordering_operation": {
"using_filesort": false,
"table": {
"table_name": "wp_posts",
"access_type": "ref",
"possible_keys": [
"type_status_date",
"post_parent",
"parent_type_status_date"
],
"key": "parent_type_status_date",
"used_key_parts": [
"post_parent",
"post_type",
"post_status"
],
"key_length": "132",
"ref": [
"const",
"const",
"const"
],
"rows_examined_per_scan": 10,
"rows_produced_per_join": 10,
"filtered": "100.00",
"index_condition": "((`wordpress`.`wp_posts`.`post_type` = 'revision') and (`wordpress`.`wp_posts`.`post_status` = 'inherit'))",
"cost_info": {
"read_cost": "1.00",
"eval_cost": "1.14",
"prefix_cost": "2.14",
"data_read_per_join": "4K"
},
"used_columns": [
"ID",
"post_date",
"post_status",
"post_parent",
"post_type"
]
}
}
}
}
The query cost dropped from 1045230.15 to 2.14. The using_filesort operation was eliminated (false). The index provided the exact pre-sorted data required by the ORDER BY post_date DESC clause. The execution time dropped from 60 seconds to 1.2 milliseconds.
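The principle behind the fix — leading equality columns plus a trailing ORDER BY column — can be reproduced with SQLite's EXPLAIN QUERY PLAN as a stand-in (illustrative only; SQLite's planner is not MariaDB's, but the index-shape reasoning carries over):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE wp_posts (
    ID INTEGER PRIMARY KEY, post_parent INTEGER,
    post_type TEXT, post_status TEXT, post_date TEXT)""")
conn.execute("""CREATE INDEX parent_type_status_date
    ON wp_posts (post_parent, post_type, post_status, post_date)""")

plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT ID FROM wp_posts
    WHERE post_parent = 184 AND post_type = 'revision' AND post_status = 'inherit'
    ORDER BY post_date DESC LIMIT 10""").fetchall()
for row in plan:
    print(row[-1])  # expect a SEARCH ... USING INDEX line, no temp B-tree sort
```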
PHP Memory Profiling and JSON Merge Overhead
While the index resolved the immediate 504 Gateway Timeout, I monitored the container's RAM usage during subsequent E2E pipeline executions. System memory usage spiked by 400MB during the saving phase of wp_global_styles.
To trace the PHP memory allocation, I configured the Xdebug extension in profiling mode.
zend_extension=xdebug.so
xdebug.mode=profile
xdebug.start_with_request=yes
xdebug.output_dir=/tmp/xdebug
xdebug.profiler_output_name=cachegrind.out.%p
I ran a single REST API PUT request and processed the resulting cachegrind file using qcachegrind.
The call graph revealed a high memory allocation inside WP_Theme_JSON_Resolver::get_merged_data().
When WordPress saves a global styles revision, it does not just insert a row. It must parse the incoming JSON payload, validate it against the theme.json schema provided by the active theme, and merge the user modifications with the default theme configurations.
I examined the core file /wp-includes/class-wp-theme-json.php at line 2410:
protected function merge( $incoming ) {
$this->theme_json = array_replace_recursive( $this->theme_json, $incoming->get_raw_data() );
}
The array_replace_recursive() function in PHP traverses arrays. The theme.json for FSE block themes contains deeply nested definitions for elements, blocks, color palettes, and typography scales.
I wrote a brief CLI script to extract and measure the raw array depth of the parsed JSON payload.
<?php
// check_depth.php — measure nesting depth and key count of the captured
// global-styles JSON payload (the input path here is illustrative).
$decoded = json_decode(file_get_contents('/tmp/global-styles-payload.json'), true);

function array_depth(array $array): int {
    $max_depth = 1;
    foreach ($array as $value) {
        if (is_array($value) && ($depth = array_depth($value) + 1) > $max_depth) {
            $max_depth = $depth;
        }
    }
    return $max_depth;
}

echo "Array Depth: " . array_depth($decoded) . "\n";
echo "Total Keys: " . iterator_count(new RecursiveIteratorIterator(new RecursiveArrayIterator($decoded))) . "\n";
Executing the script:
php check_depth.php
Array Depth: 12
Total Keys: 4820
Merging two arrays with 4,820 keys and 12 levels of depth consumes roughly 18MB of RAM per operation in PHP 8.3 due to hash table allocation overhead. The REST API endpoint was instantiating multiple WP_Theme_JSON objects during the validation phase, causing the 400MB memory spike per request.
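For intuition, the merge semantics are roughly those of a recursive dictionary overlay; a Python sketch approximating what array_replace_recursive() does over the theme.json tree (a simplification: PHP's function also merges numeric arrays by index, which this sketch treats as plain values):

```python
def merge_recursive(base: dict, incoming: dict) -> dict:
    """Overlay `incoming` onto `base`: nested dicts merge key-by-key,
    everything else (including lists) is replaced wholesale."""
    result = dict(base)
    for key, value in incoming.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_recursive(result[key], value)
        else:
            result[key] = value
    return result

theme_defaults = {"styles": {"color": {"background": "#fff", "text": "#111"}}}
user_changes   = {"styles": {"color": {"background": "#0af"}}}
merged = merge_recursive(theme_defaults, user_changes)
print(merged["styles"]["color"])  # background overridden, text preserved
```

Every level of nesting forces a fresh hash table allocation, which is where the per-merge memory overhead comes from.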
This behavior is intrinsic to the WordPress core architecture for FSE and cannot be altered without modifying core files. However, the performance degradation is exacerbated by the sheer volume of revisions being queried and loaded into the object cache.
Implementing Revision Control for wp_global_styles
The root cause of the data bloat was the unlimited storage of wp_global_styles revisions. Standard WordPress posts default to unlimited revisions unless defined in wp-config.php via WP_POST_REVISIONS. However, wp_global_styles bypasses standard post type settings in some contexts.
I implemented a targeted PHP filter to restrict the number of stored revisions strictly for the wp_global_styles post type. I added this to a must-use plugin (/wp-content/mu-plugins/fse-revision-control.php):
<?php
// fse-revision-control.php — cap stored revisions for wp_global_styles
add_filter( 'wp_revisions_to_keep', function ( $num, $post ) {
    if ( 'wp_global_styles' === $post->post_type ) {
        return 10;
    }
    return $num;
}, 10, 2 );
To clean up the existing 140,520 revisions and reclaim the 64GB of disk space from the InnoDB tablespace, I executed a direct SQL deletion routine. Deleting 140,000 rows in a single query can lock the table and fill the InnoDB undo log. I wrote a bash loop to delete in chunks of 1,000.
#!/bin/bash
# Delete the FSE revisions in chunks of 1,000 to keep lock time and undo
# log growth bounded. ROW_COUNT() must run in the same session as the DELETE.
while true; do
    ROWS_DELETED=$(mysql -u wp_user -D wordpress -B -N -e \
        "DELETE FROM wp_posts WHERE post_type = 'revision' AND post_parent = 184 ORDER BY ID ASC LIMIT 1000; SELECT ROW_COUNT();")
    echo "Deleted ${ROWS_DELETED} rows..."
    if [ "$ROWS_DELETED" -eq 0 ]; then
        break
    fi
    sleep 0.5
done
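The chunked-delete pattern is portable; a self-contained sketch of the same loop against SQLite (a stand-in for the mysql client, with a subquery standing in for DELETE ... LIMIT, which SQLite does not enable by default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_parent INTEGER, post_type TEXT)")
conn.executemany("INSERT INTO wp_posts (post_parent, post_type) VALUES (?, ?)",
                 [(184, "revision")] * 3500)

BATCH, batches = 1000, 0
while True:
    cur = conn.execute(
        """DELETE FROM wp_posts WHERE ID IN (
               SELECT ID FROM wp_posts
               WHERE post_type = 'revision' AND post_parent = 184
               ORDER BY ID LIMIT ?)""", (BATCH,))
    conn.commit()              # short transactions keep lock time bounded
    if cur.rowcount == 0:
        break
    batches += 1
print(batches)  # 3500 rows in batches of 1000 -> 4 iterations
```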
Once the deletion completed, the space was marked as free within the InnoDB tablespace (ibdata1 or the individual .ibd file). To return the physical space to the host OS filesystem, I rebuilt the table.
OPTIMIZE TABLE wp_posts;
+--------------------+----------+----------+-------------------------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+--------------------+----------+----------+-------------------------------------------------------------------+
| wordpress.wp_posts | optimize | note | Table does not support optimize, doing recreate + analyze instead |
| wordpress.wp_posts | optimize | status | OK |
+--------------------+----------+----------+-------------------------------------------------------------------+
Tuning MariaDB Buffer Pool Configuration
The default innodb_buffer_pool_size of 128MB was insufficient for an environment containing 2.45 million rows, even after purging the FSE revisions. The active working set size exceeded 1GB.
I recalculated the optimal buffer pool size based on the container's 16GB RAM allocation. I dedicated 10GB to the buffer pool, leaving 6GB for the OS, PHP-FPM workers, and Nginx.
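The split is nothing more than a fixed fraction of container RAM; a hypothetical helper mirroring the 10GB/6GB division (the 0.625 fraction is my choice to reproduce it, not a MariaDB recommendation):

```python
# Hypothetical helper mirroring the split used here; the 0.625 fraction is
# chosen to reproduce 10GB out of 16GB, not a MariaDB recommendation.
def buffer_pool_gb(total_ram_gb: int, db_fraction: float = 0.625) -> int:
    return int(total_ram_gb * db_fraction)

print(buffer_pool_gb(16))  # 10 -> innodb_buffer_pool_size = 10G
```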
I updated /etc/mysql/mariadb.conf.d/50-server.cnf:
innodb_buffer_pool_size = 10G
innodb_read_io_threads = 8
innodb_write_io_threads = 8
innodb_flush_log_at_trx_commit = 2
tmp_table_size = 128M
max_heap_table_size = 128M
Note that innodb_buffer_pool_instances is deliberately absent: the buffer pool rewrite in MariaDB 10.5 made the pool single-instance, and the variable was removed in 10.6, so declaring it on 10.11.4 would abort startup with an unknown-variable error. innodb_flush_log_at_trx_commit was set to 2, writing logs to the OS cache rather than synchronously flushing to disk on every commit, which improves write throughput for the automated pipeline at the risk of up to one second of data loss in the event of an OS crash. tmp_table_size and max_heap_table_size were raised to 128M to allow larger sorts to stay in RAM before spilling to disk.
I restarted MariaDB to apply the changes.
systemctl restart mariadb
Verification and Final Benchmark
To validate the changes, I re-executed the E2E Playwright test suite that manipulates the Full Site Editing UI.
I monitored the system load and network response times.
tail -f /var/log/nginx/access.log | awk '$9 == "200" || $9 == "504" {print $7, $9, $10}'
Output:
/wp-json/wp/v2/global-styles/184 200 482
/wp-json/wp/v2/global-styles/184 200 478
/wp-json/wp/v2/global-styles/184 200 481
The REST API endpoint for saving wp_global_styles modifications now returned an HTTP 200 OK status with a consistent request time of approximately 480 milliseconds, as recorded in the access log. The 60-second gateway timeout was fully resolved. The background cron tasks and the automated CI/CD pipeline completed without further I/O blocking or process termination.