MySQL 8.0 Optimizer Regression: Fixing Implicit Casts
The maintenance window on Tuesday called for a minor version bump on our primary database cluster, moving the nodes from MySQL 8.0.28 to 8.0.32. The process completed via our Ansible playbooks without throwing any errors. Replication caught up, the service status showed active, and the staging environment tests passed. At 04:00 AM, the monitoring system triggered a low-priority alert for elevated disk I/O on the primary storage volume.
Overall CPU utilization was hovering around 15%, and RAM usage was stable at 45% of total capacity. However, the storage layer was exhibiting a steady, elevated baseline of read operations. The affected instance hosts the WordPress backend for a design agency's portfolio site. The application was functioning normally, and HTTP response codes at the proxy layer were predominantly 200 OK. There were no user-facing timeouts. The issue was a silent degradation in query efficiency that was slowly saturating the disk controller. I logged into the database node to isolate the process responsible for the disk reads before the morning traffic ramped up.
Isolating Block Layer I/O
To pinpoint the exact resource consumption, I avoided generic monitoring dashboards and used native Linux block layer utilities. I started with iostat, passing the -x flag for extended block device statistics and sampling once per second for ten reports.
# iostat -x 1 10
Linux 5.15.0-aws (db-primary-01) 10/24/2023 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
8.45 0.00 2.12 4.51 0.00 84.92
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme1n1 4852.00 124.00 77632.00 1452.00 0.00 12.00 0.00 8.82 12.45 0.85 2.15 16.00 11.70 0.15 74.50
The output showed the nvme1n1 device hitting 74.50% utilization (%util) with a high r/s (reads per second) but relatively low w/s (writes per second). The r_await time for read operations was climbing above 12 milliseconds. For an enterprise NVMe drive directly attached via the PCIe bus, a 12ms wait time indicates that the hardware submission queues are backing up.
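As a quick consistency check on the sample above, the read throughput should equal the read rate multiplied by the average request size; the 16 KB rareq-sz corresponds to whole InnoDB pages being pulled from disk:

```shell
# rkB/s should equal r/s * rareq-sz: 4852 reads/s * 16 KiB per request
awk 'BEGIN { printf "%d\n", 4852 * 16 }'
# prints 77632, matching the rkB/s column
```

The numbers line up, which says the device is servicing a stream of full-page reads rather than small random metadata I/O.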
Next, I used iotop to map this I/O activity directly to a specific Linux process ID (PID). I ran it in batch mode to capture a clean snapshot.
# iotop -o -b -n 5
Total DISK READ: 75.81 M/s | Total DISK WRITE: 1.41 M/s
Current DISK READ: 75.81 M/s | Current DISK WRITE: 1.41 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
14820 be/4 mysql 75.80 M/s 1.20 M/s 0.00 % 45.20 % mysqld
451 be/4 root 0.00 B/s 210.00 K/s 0.00 % 0.00 % [jbd2/nvme1n1-8]
The mysqld process was consistently at the top of the list, consuming 99% of the disk read bandwidth. The database was fetching data from the physical disk instead of serving it from the InnoDB Buffer Pool in RAM. This specific access pattern suggests a query executing a full table scan that exceeds the available cache, forcing the storage engine to read pages directly from the block device.
Aggregating the Slow Query Log
I needed to identify the exact SQL statement driving the I/O operations. I dynamically adjusted the slow query log threshold on the running MySQL daemon to capture queries executing in over 0.2 seconds.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.2;
SET GLOBAL log_queries_not_using_indexes = 'ON';
After letting the system capture data for fifteen minutes, I extracted the log and processed it using Percona's pt-query-digest. This utility parses the raw text file, strips variable data to create a normalized query fingerprint, and aggregates the statistics based on execution time and frequency.
# pt-query-digest /var/log/mysql/mysql-slow.log > /tmp/query-analysis.txt
Reviewing the generated report revealed a clear outlier.
# Profile
# Rank Query ID Response time Calls R/Call V/M Item
# ==== ================== ============== ===== ====== ===== =================
# 1 0x7b8c9d0e1f2a3b4c 412.45s (78%) 512 0.80s 0.12 SELECT wp_posts wp_postmeta
# Query 1: 0.57 QPS, 0.46x concurrency, ID 0x7b8c9d0e1f2a3b4c at byte 849201
# Scores: V/M = 0.12
# Time range: 2023-10-24T04:15:00 to 2023-10-24T04:30:00
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 18 512
# Exec time 78 412s 700ms 1.1s 805ms 980ms 110ms 790ms
# Lock time 4 1.8s 100us 45ms 3ms 6ms 1ms 800us
# Rows sent 0 6.14k 0 12 12 12.00 0 12.00
# Rows examine 88 716.8M 1.4M 1.4M 1.4M 1.4M 0 1.4M
SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts
INNER JOIN wp_postmeta ON ( wp_posts.ID = wp_postmeta.post_id )
WHERE 1=1 AND wp_posts.post_type = 'portfolio_item'
AND ( wp_postmeta.meta_key = '_project_budget' AND wp_postmeta.meta_value >= 15000 )
GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 12\G
The output of pt-query-digest highlights the specific inefficiency. The query executed 512 times in the evaluation window. The average response time was 805 milliseconds. The key metric is the ratio between Rows sent (12) and Rows examine (1.4 million). The storage engine reads 1.4 million rows from the physical disk, processes them in the CPU, and discards 1,399,988 of them just to return 12 results to the application. This confirms a full table scan on the wp_postmeta table.
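The digest figures can be sanity-checked directly; both the examined-to-sent ratio and the cumulative read volume follow from them:

```shell
# Rows examined for every row returned (1.4M examined, 12 sent)
awk 'BEGIN { printf "%.0f\n", 1400000 / 12 }'   # prints 116667
# Cumulative rows read across the 512 calls in the window
awk 'BEGIN { print 512 * 1400000 }'             # prints 716800000, i.e. the 716.8M total
```

Roughly 117,000 rows read from disk per row delivered is the signature of a predicate the index cannot serve.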
The Cost-Based Optimizer and JSON Tracing
To understand why the MySQL optimizer selected a full table scan instead of using the available indexes on the post_id or meta_key columns, I ran the query with the EXPLAIN FORMAT=JSON directive. This format outputs the internal cost calculations and heuristics used by the Cost-Based Optimizer (CBO) rather than just the final execution plan.
EXPLAIN FORMAT=JSON SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts ... \G
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "284512.10"
},
"ordering_operation": {
"using_filesort": true,
"grouping_operation": {
"using_temporary_table": true,
"using_filesort": false,
"nested_loop":[
{
"table": {
"table_name": "wp_postmeta",
"access_type": "ALL",
"possible_keys":[
"post_id",
"meta_key"
],
"rows_examined_per_scan": 1400000,
"filtered": "10.00",
"cost_info": {
"read_cost": "42100.00",
"eval_cost": "14000.00",
"prefix_cost": "56100.00",
"data_read_per_join": "38M"
},
"attached_condition": "((`wordpress`.`wp_postmeta`.`meta_key` = '_project_budget') and (CAST(`wordpress`.`wp_postmeta`.`meta_value` AS DOUBLE) >= 15000.0))"
}
}
]
}
}
}
}
The JSON output displays "access_type": "ALL", confirming the sequential scan. The possible_keys array indicates that the optimizer evaluated the index on meta_key but rejected it. The root cause is exposed in the attached_condition value: (CAST(`wordpress`.`wp_postmeta`.`meta_value` AS DOUBLE) >= 15000.0).
The database schema defines meta_value as a LONGTEXT data type. The PHP application code supplied an integer value (15000) for the filter parameter instead of a quoted string ('15000'). When a string column is compared to an integer, MySQL cannot perform a direct binary comparison. It is forced to apply implicit type conversion. The engine casts the string to a floating-point double to evaluate the mathematical condition.
Implicit type conversion instantly invalidates any B-Tree index on the affected column. A B-Tree index is an ordered data structure built on the exact binary representation of the defined data type. When the engine wraps the column in a CAST() function, the resulting calculated values do not exist in the index tree. The storage engine has no alternative but to read every single row from the disk into the InnoDB Buffer Pool, apply the cast operation via the CPU in memory, and then evaluate the condition.
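The ordering mismatch behind this is easy to demonstrate outside MySQL. A B-Tree built over string values stores them in byte-wise (lexicographic) order, and that order disagrees with numeric order, so a numeric range predicate such as >= 15000 cannot be answered by walking one contiguous range of a string-ordered tree. A minimal sketch using coreutils sort:

```shell
# Byte-wise order, the order a B-Tree over a text column maintains
printf '9\n100\n15000\n' | LC_ALL=C sort    # 100, 15000, 9
# Numeric order, the order the >= 15000 predicate actually needs
printf '9\n100\n15000\n' | sort -n          # 9, 100, 15000
```

In byte order the value 9 sorts after 15000, so rows matching the numeric predicate are scattered across the tree rather than clustered in a single scannable range.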
This explains the sudden onset of the issue after the minor version upgrade. Successive 8.0 point releases have made subtle changes to the optimizer's cost model constants, such as row_evaluate_cost, and to the handling of string collations. The default utf8mb4_0900_ai_ci collation makes string comparisons slightly more expensive in the optimizer's calculation. The updated CBO calculated that traversing the index, performing secondary lookups to the clustered index to retrieve the full row, and then applying the cast under a complex collation was more costly than a sequential disk scan.
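The per-row constants implied by the earlier EXPLAIN trace can be backed out by dividing the reported costs by rows_examined_per_scan; these figures are specific to this trace, not official defaults:

```shell
# eval_cost / rows and read_cost / rows from the EXPLAIN JSON above
# (14000 / 1400000 and 42100 / 1400000)
awk 'BEGIN { printf "%.3f %.3f\n", 14000/1400000, 42100/1400000 }'   # 0.010 0.030
```

With per-row constants this small, even a modest upward shift is enough to tip a borderline plan from an index range scan to a sequential read.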
InnoDB Buffer Pool LRU Eviction
The secondary effect of this full table scan was the pollution of the InnoDB Buffer Pool. The buffer pool manages memory pages using a modified Least Recently Used (LRU) algorithm. It is divided into a "new" sublist (typically the head 5/8ths of the list) and an "old" sublist (the tail 3/8ths). When a page is read from the NVMe drive, it is inserted at the midpoint.
Because the wp_postmeta table was being sequentially scanned, InnoDB read thousands of 16KB pages into the old sublist. The application evaluated these rows immediately during the query execution, which promoted them to the head of the new sublist. This mechanism flushed frequently accessed, highly valuable index pages belonging to other tables out of the buffer pool.
I verified this by querying the information_schema.innodb_metrics table.
SELECT name, count FROM information_schema.innodb_metrics
WHERE name IN ('buffer_pool_reads', 'buffer_pool_read_requests', 'buffer_pages_evicted');
+---------------------------+-------------+
| name | count |
+---------------------------+-------------+
| buffer_pool_reads | 845192 |
| buffer_pool_read_requests | 248591024 |
| buffer_pages_evicted | 691024 |
+---------------------------+-------------+
The high count for buffer_pages_evicted correlated directly with the execution of the slow query. This eviction forced other, properly indexed queries to fetch their data from the disk, creating the elevated 74.50% I/O utilization baseline we initially noticed in the iostat output.
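Note that these counters are cumulative since server start, so the aggregate buffer pool hit ratio still looks healthy even during the incident; this is why the problem surfaced as disk %util rather than as an obvious cache miss rate:

```shell
# hit% = (1 - buffer_pool_reads / buffer_pool_read_requests) * 100
awk 'BEGIN { printf "%.2f\n", (1 - 845192/248591024) * 100 }'   # 99.66
```

A 99.66% lifetime hit ratio masks a short, intense burst of physical reads; the eviction counter is the more honest signal here.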
Schema Remediation via Virtual Generated Columns
Modifying the application code to strictly type the query parameters (e.g., quoting the integer so it is supplied as a string) was not a viable immediate fix. The query originated from an obfuscated function inside a third-party integration plugin bundled with the theme, and an application-level patch would be overwritten by the next vendor update. The fix needed to reside at the database schema level.
MySQL supports Virtual Generated Columns. A virtual column computes its value from an expression based on other columns in the table on the fly. It consumes no physical disk space for the data itself. Importantly, you can build a secondary B-Tree index directly on a virtual column.
I executed an ALTER TABLE statement to create a virtual column that safely casts the meta_value to an unsigned integer.
-- Add a virtual column that attempts to cast the string to an unsigned integer
-- The CASE statement uses a regular expression to ensure we only cast strings containing digits
ALTER TABLE wp_postmeta
ADD COLUMN meta_value_int BIGINT UNSIGNED GENERATED ALWAYS AS (
CASE
WHEN meta_value REGEXP '^[0-9]+$' THEN CAST(meta_value AS UNSIGNED)
ELSE NULL
END
) VIRTUAL;
The REGEXP '^[0-9]+$' condition ensures that the cast operation only executes on rows where the meta_value is purely numeric, preventing data truncation warnings or errors on standard text metadata.
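grep -E uses the same POSIX ERE syntax as MySQL's REGEXP operator, so the guard can be exercised against representative meta_value shapes (the sample values below are hypothetical):

```shell
# Only the purely numeric value survives the '^[0-9]+$' guard;
# serialized arrays, decimals, and negatives map to NULL in the virtual column
printf '15000\na:1:{i:0;s:4:"demo";}\n12.5\n-3\n' | grep -E '^[0-9]+$'   # 15000
```

Mapping non-numeric values to NULL also keeps them out of the new index entirely, which keeps the index small on a table dominated by serialized text metadata.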
With the virtual column established, I created a composite covering index.
-- Create a specific composite index utilizing the new virtual column
ALTER TABLE wp_postmeta ADD INDEX idx_meta_key_int_val (meta_key(32), meta_value_int, post_id);
This index physically stores the pre-calculated integer values within its B-Tree nodes. The MySQL optimizer can substitute an indexed generated column for a matching expression in a WHERE clause. When the application submits the flawed query with the implicit cast, the optimizer matches the predicate against the definition of the meta_value_int virtual column and reroutes the execution plan to traverse the new secondary index.
Re-running the EXPLAIN FORMAT=JSON confirmed the fix.
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "14.20"
},
"table": {
"table_name": "wp_postmeta",
"access_type": "ref",
"possible_keys":[
"post_id",
"meta_key",
"idx_meta_key_int_val"
],
"key": "idx_meta_key_int_val",
"used_key_parts":[
"meta_key",
"meta_value_int"
],
"rows_examined_per_scan": 12,
"filtered": "100.00",
"using_index": true
}
}
}
The access_type shifted from ALL to ref. The rows_examined_per_scan dropped from 1.4 million to exactly 12. The using_index flag evaluated to true, indicating that the storage engine retrieved all necessary data entirely from the index tree in RAM, without executing a secondary lookup to the clustered primary key index (the actual table rows). The query execution time dropped to 1.8 milliseconds, and the block layer I/O pressure immediately subsided.
Shielding the Database with FastCGI Caching
While the ALTER TABLE command was running (which took several minutes due to the size of the table and the creation of the index structure), I needed to shield the storage layer from further incoming requests generated by users browsing the portfolio. I implemented a targeted caching rule at the Nginx edge proxy.
Using the fastcgi_cache module, I defined a specific memory zone and applied it to the exact URI path handling the portfolio filtering.
# /etc/nginx/nginx.conf
http {
# Define the memory zone for the FastCGI cache
fastcgi_cache_path /var/run/nginx-fastcgi-cache levels=1:2 keys_zone=PORTFOLIO_CACHE:100m inactive=60m use_temp_path=off;
fastcgi_cache_key "$scheme$request_method$host$request_uri";
}
# /etc/nginx/sites-available/portfolio.conf
server {
listen 443 ssl http2;
server_name design.agency.internal;
location /portfolio/filter/ {
fastcgi_pass unix:/run/php/php8.2-fpm.sock;
include fastcgi_params;
# Enable the cache zone
fastcgi_cache PORTFOLIO_CACHE;
fastcgi_cache_valid 200 301 302 5m;
# Serve a stale cached response to the user while a background request refreshes the cache
fastcgi_cache_use_stale updating error timeout invalid_header http_500;
fastcgi_cache_background_update on;
# Prevent the cache stampede effect
# Only one request is forwarded to the PHP backend to build the cache; others wait
fastcgi_cache_lock on;
fastcgi_cache_lock_timeout 5s;
add_header X-Cache-Status $upstream_cache_status;
}
}
With fastcgi_cache_use_stale updating and fastcgi_cache_background_update on, Nginx serves a slightly stale cached response to the user while a single background subrequest to the PHP-FPM backend refreshes the cache. The fastcgi_cache_lock on directive is the critical component here: if 50 users request the same filtered view simultaneously while the cache is expired, exactly one request hits the database to rebuild the cache. The other 49 wait in Nginx memory until the cache is populated, effectively reducing the query volume hitting the database to one request per cache TTL window.
Unix Socket Queue Inspection
To verify the health of the application layer during the mitigation and the index build, I inspected the Unix domain socket connecting Nginx to PHP-FPM. I used the ss (Socket Statistics) utility to check the socket backlog.
# ss -lx | grep php8.2-fpm.sock
u_str LISTEN 0 4096 /run/php/php8.2-fpm.sock 142918 * 0
The output shows the Recv-Q (receive queue) and Send-Q (send queue). The Send-Q reflects the listen.backlog parameter defined in the PHP-FPM pool configuration, which is set to 4096 in this environment. The Recv-Q was at 0. This indicated that the PHP worker processes were successfully keeping pace with the incoming requests, and no connections were waiting in the kernel buffer. If the Recv-Q had been consistently high, it would imply the workers were blocked on the slow database queries, which would eventually lead to the socket queue overflowing and Nginx returning 502 Bad Gateway errors. The proxy caching rule successfully isolated the workers, and the subsequent database index fix permanently resolved the latency source.
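The queue fields can also be pulled out programmatically for trend monitoring; this parses the literal line captured above (field positions assume this iproute2 output layout):

```shell
# Recv-Q and Send-Q are the third and fourth columns for a listening unix socket
echo 'u_str LISTEN 0 4096 /run/php/php8.2-fpm.sock 142918 * 0' \
  | awk '{ print "recv-q=" $3, "send-q=" $4 }'   # recv-q=0 send-q=4096
```

Feeding `ss -lx` through the same awk filter in a watch loop gives a cheap early-warning signal for worker saturation.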
; /etc/php/8.2/fpm/pool.d/www.conf
; Verify the static process configuration
pm = static
pm.max_children = 256
pm.max_requests = 5000