JBD2 commit stalls during WP transient cache churn

Ext4 journal fragmentation and MySQL fsync latencies

Environment: RHEL 9.2, Kernel 5.14.0-284.11.1.el9_2.x86_64. Hardware: Bare metal, Dual Xeon Gold 6248R, 512GB ECC RAM, 4x 1.92TB NVMe PCIe Gen4 in software RAID 10 (mdadm). Database: MariaDB 10.11.4.

Routine metric ingestion via Prometheus Node Exporter highlighted a subtle anomaly in disk I/O latency on /dev/md0. Over a 14-day rolling window, the 95th percentile await metric for block device writes degraded from 0.8ms to 42ms, with localized peaks reaching 115ms. Read latencies remained unaffected at a baseline of 0.08ms. CPU iowait spiked to 12% on specific cores during these write latency peaks.

The application stack running on this hardware is a single staging WordPress instance using the S.King theme (marketed as a personal stylist and fashion blogger theme). There were no error logs generated by the operating system, MariaDB, or PHP-FPM, and queue depths on the NVMe devices were well within hardware limits.

To isolate the storage stack layer responsible for the latency, I bypassed higher-level monitoring and utilized blktrace to capture block layer events directly from the kernel ring buffer.

I initiated a trace targeting write operations on one of the underlying NVMe block devices (/dev/nvme0n1), then filtered the parsed events for requests whose dispatch-to-completion latency exceeded 30ms.

# blktrace -d /dev/nvme0n1 -a write -o - | blkparse -q -i -

  259,0    3      123.456789101   8452  Q  WS 12582912 + 8 [jbd2/md0-8]
  259,0    3      123.456791202   8452  G  WS 12582912 + 8 [jbd2/md0-8]
  259,0    3      123.456792301   8452  I  WS 12582912 + 8 [jbd2/md0-8]
  259,0    3      123.456795405   8452  D  WS 12582912 + 8 [jbd2/md0-8]
  259,0    3      123.571810500     0  C  WS 12582912 + 8 [0]

The output shows the lifecycle of a single write request. The columns are: major/minor device number, CPU core ID, timestamp, PID, trace action (Q=queued, G=get request, I=inserted into scheduler, D=dispatched to driver, C=completed), RWBS flags (W=write, S=synchronous), starting sector, number of 512-byte sectors, and process name.

The critical delta is between the D (Dispatch to device) and C (Complete from device) events. Timestamp 123.571810500 (Complete) - 123.456795405 (Dispatch) = 0.115015095 seconds, or 115ms.
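The per-request latency can also be extracted programmatically rather than by eyeballing timestamps. A minimal sketch, assuming the blkparse column layout shown above (device, CPU, timestamp, PID, action, RWBS, sector):

```python
# Pair blkparse D (dispatch) and C (complete) events by starting sector
# and report requests whose completion latency exceeds a threshold.
def slow_completions(blkparse_lines, threshold_s=0.030):
    dispatched = {}  # starting sector -> dispatch timestamp
    slow = []
    for line in blkparse_lines:
        fields = line.split()
        if len(fields) < 7:
            continue
        ts, action, sector = float(fields[2]), fields[4], fields[6]
        if action == "D":
            dispatched[sector] = ts
        elif action == "C" and sector in dispatched:
            latency = ts - dispatched.pop(sector)
            if latency >= threshold_s:
                slow.append((sector, latency))
    return slow

trace = [
    "259,0 3 123.456795405 8452 D WS 12582912 + 8 [jbd2/md0-8]",
    "259,0 3 123.571810500    0 C WS 12582912 + 8 [0]",
]
print(slow_completions(trace))  # one stalled request: sector 12582912, ~0.115s
```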

The PID responsible for these stalled I/O submissions was 8452, which maps to jbd2/md0-8, the kernel thread of JBD2 (Journaling Block Device, version 2), the journaling layer used by ext4. The WS flag marks a synchronous write.

When an ext4 filesystem is mounted with the default data=ordered mode, metadata is committed to the journal, but data blocks are forced to be written to the main filesystem before the metadata is marked as committed. JBD2 handles the transaction lifecycle. A transaction passes through several states: T_RUNNING, T_LOCKED, T_FLUSH, T_COMMIT, and T_FINISHED.

I utilized the kernel's ftrace mechanism via trace-cmd to observe the internal state transitions of the JBD2 transactions on /dev/md0.

# trace-cmd record -e jbd2:jbd2_run_stats -e jbd2:jbd2_commit_flushing -e jbd2:jbd2_commit_logging
# trace-cmd report

jbd2/md0-8-8452  [003] 1456.123456: jbd2_commit_flushing: dev 9,0 transaction 451239 sync_commit 1
jbd2/md0-8-8452  [003] 1456.123489: jbd2_commit_logging:  dev 9,0 transaction 451239
jbd2/md0-8-8452  [003] 1456.238510: jbd2_run_stats:       dev 9,0 transaction 451239 tid 451239 wait 0 run 12 lock 0 flushing 115 logging 0 handle_count 1 blocks 2 blocks_logged 2

The jbd2_run_stats event reveals that transaction 451239 took 115ms in the flushing phase (flushing 115). The handle_count is 1, meaning only one filesystem modification was part of this transaction. blocks 2 indicates a very small write footprint (8KB, assuming 4KB blocks).
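The key=value pairs in the jbd2_run_stats line are simple to pull apart; a small sketch of the parsing used to confirm that the flushing phase dominates:

```python
# Extract the per-phase timings (in ms) from a jbd2_run_stats trace line.
# Assumes the "key value" pair layout shown in the trace-cmd output above.
def jbd2_phase_times(line):
    fields = line.split()
    stats = {}
    for key in ("wait", "run", "lock", "flushing", "logging"):
        if key in fields:
            stats[key] = int(fields[fields.index(key) + 1])
    return stats

line = ("jbd2/md0-8-8452  [003] 1456.238510: jbd2_run_stats: dev 9,0 "
        "transaction 451239 tid 451239 wait 0 run 12 lock 0 flushing 115 "
        "logging 0 handle_count 1 blocks 2 blocks_logged 2")
print(jbd2_phase_times(line))  # flushing (115ms) dwarfs every other phase
```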

JBD2 was stalling on the cache-flush barrier it issues as part of the commit sequence, which forces the NVMe drive's volatile write cache to be persisted to non-volatile media before the commit record is considered durable.

To correlate this filesystem behavior with the application layer, I needed to identify which process was issuing the fsync() or fdatasync() calls against files on the filesystem backed by /dev/md0. I configured the auditd subsystem to log those syscalls.

# auditctl -a always,exit -F arch=b64 -S fdatasync -S fsync
# ausearch -sc fdatasync

type=SYSCALL msg=audit(1698765432.123:4567): arch=c000003e syscall=75 success=yes exit=0 a0=1a a1=0 a2=0 a3=0 items=0 ppid=1 pid=1405 auid=unset uid=27 gid=27 euid=27 suid=27 fsuid=27 egid=27 sgid=27 fsgid=27 tty=(none) ses=unset comm="mariadbd" exe="/usr/sbin/mariadbd" key=(null)

The PID 1405 is the mariadbd process. The a0=1a parameter is the file descriptor (FD 26). Checking /proc/1405/fd/26 revealed it mapped to /var/lib/mysql/ib_logfile0, the InnoDB redo log.
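The descriptor lookup can be scripted. A sketch (Linux-specific, relying on the same /proc symlinks used manually above; the audit field a0 carries the descriptor in hexadecimal):

```python
import os
import tempfile

def fd_target(pid, fd_hex):
    """Resolve an audit-logged descriptor (a0, hexadecimal) via /proc."""
    fd = int(fd_hex, 16)  # 0x1a == 26
    return os.readlink(f"/proc/{pid}/fd/{fd}")

# Demonstrated against our own process with a scratch file; run against
# PID 1405 with a0=1a, this resolved to /var/lib/mysql/ib_logfile0.
with tempfile.NamedTemporaryFile() as f:
    print(fd_target(os.getpid(), format(f.fileno(), "x")))
```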

MariaDB's default configuration for durability is innodb_flush_log_at_trx_commit = 1. This mandates that the InnoDB redo log buffer is written to the file and flushed to disk (fsync or fdatasync) after every transaction commit.
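For reference, the variable and its alternatives (values 0 and 2 trade durability for fewer flushes; they were deliberately not applied here, since the write source itself was the problem). The path and section name below are illustrative:

```ini
# /etc/my.cnf.d/server.cnf (illustrative)
[mysqld]
# 1 (default): write + fdatasync the redo log at every commit -- full ACID
# 2: write at commit, flush ~once/sec -- OS crash/power loss can lose <=1s
# 0: write + flush ~once/sec -- even a mariadbd crash can lose <=1s
innodb_flush_log_at_trx_commit = 1
```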

The storage stack latency was not a hardware failure. The NVMe drives were executing the cache flushes exactly as requested. The issue was the frequency and volume of these flushes originating from the database engine.

I attached to the MariaDB instance and queried the internal InnoDB metrics to understand the transaction rate.

SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';

Over a 60-second polling interval, Innodb_data_fsyncs increased by 4,200. This equates to 70 fsyncs per second. While 70 IOPS is negligible for an NVMe drive regarding throughput, 70 strict fsync operations per second, each forcing a cache serialization barrier down to the flash translation layer (FTL), causes queue congestion and escalates the completion time of individual requests.
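Turning the counter deltas into a rate is simple arithmetic; a sketch of the calculation behind the 70 fsyncs/second figure (the redo-bytes sample here is illustrative, not taken from the incident):

```python
# Convert deltas of Innodb_data_fsyncs and Innodb_os_log_written over a
# polling interval into a flush rate and an average redo payload per flush.
def flush_profile(fsyncs_delta, log_bytes_delta, interval_s):
    rate = fsyncs_delta / interval_s
    payload = log_bytes_delta / fsyncs_delta if fsyncs_delta else 0
    return rate, payload

rate, payload = flush_profile(4200, 4200 * 512, 60)
print(f"{rate:.0f} fsyncs/s, {payload:.0f} redo bytes per flush")
```

A tiny payload per flush is the signature of the problem: the device is being serialized by barriers, not saturated by data.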

I temporarily enabled the MariaDB slow query log with the threshold set to zero to capture all statements, then filtered out SELECT queries to isolate data modifications.

# tail -n 50 /var/log/mariadb/mariadb-slow.log

# User@Host: dbuser[dbuser] @ localhost []
# Query_time: 0.000312  Lock_time: 0.000045 Rows_sent: 0  Rows_examined: 1
SET timestamp=1698765432;
UPDATE `wp_options` SET `option_value` = '1698765552' WHERE `option_name` = '_transient_timeout_sking_instagram_feed_cache';

# User@Host: dbuser[dbuser] @ localhost []
# Query_time: 0.000298  Lock_time: 0.000041 Rows_sent: 0  Rows_examined: 1
SET timestamp=1698765432;
UPDATE `wp_options` SET `option_value` = 'a:45:{...}' WHERE `option_name` = '_transient_sking_instagram_feed_cache';

The database was processing a continuous stream of UPDATE statements against the wp_options table, specifically targeting keys prefixed with _transient_.

To understand the mechanics of this operation, it is necessary to examine the physical data structures of the InnoDB storage engine. The wp_options table uses the DYNAMIC row format (the InnoDB default in current MariaDB releases). When an UPDATE modifies a row, InnoDB performs the following sequence:

  1. The specific 16KB index page containing the target row is located in the Buffer Pool. If it is not in RAM, a physical read is issued.
  2. The page is latched (X-latch).
  3. The undo log record is generated, detailing how to revert the change. The undo log itself resides in an InnoDB page, which is also latched and modified.
  4. The redo log record is generated, detailing the physical bytes changed on both the index page and the undo log page. This record is written to the log buffer in memory.
  5. The actual bytes within the 16KB index page in the Buffer Pool are modified. The page is now marked as "dirty" in the flush list.
  6. The latches are released.
  7. The transaction commits. Because innodb_flush_log_at_trx_commit = 1, the database engine issues a write() system call to move the redo log buffer to the OS page cache for ib_logfile0, immediately followed by an fdatasync() system call.

The fdatasync() blocks the thread until the kernel, JBD2, the block layer, the NVMe driver, and the physical SSD controller confirm the data is residing in non-volatile flash memory.
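The arithmetic of this serialization explains the collapse. A simplified model, ignoring InnoDB group commit (which can batch concurrent commits into a single flush):

```python
# If every commit must wait for one device cache flush, the flush
# latency caps the sustainable commit rate.
def max_commit_rate(flush_latency_s):
    return 1.0 / flush_latency_s

healthy = max_commit_rate(0.0008)  # 0.8ms flushes -> ~1250 commits/s
stalled = max_commit_rate(0.115)   # 115ms flushes -> ~8.7 commits/s
print(round(healthy), round(stalled, 1))
```

Once flush latency degraded past roughly 14ms, the ceiling dropped below the observed demand of 70 commits per second and the queue could only grow.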

The application layer was executing these transactions at a high frequency. A review of the theme's execution path revealed the source: like many commercial WordPress themes, this one ships with bundled utility plugins and social media integrations.

In this instance, a bundled social media widget was configured to fetch external API data. Due to an API rate limiting error on the external provider, the remote endpoint was returning an HTTP 429 status code. The widget's error handling logic caught the exception but then wrote the boolean false or an empty array to the WordPress transient cache to prevent immediate re-fetching. However, the cache expiration time (the timeout transient) was being set to 60 seconds.

With approximately 80 concurrent visitors navigating the site, the cache was expiring every minute. Upon expiration, multiple PHP-FPM worker processes simultaneously evaluated the transient as expired, attempted to fetch the API, received the 429 error, and subsequently issued two separate UPDATE queries (one for the data, one for the timeout) to the wp_options table.
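The expiry behavior described above is a classic cache stampede. A minimal sketch of the standard mitigation, a single-flight lock with a re-check so only one worker refreshes an expired entry (names are illustrative, not the theme's API):

```python
import threading
import time

cache = {}
refresh_lock = threading.Lock()
db_writes = 0  # stands in for UPDATEs against wp_options

def get_feed(fetch, ttl=60):
    """Return the cached feed; only one thread refreshes an expired entry."""
    global db_writes
    entry = cache.get("feed")
    if entry and entry[1] > time.monotonic():
        return entry[0]
    if refresh_lock.acquire(blocking=False):  # single flight
        try:
            entry = cache.get("feed")
            if not entry or entry[1] <= time.monotonic():  # re-check under lock
                cache["feed"] = (fetch(), time.monotonic() + ttl)
                db_writes += 1
        finally:
            refresh_lock.release()
    entry = cache.get("feed")
    return entry[0] if entry else None

workers = [threading.Thread(target=get_feed, args=(lambda: [],)) for _ in range(8)]
for w in workers: w.start()
for w in workers: w.join()
print(db_writes)  # 1 -- eight concurrent workers, a single refresh
```

Without the lock and re-check, every worker that observes the expired entry issues its own pair of UPDATEs, which is exactly the pattern in the slow query log.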

Because InnoDB uses row-level locking, these parallel updates to the same or adjacent rows in the wp_options clustered index caused localized contention. However, the primary bottleneck was the sequential fdatasync() operations to the redo log. Every single PHP worker executing update_option() triggered a synchronous block device flush.

The interaction between the kernel page cache and the NVMe device under these conditions creates a specific hardware-level behavior. When an fdatasync is issued, the NVMe controller receives a Flush command. If the SSD has a power-loss protection (PLP) capacitor, it can acknowledge the flush immediately while the data is still in its DRAM cache. If it lacks a PLP capacitor, or if the firmware behaves conservatively, it must physically program the NAND flash pages before sending the completion signal. The 115ms completion times observed in blktrace indicate that the NVMe controller's internal garbage collection or block erase cycles (which take significantly longer than page writes) were being induced by the rapid succession of forced flush barriers, causing subsequent flush commands to stall in the hardware queue.

The resolution requires decoupling the volatile transient data from the persistent relational database storage. WordPress provides the Object Cache API precisely to intercept these operations: with a persistent object cache backend installed, set_transient() and get_transient() are served from the cache instead of the wp_options table. I deployed Redis as that backend, configured as a pure in-memory cache with persistence disabled:

; /etc/redis/redis.conf
maxmemory 256mb
maxmemory-policy allkeys-lru
appendonly no
save ""
