Tracking XFS lock contention in PHP-FPM
Debugging latency in WP asset compilation
Background
During a standard infrastructure audit, a deviation was noted in the 99th percentile response times of a specific web node. The node in question serves a corporate client operating a site built on the Starbiz – Startup Business Agency Creative Portfolio WordPress Theme. The baseline Time To First Byte (TTFB) averaged 120 milliseconds. However, telemetry indicated intermittent spikes during which TTFB consistently landed between 3,100 and 3,400 milliseconds. The spikes occurred at irregular intervals, with no correlation to external cron invocations or standard cache expiration events.
The server runs Debian 12 with Linux kernel 6.1.0-18-amd64. The application stack consists of Nginx 1.24, PHP-FPM 8.2, and MariaDB 10.11. Storage is backed by a single PCIe Gen4 NVMe solid-state drive formatted with the XFS file system. Resource utilization metrics for CPU, memory, and network throughput remained entirely within standard operating thresholds during these latency events. System load averages did not exceed 0.8 on a 16-core allocation.
This document outlines the systematic isolation of the wait states responsible for the delayed responses.
System Architecture Specifications
To establish the operational baseline, the following configuration parameters were verified prior to diagnostic intervention.
Nginx was configured with standard asynchronous I/O settings. The worker processes were bound to specific CPU cores using worker_cpu_affinity.
worker_processes auto;
worker_rlimit_nofile 65535;
events {
worker_connections 8192;
use epoll;
multi_accept on;
}
http {
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
client_max_body_size 64M;
fastcgi_buffers 16 16k;
fastcgi_buffer_size 32k;
}
The PHP-FPM pool configuration was tuned for static process management to eliminate fork overhead during request processing.
[www]
listen = /run/php/php8.2-fpm.sock
listen.backlog = 4096
pm = static
pm.max_children = 128
pm.max_requests = 10000
request_terminate_timeout = 30s
request_slowlog_timeout = 2s
slowlog = /var/log/php-fpm/www-slow.log
MariaDB was utilizing the InnoDB storage engine with standard buffer pool allocations and standard redo log configurations.
[mysqld]
innodb_buffer_pool_size = 8G
innodb_log_file_size = 1G
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
The block device used a single-partition layout. The XFS mount options in /etc/fstab were the standard defaults from the OS installer.
/dev/nvme0n1p2 / xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
Symptom Isolation
The primary indicator of the issue was isolated within the Nginx access logs. By parsing the logs specifically for requests exceeding the two-second threshold, a pattern emerged regarding the specific endpoints affected.
awk '$NF > 2.0 { print $4, $7, $NF }' /var/log/nginx/access.log | tail -n 15
Output:

```text
[17/Apr/2026:14:22:01 /wp-admin/admin-ajax.php 3.214
[17/Apr/2026:14:27:14 / 3.301
[17/Apr/2026:14:32:05 /wp-admin/admin-ajax.php 3.198
[17/Apr/2026:14:39:11 / 3.412
[17/Apr/2026:14:44:02 / 3.205
```
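To see which endpoints dominate the slow set, the same filter can be aggregated per request URI — a small sketch, assuming the combined log format with `$request_time` appended as the final field, as in the filter above:

```shell
# Count slow (>2 s) hits per request path; $7 is the URI and $NF the
# request time in this log layout.
awk '$NF > 2.0 { hits[$7]++ } END { for (uri in hits) print hits[uri], uri }' \
    /var/log/nginx/access.log | sort -rn
```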
The delays were not isolated to complex backend queries; they affected standard frontend requests arbitrarily. To verify if the delay was occurring within the Nginx proxy layer or the PHP-FPM application layer, the PHP-FPM slow log was inspected. The slow log `request_slowlog_timeout` was configured to 2 seconds.
```text
[17-Apr-2026 14:27:14] [pool www] pid 114032
script_filename = /var/www/html/index.php
[0x00007f8a1b2c3d40] file_put_contents() /var/www/html/wp-content/themes/starbiz/inc/core/transient.php:142
[0x00007f8a1b2c3c90] update_theme_cache() /var/www/html/wp-content/themes/starbiz/inc/core/transient.php:89
[0x00007f8a1b2c3bd0] init_theme_compiler() /var/www/html/wp-includes/class-wp-hook.php:324
```
The stack trace indicated the process was stalling on a file_put_contents() operation: the application was performing a disk write during the HTTP request cycle. This pattern is common in WordPress themes that bundle SCSS-to-CSS compilers or dynamic asset bundlers which use the local disk for cache storage instead of a memory-backed object store.
A standard file_put_contents() call writing less than 50 kilobytes lands in the page cache and should complete within tens of microseconds. A three-second delay implies a block layer wait state or kernel-level lock contention.
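As a quick sanity check of raw synchronous write latency, a single 32 KB O_DSYNC write can be timed directly against the file system — a sketch using a scratch file; the path and size are illustrative:

```shell
# Write one 32 KB block with O_DSYNC, so dd does not return until the
# data has reached stable storage; dd's summary line reports elapsed time.
scratch=$(mktemp /tmp/dsync_probe.XXXXXX)
dd if=/dev/zero of="$scratch" bs=32k count=1 oflag=dsync 2>&1 | tail -n 1
rm -f "$scratch"
```

On healthy NVMe hardware this reports well under a millisecond; re-running it during a latency spike makes a controller stall directly visible.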
Kernel State Investigation
To determine the nature of the delay, it was necessary to observe the process scheduling state when the stall occurred. A continuous monitoring script was deployed to capture the wait channel (wchan) of any PHP-FPM process entering the Uninterruptible Sleep (D) state.
while true; do
for pid in $(pgrep -f "php-fpm: pool www"); do
state=$(awk '{print $3}' /proc/$pid/stat 2>/dev/null)
if [ "$state" = "D" ]; then
wchan=$(cat /proc/$pid/wchan 2>/dev/null)
echo "$(date +%H:%M:%S) PID: $pid State: $state WCHAN: $wchan" >> /var/log/d-state-trace.log
fi
done
sleep 0.1
done
Reviewing /var/log/d-state-trace.log after the next latency spike provided the specific kernel function where the process was yielding the CPU.
14:44:03 PID: 114032 State: D WCHAN: xfs_log_force_lsn
14:44:03 PID: 114032 State: D WCHAN: xfs_log_force_lsn
14:44:03 PID: 114032 State: D WCHAN: xfs_log_force_lsn
The xfs_log_force_lsn function indicates that the process is waiting for the XFS journal (log) to be flushed to the physical disk up to a specific Log Sequence Number (LSN). This occurs when an application explicitly requests synchronous I/O, or when the file system metadata reaches a threshold that mandates a synchronous flush to maintain structural integrity.
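Across a longer capture window, the trace log can be condensed to show which wait channels dominate — a small aggregation over the log lines produced by the monitoring loop above:

```shell
# Tally D-state samples per kernel wait channel; the channel that
# dominates the counts is where stalled processes spend their time.
awk '{ for (i = 1; i <= NF; i++) if ($i == "WCHAN:") chan[$(i+1)]++ }
     END { for (c in chan) print chan[c], c }' \
    /var/log/d-state-trace.log | sort -rn
```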
To analyze the call graph leading to xfs_log_force_lsn, ftrace was employed. The trace-cmd utility provides a streamlined interface for interacting with the kernel's internal ring buffers. The objective was to trace the xfs_file_fsync and xfs_log_force operations.
trace-cmd record -p function_graph -g xfs_file_fsync -g xfs_log_force -P 114032
The resulting trace data was extracted using trace-cmd report. The output reveals the exact sequence of kernel routines invoked by the PHP process.
php-fpm-114032 [004] 14:44:03.112450: funcgraph_entry:
| xfs_file_fsync() {
| xfs_ilock() {
| down_write() {
| rwsem_down_write_slowpath();
| }
| }
| xfs_log_force_lsn() {
| xlog_cil_force_lsn() {
| xlog_wait() {
| schedule();
| }
| }
| }
| }
The trace confirmed that xfs_file_fsync was being invoked. PHP's file_put_contents() does not issue an fsync of its own, so the synchronous journal flush had to originate on the file system side — for instance, a directory undergoing a synchronous metadata update.
Further inspection of the application source code in /var/www/html/wp-content/themes/starbiz/inc/core/transient.php revealed the specific implementation.
// Write the compiled asset to a uniquely named temp file, then swap it in.
$temp_file = $cache_dir . '/' . uniqid('cache_', true) . '.tmp';
file_put_contents($temp_file, $compiled_data);
rename($temp_file, $final_cache_file); // atomic replace of the live cache file
The application creates a temporary file and uses rename() to atomically replace the existing cache file. In XFS, a rename is a journaled metadata transaction against the directory; under concurrent access, the renaming process can be forced to wait until the journal has committed those changes to stable storage — precisely the xfs_log_force_lsn wait observed in the kernel traces.
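For reference, the same write-then-rename pattern in shell form — the atomic swap comes from rename(2), and the directory-entry update it performs is exactly the metadata change the XFS journal must persist (paths and contents are illustrative):

```shell
# Stage the new contents in a temp file on the same file system, then
# atomically replace the live cache file via rename(2).
cache_dir=/tmp/demo_cache            # illustrative path
mkdir -p "$cache_dir"
tmp=$(mktemp "$cache_dir/cache_XXXXXX")
printf '%s\n' 'body { color: #333; }' > "$tmp"
mv -f "$tmp" "$cache_dir/style.css"  # same-fs mv is a single rename(2) call
```

Readers always see either the old file or the complete new one, but each swap costs a journaled directory update.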
However, standard metadata updates should not stall for 3,000 milliseconds on NVMe hardware. The investigation required moving down the stack to the block layer.
Block Device Layer Analysis
To understand the hardware execution time of these journal writes, blktrace was utilized. blktrace extracts sector-level event details directly from the block layer request queue.
The trace was initiated on the specific NVMe namespace.
blktrace -d /dev/nvme0n1 -o - | blkparse -i - > /var/log/nvme_trace.txt
The trace was left running until a latency spike was registered. The output was then filtered for operations targeting the specific block ranges associated with the XFS external log or the specific allocation group handling the directory metadata.
In XFS, the log can be internal or external. In this setup, it is internal. A segment of the blkparse output during the latency event:
259:0 4 112450 14:44:03.112460 Q WS 214532096 + 64 [php-fpm]
259:0 4 112451 14:44:03.112462 G WS 214532096 + 64 [php-fpm]
259:0 4 112452 14:44:03.112464 P N [php-fpm]
259:0 4 112453 14:44:03.112466 I WS 214532096 + 64 [php-fpm]
259:0 4 112454 14:44:03.112468 D WS 214532096 + 64 [php-fpm]
259:0 4 112455 14:44:06.321450 C WS 214532096 + 64 [0]
The columns represent: Device, CPU ID, Sequence Number, Timestamp, Action, Command, Sector, Size, and Process.
The key indicators here are the timestamps and the Action flags.
- Q: Request queued by block I/O layer.
- G: Get request allocated.
- P: Plug request (holding requests to coalesce them).
- I: Inserted into request queue.
- D: Issued to hardware (driver dispatch).
- C: Completed by hardware.
The timestamp for the D (Dispatch) event was 14:44:03.112468.
The timestamp for the C (Complete) event was 14:44:06.321450.
The hardware itself took 3.208 seconds to complete a WS (Write Synchronous) operation of 64 sectors (32 KB). An NVMe drive should complete a 32 KB write in less than 50 microseconds. A delay of 3.2 seconds at the hardware level indicates a physical storage anomaly, a firmware garbage-collection stall, or queue saturation at the controller level.
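The dispatch-to-completion gap can be computed mechanically by pairing D and C events on the same sector — a sketch against the column layout shown above, assuming timestamps in HH:MM:SS.usec form within a single day:

```shell
# Pair each dispatch (D) with its completion (C) by sector number and
# print the elapsed hardware service time.
awk '
function secs(t,  p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
$5 == "D" { issued[$7] = secs($4) }            # remember dispatch time per sector
$5 == "C" && ($7 in issued) {                  # completion: report the gap
    printf "sector %s latency %.6f s\n", $7, secs($4) - issued[$7]
    delete issued[$7]
}' /var/log/nvme_trace.txt
```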
To verify queue saturation, the I/O statistics of the NVMe drive were reviewed using iostat at one-second intervals during the exact timeframe.
iostat -x -d /dev/nvme0n1 1
Output fragment:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 0.00 4820.00 0.00 77120.00 32.00 142.00 29.40 0.00 29.40 0.21 100.00
nvme0n1 0.00 0.00 0.00 5102.00 0.00 81632.00 32.00 148.00 32.10 0.00 32.10 0.19 100.00
nvme0n1 0.00 0.00 0.00 12.00 0.00 384.00 32.00 140.00 3208.00 0.00 3208.00 83.33 100.00
The w/s (writes per second) spiked to over 5,000, followed immediately by a drop to 12 w/s, while %util remained at 100%, and w_await (average wait time for write requests) spiked to 3208 milliseconds. This confirms the NVMe controller was blocking I/O processing.
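Flagging stall intervals in a long iostat capture can be automated — a sketch against the column layout above, where w_await is the 12th field, assuming the iostat output was redirected to a file (the path here is illustrative):

```shell
# Print any sample where the average write wait exceeds one second
# (w_await, field 12 in this iostat -x layout, is in milliseconds).
awk '$1 == "nvme0n1" && $12 > 1000 { print "stall:", $0 }' \
    /var/log/iostat_capture.txt
```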
Application Logic Tracing
The next step was identifying the source of the 5,000 writes per second that preceded the controller stall. While the PHP-FPM process waiting in D state was writing to a temporary file, it was the victim of the controller stall, not necessarily the cause of the IOPS flood.
fatrace (File Access Trace) was utilized to monitor system-wide file writing events to correlate with the IOPS spike.
fatrace -f W > /var/log/fatrace.log
Reviewing the log at 14:44:02 to 14:44:03:
mysqld(9881): W /var/lib/mysql/ib_logfile0
mysqld(9881): W /var/lib/mysql/ib_logfile0
mysqld(9881): W /var/lib/mysql/ib_logfile0
...[4,800 identical lines omitted] ...
The MariaDB process was issuing thousands of sequential writes to the InnoDB redo log (ib_logfile0).
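Attributing the write flood is easier when the fatrace log is tallied per process — a small aggregation over the `name(pid): W path` format shown above:

```shell
# Count write events per process name; the top entry is the IOPS source.
awk -F'(' '{ procs[$1]++ } END { for (p in procs) print procs[p], p }' \
    /var/log/fatrace.log | sort -rn | head -n 5
```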
To understand the MariaDB behavior, the MySQL general log and slow query log were analyzed. No slow queries were recorded. However, a specific table was undergoing a high frequency of single-row updates. The table was wp_options, specifically rows where option_name matched _transient_%.
The application logic flow became clear:
1. The frontend theme compilation logic required checking transient caches in the database.
2. The logic was aggressively deleting and recreating _transient_ records in the wp_options table rather than updating them.
3. Every INSERT or DELETE operation was committed individually.
4. Because innodb_flush_log_at_trx_commit was set to 1 (the default for ACID compliance), MariaDB was issuing a synchronous fsync to the redo log for every single transaction.
5. 5,000 transient operations per second resulted in 5,000 fsync commands to the NVMe drive.
6. The specific NVMe hardware controller possessed a small pseudo-SLC cache. The sustained 5,000 fsyncs/sec depleted this cache. Once depleted, the firmware initiated an aggressive garbage collection and block erase cycle to allocate new TLC/QLC blocks, effectively halting the controller's processing queue for 3.2 seconds.
7. During this 3.2-second hardware stall, any other process attempting a metadata operation that required an fsync—such as the PHP rename() operation triggering an XFS journal flush via xfs_log_force_lsn—was placed into the D-state, causing the web request to hang.
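The amplification described in steps 4–6 can be reproduced in isolation by comparing buffered writes with per-write synchronous flushes — a sketch using dd on a scratch file; block size and count are illustrative:

```shell
scratch=$(mktemp /tmp/fsync_probe.XXXXXX)
# 1,000 buffered 4 KB writes: absorbed by the page cache, near-instant.
time dd if=/dev/zero of="$scratch" bs=4k count=1000 2>/dev/null
# The same writes with O_DSYNC: one flush to stable storage per block,
# approximating the per-commit flush that
# innodb_flush_log_at_trx_commit=1 imposes on the redo log.
time dd if=/dev/zero of="$scratch" bs=4k count=1000 oflag=dsync 2>/dev/null
rm -f "$scratch"
```

The second run is typically orders of magnitude slower, and sustaining it is what exhausts a consumer drive's pseudo-SLC cache.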
Final Adjustments
The mitigation strategy required decoupling the application's transient cache generation from the persistent storage tier.
First, the MariaDB transaction commit behavior was modified to reduce synchronous block layer dispatches. The innodb_flush_log_at_trx_commit parameter was changed from 1 to 2. This setting instructs InnoDB to write to the OS file system cache on every commit, but only issue an fsync to the physical disk once per second. In the event of an OS crash, up to one second of transactions could be lost, which is an acceptable trade-off for transient cache data.
# /etc/mysql/mariadb.conf.d/50-server.cnf
[mysqld]
innodb_flush_log_at_trx_commit = 2
Second, the XFS file system mount options were adjusted to optimize the journaling mechanism for higher throughput, reducing the frequency of xlog_cil_push_work triggers. The logbsize (log buffer size) was increased from the default 32k to the maximum 256k. This allows the Committed Item List (CIL) to aggregate more metadata changes in memory before forcing a physical write to the journal.
# /etc/fstab
/dev/nvme0n1p2 / xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=256k,noquota 0 0
The server was rebooted to apply the block layer and file system adjustments cleanly. Following the implementation, the IOPS generated by MariaDB decreased from intermittent peaks of 5,000 down to a steady state of under 200 during cache invalidation routines. The NVMe controller queue remained unobstructed, and the PHP-FPM processes ceased entering the xfs_log_force_lsn wait state. P99 TTFB stabilized at 135 milliseconds.