NFSv4 delegation recalls freezing PHP worker threads

Debugging stateid sequencing in the NFSv4 locking path over TCP

Environment: Ubuntu 22.04 LTS, Kernel 5.15.0-1042-aws. Hardware: 3x AWS EC2 c6g.2xlarge (Graviton2 ARM64). Storage: AWS Elastic File System (EFS) mounted via NFSv4.1. Application: Nginx 1.24.0, PHP-FPM 8.1, Redis 7.0.

Application metric ingestion revealed a pattern of intermittent latency spikes isolated to specific POST requests. During these intervals, the 99th percentile response time degraded from 120ms to precisely 5,000ms. CPU utilization across the cluster remained under 15%. Redis latency was 0.2ms. The relational database showed no slow queries or lock waits. The application nodes were otherwise idle.

The cluster serves a non-profit organization utilizing the CharityHeart - Charity Responsive WordPress Theme. The latency spikes correlated exclusively with administrators updating specific donation campaign configurations or layout settings in the backend.

I examined the php-fpm slow execution log (/var/log/php8.1-fpm-slow.log) on one of the application nodes.

[24-Oct-2023 14:12:31]  [pool www] pid 14230
script_filename = /var/www/html/wp-admin/admin-ajax.php
[0x0000ffffa0e1b210] flock() /var/www/html/wp-content/themes/charityheart/inc/core/compiler.php:142
[0x0000ffffa0e1b120] save_compiled_css() /var/www/html/wp-content/themes/charityheart/inc/core/options.php:512
[0x0000ffffa0e1b030] update_theme_options() /var/www/html/wp-admin/includes/ajax-actions.php:2104

The stack trace isolated the stall to a flock() call inside a CSS compiler routine.

To determine the state of the kernel during this 5-second stall, I executed an interactive session to catch a stuck process in real-time. When a latency alert triggered, I identified the blocked PHP-FPM PID and dumped its kernel stack trace.
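Locating the blocked worker can be scripted rather than done by hand. A minimal sketch scanning for php-fpm processes in uninterruptible sleep (the php-fpm process name is taken from this environment; during a stall the wchan column shows the NFS wait function):

```shell
# List php-fpm workers stuck in uninterruptible sleep (state D) together with
# the kernel symbol they are blocked in (wchan). A worker blocked inside the
# NFS client shows up here for the duration of the stall.
ps -eo pid,stat,wchan:32,comm --no-headers | awk '$2 ~ /^D/ && $NF == "php-fpm"'
```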

# cat /proc/14230/stack
[<0>] nfs_wait_bit_killable+0x24/0x90 [nfs]
[<0>] __wait_on_bit+0x70/0xb0
[<0>] out_of_line_wait_on_bit+0x94/0xb0
[<0>] nfs4_wait_clnt_recover+0x60/0x80 [nfsv4]
[<0>] nfs4_delay+0x44/0xd0 [nfsv4]
[<0>] nfs4_proc_setlk+0xd4/0x150 [nfsv4]
[<0>] nfs_lock+0x140/0x2b0 [nfs]
[<0>] vfs_lock_file+0x24/0x60
[<0>] fcntl_setlk+0x1a0/0x330
[<0>] sys_flock+0x11c/0x180
[<0>] __arm64_sys_flock+0x24/0x34
[<0>] invoke_syscall+0x48/0x114
[<0>] el0_svc_common.constprop.0+0x44/0xec
[<0>] do_el0_svc+0x2c/0xc0
[<0>] el0_svc+0x28/0x70
[<0>] el0t_64_sync_handler+0xa8/0x130
[<0>] el0t_64_sync+0x1a0/0x1a4

The process is blocked in nfs4_wait_clnt_recover. It is attempting to acquire a POSIX lock over NFSv4 (nfs4_proc_setlk), but the NFSv4 client state manager is in a recovery loop or waiting for an operation to complete.

I queried the /proc/locks interface to view current file locks.

# grep -i "14230" /proc/locks
1: POSIX  ADVISORY  WRITE 14230 00:31:123456 0 EOF

The lock is an advisory write lock. It is listed as POSIX even though PHP called flock(): on NFS mounts the kernel emulates flock() with a byte-range (POSIX) lock so the server can arbitrate it. The device 00:31 corresponds to the NFS mount.
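Mapping that device number back to a mount point can be done through /proc/self/mountinfo, where field 3 is the major:minor pair and field 5 is the mount point (mountinfo drops the leading zero, so 00:31 appears as 0:31). A small helper sketch:

```shell
# Resolve a MAJOR:MINOR pair from /proc/locks to its mount point.
# Field 3 of /proc/self/mountinfo is the device, field 5 the mount point.
dev_to_mount() {
    awk -v dev="$1" '$3 == dev { print $5; exit }' /proc/self/mountinfo
}

# Prints the mount point if device 0:31 exists on this host.
dev_to_mount "0:31"
```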

To understand why the lock request stalled for exactly 5 seconds, it is necessary to trace the RPC (Remote Procedure Call) sequence over the network. I initiated tshark on the specific node attempting the write operation, filtering for NFS traffic on TCP port 2049 (NFSv4 uses a single well-known port; the portmapper is not involved).

# tshark -i eth0 -f "port 2049" -V -Y "nfs"

When the theme compiler initiated the write sequence, the packet capture revealed the following transaction flow:

  1. Client -> Server: NFS V4 Call OPEN (Read/Write)
  2. Server -> Client: NFS V4 Reply OPEN (Status: NFS4_OK, StateID: 0x1a2b3c)
  3. Client -> Server: NFS V4 Call LOCK (Type: Write, StateID: 0x1a2b3c)
  4. Server -> Client: NFS V4 Reply LOCK (Status: NFS4ERR_DELAY)
  5. (Client waits exactly 5 seconds)
  6. Client -> Server: NFS V4 Call LOCK (Type: Write, StateID: 0x1a2b3c)
  7. Server -> Client: NFS V4 Reply LOCK (Status: NFS4_OK)

The 5-second stall is not a network timeout. It is an explicit protocol backoff. The NFS server (AWS EFS) is returning NFS4ERR_DELAY. According to RFC 5661, NFS4ERR_DELAY instructs the client to wait and retry the operation later because the server is currently processing a conflicting state change. The Linux kernel NFS client defaults to a 5-second sleep when encountering this error during a locking operation, which aligns perfectly with the application metrics.

The critical question is why the EFS server is returning NFS4ERR_DELAY.

This behavior is rooted in the NFSv4 Delegation subsystem. Delegations are a performance optimization. When a client opens a file, the NFS server can grant a "Read Delegation" or a "Write Delegation". If Node A is granted a Read Delegation for compiled.css, it can cache the file content and its attributes locally. It does not need to contact the server for subsequent read() or stat() calls.

If Node B suddenly attempts to acquire a Write lock (LOCK Type: Write) on compiled.css, the server cannot immediately grant the lock. Node A holds a Read Delegation. If Node B modifies the file, Node A's cached data becomes invalid.

Therefore, the server must recall the delegation from Node A before granting the lock to Node B. 1. Server receives LOCK from Node B. 2. Server replies NFS4ERR_DELAY to Node B. 3. Server initiates a CB_RECALL (Callback Recall) RPC call to Node A. 4. Node A flushes any pending operations, invalidates its cache, and sends a DELEGRETURN to the server. 5. Node B's 5-second timer expires; it retries the LOCK. 6. Server, having reclaimed the delegation, replies NFS4_OK.
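The recall/return cycle above also leaves a client-side audit trail in the per-mount operation counters. A sketch reading /proc/self/mountstats (the DELEGRETURN counter only appears under NFSv4 mounts, so on a host without one this prints nothing):

```shell
# Print the DELEGRETURN op counter for each NFSv4 mount. In mountstats,
# "device ... mounted on <mnt> ..." lines name the mount (field 5) and the
# per-op statistics lines follow; a rising DELEGRETURN count on Node A
# corresponds to recalls triggered by lock requests from other nodes.
awk '/^device/ { mnt = $5 } /DELEGRETURN:/ { print mnt, $0 }' /proc/self/mountstats
```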

I cross-referenced this mechanism with the theme's standard operational behavior when administrators update assets that rely on dynamic stylesheet generation. The compiler logic typically checks the modification time of the source .scss files. If a change is detected, it opens the target .css file with w+, requests an exclusive lock (flock($fp, LOCK_EX)), writes the new buffer, and unlocks it.

In a load-balanced environment, Node A might serve the frontend, acquiring read delegations on the static assets. Node B, serving the backend admin panel, executes the theme compiler. Node B hits the NFS4ERR_DELAY while the EFS infrastructure recalls the delegation from Node A.

To verify the delegation conflict, I used rpcdebug on Node A to monitor the NFS callback channel.

# rpcdebug -m nfs -s callback
nfs: NFS call  callback
# dmesg -w | grep -i callback

[ 8532.124501] NFS: nfs4_callback_recall: State ID 0x9f8e7d6c
[ 8532.124512] NFS: nfs_delegation_claim_opens: returning delegation for inode 00:31:123456
[ 8532.124831] NFS: nfs4_cb_recall complete

Node A is indeed receiving and processing the CB_RECALL. The system is functioning exactly as the protocol dictates. The protocol is simply incompatible with the synchronous latency requirements of an interactive web application. The 5-second NFS4ERR_DELAY backoff is hardcoded in the Linux kernel NFS client's locking retry logic (nfs4_delay function) and cannot be easily tuned via sysctl.

To eliminate the latency, we must prevent the EFS server from issuing delegations for these rapidly changing files, or disable delegations entirely on the client side.

NFSv4.1 allows clients to opt out of delegations during the initial EXCHANGE_ID sequence, but AWS EFS does not provide granular server-side controls to disable delegations per directory. On the client side, the Linux nfs kernel module accepts parameters that affect the callback channel delegations depend on.

# cat /sys/module/nfs/parameters/callback_tcpport
0

# echo "options nfs callback_tcpport=0" > /etc/modprobe.d/nfs.conf

While modifying callback ports breaks the callback path, the cleanest way to bypass delegation mechanics and attribute-caching issues on a shared filesystem across multiple active PHP nodes is to adjust the mount options directly. Combining nocto (no close-to-open cache consistency) with actimeo=0 (attribute caching disabled) forces immediate validation at the cost of metadata overhead; AWS also recommends noresvport on EFS mounts, though that option governs reconnection behavior rather than caching.
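As a concrete illustration, an fstab entry combining these options might look like the following. The file system ID, region, and mount point are placeholders, and the remaining options mirror AWS's generally recommended EFS settings:

```shell
# /etc/fstab -- hypothetical EFS entry; fs-12345678 and the region are placeholders.
# nocto + actimeo=0 trade client caching for immediate cross-node visibility.
fs-12345678.efs.us-east-1.amazonaws.com:/ /var/www/html nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,nocto,actimeo=0,_netdev 0 0
```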

A more precise approach for purely lock-related stalls over EFS is to utilize the local_lock option if the locks are only needed for intra-node concurrency, but here the concurrency is inter-node.
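For completeness, the client-side override looks like this; it is only safe when no other node needs to observe the lock, which is not the case here (the server address is a placeholder):

```shell
# local_lock=flock satisfies flock() entirely on this node: the NFS client never
# sends a LOCK operation to the server, so no delegation recall is triggered,
# but other nodes will not see the lock at all.
sudo mount -t nfs4 -o nfsvers=4.1,local_lock=flock \
    fs-12345678.efs.us-east-1.amazonaws.com:/ /var/www/html
```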

The definitive solution is to bypass NFS entirely for application-level synchronization locks.

// Replace flock() over NFS with Redis-based distributed locking.
$redis = new Redis();
$redis->pconnect('redis.internal.cluster', 6379);

$lock_key = 'theme_compiler_lock';
$token    = bin2hex(random_bytes(16)); // identifies this node as the lock owner

// NX: acquire only if free; EX 10: auto-expire so a crashed worker cannot wedge the lock.
if ($redis->set($lock_key, $token, ['nx', 'ex' => 10])) {
    try {
        file_put_contents('/var/www/html/wp-content/themes/charityheart/assets/css/compiled.css', $css_buffer);
    } finally {
        // Atomic check-and-delete: never release a lock another node has since re-acquired.
        $redis->eval("if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) end", [$lock_key, $token], 1);
    }
}
