RPC slot table exhaustion during Zend stat() syscall floods

The Time-To-First-Byte Jitter

A recurring 3-second delay appeared in the Time To First Byte (TTFB) metrics of a specific cluster of web nodes. The nodes host an enterprise service portal built on the Softy Solutions - IT Services & Digital Agency WordPress theme. The architecture relies on a shared Network File System (NFSv4) mount for the wp-content directory to ensure static asset consistency across the horizontally scaled fleet.

CPU utilization across the fleet was flat at 15%. Memory usage was stable. The local NVMe boot volumes showed zero I/O wait. The database query logs indicated that all MySQL transactions completed in under 12 milliseconds. The latency was isolated entirely to PHP-FPM process execution time, specifically the initialization phase of the request lifecycle.

The 3-second stall was not continuous. It occurred in a jagged pattern, manifesting roughly every 60 seconds, persisting for a few requests, and then disappearing, returning the TTFB to a baseline of 150 milliseconds.

Profiling the Kernel Call Graph

To isolate the blocking operation within the PHP-FPM workers without injecting application-layer profiling overhead, I utilized perf to sample the CPU instruction pointers and generate a call graph during the exact window of the TTFB stall.

perf record -a -g -F 99 -- sleep 10

I extracted the report using perf report --stdio and filtered for the php-fpm process threads.

-   92.45%     0.00%  php-fpm  [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
   - entry_SYSCALL_64_after_hwframe
      - do_syscall_64
         - 91.00% ksys_newstat
            - 90.50% vfs_statx
               - 90.00% vfs_getattr_nosec
                  - 89.50% nfs_getattr
                     - 89.00% nfs4_call_sync
                        - 88.50% rpc_call_sync
                           - 88.00% rpc_execute
                              - 87.50% rpc_wait_bit_killable
                                 - 87.00% schedule

The stack trace is definitive. The PHP-FPM worker process was not executing PHP userland code. It was suspended in the D state (uninterruptible sleep) inside the Linux kernel.

The execution path reveals that PHP issued a stat() system call (ksys_newstat). The Virtual File System (VFS) routed this to the NFS driver (nfs_getattr). The NFS driver did not find the file attributes in the local kernel cache. It was forced to initiate a synchronous Remote Procedure Call (nfs4_call_sync) to the NFS server to fetch the metadata.

The RPC call then entered the SunRPC scheduling layer (rpc_execute) and immediately blocked (rpc_wait_bit_killable), yielding the CPU (schedule). The PHP process was waiting on the network.

Zend Engine OPcache Validation

To understand why PHP was issuing a flood of stat() calls, we must examine the Zend OPcache configuration and the application's file inclusion patterns.

The OPcache extension stores precompiled script bytecode in shared memory, eliminating the need for PHP to read and parse the .php source files on every request. However, to ensure that updates to the source code are reflected, OPcache must validate the timestamp of the files on disk against the timestamp stored in the shared memory segment.

This behavior is controlled by two directives:

opcache.validate_timestamps=1
opcache.revalidate_freq=60

With revalidate_freq=60, OPcache assumes the cached bytecode is valid for 60 seconds. When a request arrives more than 60 seconds after the script was last validated, the Zend Engine re-enters its timestamp-validation path (validate_timestamp_and_record in ext/opcache/ZendAccelerator.c).

/* Simplified extraction from ext/opcache/ZendAccelerator.c */
if (CG(request_info).request_time > script->dynamic_members.revalidation_time) {
    zend_stat_t stat_buf;
    if (zend_stat(script->script_name, &stat_buf) != 0) {
        /* Handle deleted file */
    } else if (stat_buf.st_mtime != script->dynamic_members.mtime) {
        /* Invalidate and recompile */
    } else {
        /* Update revalidation timer */
        script->dynamic_members.revalidation_time = CG(request_info).request_time + ZCG(accel_directives).revalidate_freq;
    }
}

The application relies heavily on deeply nested template structures. A single page load requires including over 180 distinct .php files (components, widgets, core classes).

Every 60 seconds, the OPcache timer expires. On the very next request, the PHP-FPM worker must execute 180 sequential stat() system calls to validate the entire template tree.

If the filesystem is local (ext4/xfs), 180 stat() calls resolve in microseconds via the kernel's dentry and inode caches. Over NFS, if the attributes are not cached locally, 180 sequential network round-trips occur. Even with a low internal network latency of 1 millisecond per round-trip, 180 files equal 180 milliseconds. This does not account for a 3-second stall. There is an additional bottleneck at the RPC layer.
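A quick back-of-envelope model makes the gap concrete (the 180-file and 1 ms figures come from the text above; the function name is illustrative):

```python
# Back-of-envelope model: one PHP-FPM worker issues stat() calls
# sequentially, and each uncached stat() over NFS costs one full
# GETATTR network round-trip.

def sequential_stat_cost_ms(num_files: int, rtt_ms: float) -> float:
    """Wall-clock cost of num_files back-to-back GETATTR round-trips."""
    return num_files * rtt_ms

# 180 template files at 1 ms per round-trip:
print(sequential_stat_cost_ms(180, 1.0))  # 180.0 ms -- far short of 3,000 ms
```

Even this pessimistic serial model leaves most of the 3-second stall unexplained, which is what points the investigation at the RPC layer.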

NFS Attribute Caching Mechanisms

The NFS client mitigates network overhead by caching file attributes (size, mtime, ownership) in the local kernel memory. The longevity of this cache is defined by the mount options: acregmin, acregmax, acdirmin, and acdirmax.

The standard defaults for these values in the Linux NFS client are:

- acregmin=3 (minimum cache time for regular files: 3 seconds)
- acregmax=60 (maximum cache time for regular files: 60 seconds)
- acdirmin=30 (minimum cache time for directories: 30 seconds)
- acdirmax=60 (maximum cache time for directories: 60 seconds)

When a file is not modified, the NFS client dynamically extends its attribute cache TTL up to acregmax. Because the PHP files were static, their attributes should have been cached for 60 seconds.
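The TTL extension can be sketched as a doubling heuristic. This is a simplified model of the Linux client's attrtimeo behavior, not kernel code:

```python
# Simplified model of NFS attribute-cache TTL growth: the timeout
# starts at acregmin, doubles each time a revalidation finds the file
# unchanged, and is capped at acregmax. (Doubling is a simplification
# of the real kernel heuristic.)

def attrtimeo_progression(acregmin: int, acregmax: int, revalidations: int):
    ttl, out = acregmin, []
    for _ in range(revalidations):
        out.append(ttl)
        ttl = min(ttl * 2, acregmax)
    return out

# A static .php file revalidated repeatedly without changing:
print(attrtimeo_progression(3, 60, 6))  # [3, 6, 12, 24, 48, 60]
```

After a few quiet revalidations, a static file should coast at the 60-second ceiling, which is why the observed cache bypass was surprising.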

However, the perf trace proved the kernel was bypassing the attribute cache and hitting the network. To determine why the cache was invalid, I examined the nfs_inode structure in the kernel source (include/linux/nfs_fs.h).

struct nfs_inode {
    /* ... */
    unsigned long       cache_validity;
    /* ... */
};

/* Bitmask flags for cache_validity */
#define NFS_INO_INVALID_ATTR    0x0001
#define NFS_INO_INVALID_DATA    0x0002
#define NFS_INO_INVALID_ATIME   0x0004
#define NFS_INO_INVALID_ACCESS  0x0008
#define NFS_INO_INVALID_ACL     0x0010
#define NFS_INO_REVAL_PAGECACHE 0x0020

The NFS client sets NFS_INO_INVALID_ATTR when it suspects the local cache is stale. If this bit is set, nfs_getattr immediately issues a GETATTR RPC call to the server.

What triggers this invalidation? Directory modifications.

The Directory Modification Trigger

The application theme includes a dynamic CSS compilation routine. To optimize page speed, the theme compiles selected user preferences into a minified CSS file and writes it to a directory within wp-content/uploads/.

Commercial WooCommerce theme add-ons frequently bundle caching plugins or dynamic minifiers that exhibit the same behavior. Every few minutes, a background cron job or a cache-miss trigger writes a new temporary file to the shared directory, renames it, and deletes the old one.

When a file is created or deleted inside an NFS-mounted directory, the mtime (modification time) and ctime (change time) of the parent directory change on the NFS server.

The Linux NFS client directory caching logic (acdirmin) caches directory attributes. When the client detects that a directory's mtime has changed, it sets the NFS_INO_INVALID_ATTR and NFS_INO_INVALID_DATA flags on the directory's inode.

Critically, depending on the exact kernel version and the lookupcache mount option, the invalidation of a directory can cascade. If the directory cache is invalidated, the client drops the cached directory entries (dentries). When PHP issues a stat() for a file inside that directory, the VFS must perform a new path lookup (LOOKUP RPC). If the lookup forces a re-evaluation of the file's inode, the file's attribute cache is also marked invalid.

This creates a destructive cycle:

1. The 60-second OPcache timer expires.
2. A background process writes a dynamic CSS file to the shared NFS mount, invalidating the directory cache.
3. The next HTTP request arrives. OPcache forces a stat() on 180 template files.
4. Because the directory cache was invalidated, the NFS client treats all 180 file attributes as potentially stale.
5. The NFS client generates 180 sequential GETATTR RPC calls over the TCP socket.

SunRPC Protocol and the Wait Queue

We have established why 180 network requests are generated. We must now account for the 3-second delay. At 1 millisecond per round-trip, 180 sequential requests take 180 milliseconds. To bridge the gap to 3,000 milliseconds, we must look at the rpc_wait_bit_killable function identified in the perf trace.

NFS communicates over the SunRPC (Remote Procedure Call) protocol. SunRPC over TCP multiplexes multiple concurrent requests over a single TCP connection to the NFS server. To manage the state of these in-flight requests and prevent memory exhaustion on both the client and the server, SunRPC uses a concept called "slot tables."

A slot table is an array of state structures. When a process wants to send an RPC request, it must acquire an empty slot from the table.
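The slot table behaves like a bounded semaphore: a fixed number of tokens gate how many RPCs can be in flight, and every other requester sleeps. A userspace sketch of that discipline (illustrative, not kernel code):

```python
# Userspace model of an RPC slot table: 16 slots cap the number of
# concurrent in-flight "RPCs". Requesters beyond the cap block in
# acquire() -- the analogue of rpc_wait_bit_killable in the trace.
import threading
import time

SLOTS = 16
slot_table = threading.BoundedSemaphore(SLOTS)
in_flight = 0
peak = 0
lock = threading.Lock()

def rpc_call(rtt_s: float) -> None:
    global in_flight, peak
    with slot_table:                # acquire a slot; blocks if none free
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(rtt_s)           # simulated network round-trip
        with lock:
            in_flight -= 1

# 64 pending requests contend for 16 slots.
threads = [threading.Thread(target=rpc_call, args=(0.005,)) for _ in range(64)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak)  # never exceeds 16
```

No matter how many requests queue up, concurrency is pinned at the slot count; everything else is latency.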

I enabled RPC debugging in the kernel to view the slot allocation behavior.

rpcdebug -m rpc -s call

I then monitored the kernel ring buffer (dmesg -w) during the 3-second stall.

[ 1245.678901] RPC:   1234 call_reserve: xprt ffff8a1001234500
[ 1245.678905] RPC:   1234 xprt_reserve: task waiting for slot
[ 1245.678909] RPC:   1234 rpc_sleep_on: placing task on waitqueue
[ 1245.678912] RPC:   1235 call_reserve: xprt ffff8a1001234500
[ 1245.678915] RPC:   1235 xprt_reserve: task waiting for slot
[ 1245.678918] RPC:   1235 rpc_sleep_on: placing task on waitqueue
... (repeated 48 times) ...

The debug output confirms the bottleneck. The tasks are entering xprt_reserve and failing to acquire a slot. The kernel executes rpc_sleep_on, which adds the task to the xprt->pending wait queue and transitions the process state to TASK_KILLABLE. This is exactly the rpc_wait_bit_killable state we saw in the perf flamegraph.

The PHP-FPM processes are stalled in a queue, waiting for permission to send a packet.

NFSv4 Session Slot Negotiation

Why are there no slots available? In NFSv3, the slot table size was controlled locally by the client kernel parameter sunrpc.tcp_slot_table_entries, which traditionally defaulted to 16.

In NFSv4.1 and newer, the protocol introduces the concept of Sessions (nfs4_session). The slot table size is no longer an arbitrary client-side configuration; it is strictly negotiated between the client and the server during the CREATE_SESSION operation.

To observe this negotiation, I executed a packet capture of the NFS connection initialization on port 2049.

tcpdump -i eth0 -s 0 -w nfs_handshake.pcap port 2049

I unmounted and remounted the NFS volume to force a new session negotiation, then analyzed the .pcap file.

The CREATE_SESSION request sent by the client contains channel attributes. The CREATE_SESSION reply sent by the NFS server dictates the constraints.

NFSv4.1 Reply: CREATE_SESSION
    Status: NFS4_OK (0)
    Session ID: 41 42 43 ...
    Sequence ID: 1
    Flags: 0x00
    Fore Channel Attributes:
        Header Pad Size: 0
        Max Request Size: 1048576
        Max Response Size: 1048576
        Max Responses Cached: 128
        Max Operations: 15
        Max Requests (Slot Table Size): 16

The critical parameter is Max Requests (Slot Table Size): 16.

The managed NFS server provided by the cloud infrastructure explicitly restricts the concurrent in-flight RPC requests to 16 per session.

The Concurrency Multiplier

The FPM pool is configured with pm.max_children = 50.

During the 60-second OPcache invalidation window, the first HTTP request triggers a PHP-FPM worker to issue 180 stat() calls. Because the directory cache was invalidated by the background CSS compilation, the kernel drops to the network and queues 180 GETATTR RPC calls.

The kernel grabs the first 16 slots in the SunRPC table and transmits the packets. The PHP worker process blocks, waiting for the first stat() to return.

Simultaneously, active user traffic continues to arrive. 10 other HTTP requests hit 10 other PHP-FPM workers. These 10 workers also hit the OPcache 60-second expiration logic. They also begin issuing 180 stat() calls each.

The kernel now has 11 PHP-FPM workers attempting to issue a combined total of 1,980 GETATTR RPC calls.

There are only 16 slots available.

The kernel processes the requests sequentially in batches of 16. Batch 1 (16 requests) goes out. The server processes them, sends 16 replies. The kernel frees the 16 slots and wakes up the wait queue. Batch 2 (16 requests) goes out.

If the network round-trip plus the server-side processing time for a GETATTR is 2 milliseconds, clearing 1,980 requests in batches of 16 takes: ceil(1980 / 16) * 2ms = 124 * 2ms = 248 milliseconds (a partial final batch still costs a full turnaround).

This explains roughly 250 milliseconds. However, the server side is not infinite. Managed NFS servers enforce strict IOPS limits based on the allocated storage size.

When 1,980 metadata operations hit the NFS server simultaneously, the storage backend exhausts its burst IOPS bucket. The NFS server begins to queue the requests internally before sending the RPC replies. The server-side processing time degrades from 2 milliseconds to 20 milliseconds per GETATTR.

Recalculating with the degraded IOPS response time: ceil(1980 / 16) * 20ms = 124 * 20ms = 2,480 milliseconds.

Adding network transmission overhead and context switching brings the total delay to roughly 3 seconds. The 3-second jitter is the direct consequence of 11 concurrent PHP workers funneling 1,980 stat() calls through a 16-slot RPC pipeline onto a throttled IOPS storage backend.
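Both figures follow from ceiling division, since a partial final batch still costs a full server turnaround; a minimal check:

```python
# Draining N queued GETATTRs through a fixed-size slot pipeline:
# requests go out in batches of `slots`, and each batch costs one
# full server turnaround. Figures are taken from the text above.
from math import ceil

def drain_time_ms(total_rpcs: int, slots: int, per_batch_ms: int) -> int:
    """Total time to drain the queue, with a partial last batch rounded up."""
    return ceil(total_rpcs / slots) * per_batch_ms

print(drain_time_ms(1980, 16, 2))   # 248  -- healthy server, 2 ms per batch
print(drain_time_ms(1980, 16, 20))  # 2480 -- IOPS-throttled server, 20 ms per batch
```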

Modifying the VFS Lookup Behavior

To resolve the stall, the architectural coupling between the NFS directory attributes, the VFS dentry cache, and the OPcache stat() flood must be severed.

The first intervention occurs at the NFS mount option layer. By default, the NFS client uses lookupcache=all. This means the client caches LOOKUP requests for both positive (file exists) and negative (file does not exist) results. However, when the parent directory mtime changes, the client aggressively purges the cached dentries to maintain POSIX compliance.

Changing this parameter to lookupcache=pos alters the VFS invalidation behavior.

With lookupcache=pos, the client retains positive dentry lookups in the cache even if the parent directory's modification time changes. It assumes that existing files have not been deleted simply because a new file (the dynamic CSS) was added to the directory. This prevents the cascading invalidation of the 180 static PHP template files.

When OPcache issues the stat(), the kernel relies on the individual file's acregmax timer rather than discarding the cache based on the directory's acdirmin timer.

I remounted the NFS volume with the modified parameters.

mount -o remount,lookupcache=pos,noatime,rsize=1048576,wsize=1048576 /var/www/html/wp-content

(The noatime parameter is also recommended. Without it, file reads may trigger access-time updates, generating additional metadata traffic over NFS that competes for the same RPC slots.)
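To persist the options across reboots, an equivalent /etc/fstab entry can be used (the server address and export path below are placeholders):

```
# /etc/fstab -- hypothetical server and export; options as discussed above
nfs.example.internal:/export/wp-content  /var/www/html/wp-content  nfs4  lookupcache=pos,noatime,rsize=1048576,wsize=1048576,_netdev  0  0
```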

Eliminating the Source: OPcache Validation

While tuning the kernel NFS parameters mitigates the RPC slot table exhaustion, it does not eliminate the fundamental flaw: relying on shared network storage for application code interpretation.

Executing 180 stat() system calls per request, even when cached in the local kernel VFS, incurs measurable CPU context-switching overhead. In a distributed infrastructure, the PHP source files should be treated as immutable artifacts deployed alongside the container or virtual machine image, not read dynamically from a shared network volume.

By configuring OPcache to treat the files as immutable, the Zend Engine skips the timestamp-validation path entirely. The revalidation timer comparison is never performed. No stat() calls are issued to the operating system. The NFS layer is never invoked during the PHP request initialization phase.

; /etc/php/8.1/fpm/conf.d/10-opcache.ini

; Disable timestamp validation completely
opcache.validate_timestamps=0

; Allocate sufficient memory to hold the entire application
opcache.memory_consumption=512

; Ensure all 180+ template files fit in the interned strings and max files buffers
opcache.interned_strings_buffer=64
opcache.max_accelerated_files=32000

Deploying this configuration requires a strict operational change. Because OPcache will never check the disk for modifications, deploying a new version of the PHP code requires manually flushing the OPcache via the opcache_reset() function or restarting the PHP-FPM daemon.

Implementing the immutable OPcache configuration dropped the baseline TTFB from 150ms to 45ms and permanently eliminated the 3-second RPC slot exhaustion stalls.
