PHP-FPM accept queue drops on FreeBSD 13.2
DTrace analysis of ZFS ARC lock contention
Environment: FreeBSD 13.2-RELEASE-p3, GENERIC amd64. Hardware: Bare metal, Dual AMD EPYC 7302, 256GB ECC RAM. Storage: 2x 1.92TB NVMe in ZFS mirror (zroot). Application Layer: Nginx 1.24.0, PHP-FPM 8.2.10.
Monitoring metrics indicated intermittent 502 Bad Gateway responses originating from Nginx. The frequency was low, approximately 0.05% of total requests, but persistent. CPU utilization hovered at a stable 12%. Network interface queues (ix0) showed no drops or overruns. Memory was fully allocated to the ZFS ARC, leaving 4GB free, which is normal for FreeBSD.
The affected application is a heavily visual staging site using the Park - Creative Portfolio WordPress theme. The issue did not correlate with peak request rates; it occurred randomly throughout the 24-hour cycle.
I checked the socket states on the local UNIX domain socket connecting Nginx to PHP-FPM using netstat -Lan.
# netstat -Lan | grep php
unix 0/0/1024 /var/run/php-fpm.sock
unix 129/0/128 /var/run/php-fpm-portfolio.sock
The second socket (php-fpm-portfolio.sock) shows an anomaly: 129/0/128. netstat -L reports qlen/incqlen/maxqlen: the first value is the current depth of the listen queue, and the third is the maximum allowed (the listen(2) backlog). The queue has overflowed its 128-entry limit. When that happens the kernel refuses new connections; a UNIX domain socket has no SYN to drop, so connect(2) simply fails (ECONNREFUSED, or EAGAIN for a non-blocking connect), Nginx cannot reach the pool, and a 502 is logged.
The queue is overflowing not because of a lack of FPM workers (the pool is configured with pm = static, pm.max_children = 200), but because the workers are blocked and unable to call accept() on the socket.
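The failure mode can be reproduced in miniature: a listener whose workers never call accept() saturates its backlog, after which further connection attempts fail outright. A minimal sketch (Python for portability; the socket path and backlog of 1 are illustrative, and a non-blocking connect to a full UNIX-socket queue fails with EAGAIN on Linux versus ECONNREFUSED on FreeBSD):

```python
import os
import socket
import tempfile

# A listener that never accepts, with a deliberately tiny backlog --
# a scaled-down model of the saturated php-fpm-portfolio.sock queue.
path = os.path.join(tempfile.mkdtemp(), "fpm.sock")
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)
srv.listen(1)

results, clients = [], []
for _ in range(8):
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.setblocking(False)          # like Nginx's non-blocking upstream connect
    try:
        c.connect(path)
        results.append("queued")
    except OSError as e:
        results.append(e.errno)   # EAGAIN (Linux) / ECONNREFUSED (FreeBSD)
    clients.append(c)

queued = results.count("queued")
print(f"{queued} queued, {len(results) - queued} refused")
```

This is the same arithmetic as 129/0/128, just scaled down: once the queue is full, retrying from the client side cannot help until a worker accepts.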
I needed to observe what the PHP-FPM workers were doing when they blocked. Standard top or procstat polling is too slow to catch micro-stalls. I utilized DTrace, the native dynamic tracing framework in FreeBSD, to profile the off-CPU time of the php-fpm processes.
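The idea behind off-CPU profiling can be shown in userland first: stamp the clock when a thread blocks, compare when it resumes, and keep only stalls above a threshold. A Python sketch of the same 50 ms threshold logic (the real measurement, below, rides on kernel scheduler probes instead of a userland lock):

```python
import threading
import time

BLOCK_THRESHOLD_NS = 50_000_000   # 50 ms, matching the DTrace predicate

lock = threading.Lock()
stalls = []

def worker():
    t0 = time.monotonic_ns()      # "off-cpu" timestamp
    with lock:                    # blocks while the contender holds the lock
        blocked = time.monotonic_ns() - t0   # back "on-cpu"
    if blocked > BLOCK_THRESHOLD_NS:
        stalls.append(blocked)

lock.acquire()                    # simulate another thread holding the resource
t = threading.Thread(target=worker)
t.start()
time.sleep(0.08)                  # hold for ~80 ms, like the stalls seen later
lock.release()
t.join()
print(stalls)
```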
I wrote a DTrace script (offcpu.d) to capture the kernel stack trace of any php-fpm process that blocked in the kernel for more than 50 milliseconds.
#!/usr/sbin/dtrace -s

sched:::off-cpu
/execname == "php-fpm"/
{
	/* timestamp is DTrace's monotonic nanosecond counter */
	self->ts = timestamp;
}

sched:::on-cpu
/self->ts && (timestamp - self->ts) > 50000000/
{
	/* blocked longer than 50 ms: record where in the kernel we slept */
	@stacks[stack()] = count();
}

sched:::on-cpu
/self->ts/
{
	self->ts = 0;
}
I executed the script and waited for a 502 error to appear in the Nginx logs.
# dtrace -s offcpu.d
dtrace: script 'offcpu.d' matched 3 probes
^C
kernel`sleepq_wait+0x2c
kernel`_sleep+0x1d2
kernel`sx_xlock_hard+0x3d0
kernel`zfs_zget+0xc4
kernel`zfs_lookup+0x334
kernel`zfs_vptofh+0x14
kernel`vfs_hash_get+0xea
kernel`vop_stdlookup+0x62
kernel`vfs_cache_lookup+0xae
kernel`lookup+0x4a7
kernel`namei+0x4d5
kernel`kern_statat+0x8b
kernel`sys_fstatat+0x3b
kernel`amd64_syscall+0x10c
14
Reading the stack from the bottom up: the stat(2) syscall (kern_statat) enters the Virtual File System (VFS) layer (namei, lookup), crosses into the ZFS module (zfs_lookup), and blocks in sx_xlock_hard inside zfs_zget.
An sx lock is FreeBSD's shared/exclusive lock; sx_xlock_hard is the slow path taken when an exclusive acquisition cannot be granted immediately. The PHP-FPM process is attempting to take an exclusive lock on a ZFS data structure and is put to sleep (sleepq_wait) because another thread holds it.
To understand why stat calls were blocking on ZFS locks, I needed to trace the specific file paths being accessed. I modified the DTrace script to capture the arguments to stat and open calls that took longer than 50ms.
syscall::fstatat:entry,
syscall::openat:entry
/execname == "php-fpm"/
{
	/* arg1 is the userland path pointer for both fstatat(2) and openat(2) */
	self->path = copyinstr(arg1);
	self->ts = timestamp;
}

syscall::fstatat:return,
syscall::openat:return
/self->ts && (timestamp - self->ts) > 50000000/
{
	printf("Slow I/O: %d ns on path: %s\n",
	    timestamp - self->ts, self->path);
}

syscall::fstatat:return,
syscall::openat:return
/self->ts/
{
	self->ts = 0;
	self->path = 0;
}
Output:
Slow I/O: 87402100 ns on path: /usr/local/www/park/wp-content/themes/park/template-parts/svg/icon-arrow.php
Slow I/O: 92100400 ns on path: /usr/local/www/park/wp-content/themes/park/template-parts/portfolio/grid-item-01.php
Slow I/O: 88201500 ns on path: /usr/local/www/park/wp-content/themes/park/template-parts/svg/icon-behance.php
The blocked calls were consistently interacting with microscopic .php files inside the theme's template-parts directory.
A common design pattern in WordPress themes built around complex portfolios, this one included, is to break the UI into hundreds of reusable components. Instead of inline SVGs or a unified sprite sheet, every icon and grid element is a separate .php file included via get_template_part().
A single page render requires PHP to stat() and open() over 400 distinct files. While PHP Opcache caches the compiled bytecode, the PHP engine still verifies the file modification time (stat) on every request to ensure the cached script is not stale, unless explicitly told otherwise.
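Back-of-the-envelope arithmetic (the request rate here is an assumed figure for illustration) shows how the revalidation policy multiplies metadata traffic:

```python
FILES_PER_REQUEST = 400   # distinct template-parts per page render
REQ_PER_SEC = 50          # assumed steady request rate, for illustration only

# revalidate_freq = 0: every include is stat()ed on every request
stats_per_sec_always = FILES_PER_REQUEST * REQ_PER_SEC

# revalidate_freq = 60: each cached file is re-checked at most once per 60 s,
# regardless of how many requests include it in between
stats_per_sec_every_60 = FILES_PER_REQUEST / 60

print(stats_per_sec_always)               # 20000 stat() calls per second
print(round(stats_per_sec_every_60, 1))   # ~6.7 per second
```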
These 400 stat() calls per request trigger ZFS metadata lookups. ZFS stores metadata (permissions, timestamps, file sizes) in structures called System Attributes (SA), which are cached in the ZFS ARC (Adaptive Replacement Cache).
I checked the ARC statistics using arcstat.
# arcstat -f time,hit%,miss%,read,arcsz,c,meta
    time  hit%  miss%  read  arcsz      c  meta
09:21:00    98      2     0   240G   240G   48G
09:21:01    99      1     0   240G   240G   48G
09:21:02    82     18  4.2K   240G   240G   48G
The hit rate occasionally dipped to 82%, correlating with a burst of disk reads. The meta column showed 48GB of the ARC was dedicated to metadata.
I dumped the relevant ZFS kernel tunables.
# sysctl vfs.zfs.arc | grep meta
vfs.zfs.arc.meta_limit: 64424509440
vfs.zfs.arc.meta_min: 0
vfs.zfs.arc.meta_strategy: 1
The vfs.zfs.arc.meta_limit of 64424509440 bytes is 60 GiB, which is exactly 25% of the 240 GiB ARC maximum (the c value above), the traditional one-quarter-of-ARC ratio. The ARC size (arcsz) is 240GB.
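As a quick sanity check on the units (arc_max is taken from the c column in the arcstat output above):

```python
meta_limit = 64_424_509_440   # vfs.zfs.arc.meta_limit, bytes
arc_max    = 240 * 2**30      # c = 240G from arcstat

print(meta_limit / 2**30)     # 60.0 GiB
print(meta_limit / arc_max)   # 0.25, i.e. one quarter of the ARC maximum
```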
The issue lies in the ARC eviction mechanics. ZFS maintains several lists for the ARC: MRU (Most Recently Used), MFU (Most Frequently Used), and their ghost lists. When a PHP worker requests metadata for icon-arrow.php, ZFS checks the ARC. If the metadata is not in the ARC, ZFS issues a synchronous disk read.
However, the NVMe drives are fast (sub-millisecond reads). The 80ms delay captured by DTrace is not a disk I/O bottleneck. It is a lock contention bottleneck inside the ARC code.
When the metadata limit (~64GB) nears exhaustion because the filesystem holds millions of small files, as a dense web document root does, the ARC eviction thread (arc_reclaim_thread in older code; split into arc_evict and arc_reap zthreads in current OpenZFS) wakes up to free space.
To evict an entry, or to promote an entry from the MRU to the MFU list, ZFS must lock the specific hash bucket in the ARC. The OpenZFS codebase uses a multilist structure protected by an array of mutexes.
If 100 PHP-FPM workers process requests simultaneously, each render issuing roughly 400 stat() calls against template-parts, tens of thousands of lookups traverse and modify the ARC lists in a short burst. If an eviction cycle is running at the same time, the eviction path holds bucket locks while it scans, every lookup that hashes to a busy bucket stalls, and the exclusive acquisition captured in the stack trace (sx_xlock_hard via zfs_zget) queues up behind the churn.
The PHP workers sleep on the contended locks, stop calling accept(), and the listen queue in the kernel fills within milliseconds. Nginx's connection attempts are refused, and it logs 502s.
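To get a feel for how the convoy forms, here is a hypothetical sketch; the bucket-lock count and lookup volume are illustrative stand-ins, not the actual OpenZFS constants:

```python
import random

random.seed(7)
N_BUCKET_LOCKS = 256   # illustrative; OpenZFS sizes its lock array differently
N_LOOKUPS = 40_000     # one burst: ~100 workers x ~400 stat() calls each

# Distribute the burst of lookups across the bucket-lock array, the way a
# buffer hash would, and count how many land on each lock.
waiters = [0] * N_BUCKET_LOCKS
for _ in range(N_LOOKUPS):
    h = random.getrandbits(64)        # stand-in for the buffer hash
    waiters[h % N_BUCKET_LOCKS] += 1

print(max(waiters), min(waiters))
# Even with a perfectly uniform hash, each bucket lock averages ~156
# contenders in the burst; an eviction pass holding one bucket stalls
# all of its waiters at once.
```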
To resolve this, the workload must be adapted to avoid triggering the ZFS metadata lock contention entirely. We must stop PHP from issuing the stat() syscalls.
PHP Opcache has a directive opcache.revalidate_freq which dictates how often it checks timestamps. In this staging environment, it was set to 0 (check on every request) to aid development.
Furthermore, PHP maintains a realpath_cache to resolve paths, which also has a TTL.
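The realpath cache behaves like a per-process TTL map: within the TTL, a repeated path resolves without touching the filesystem at all. A toy model of the idea (illustrative only; PHP's internal implementation differs):

```python
import time

class TTLPathCache:
    """Toy model of a realpath-style cache with a time-to-live."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}           # path -> (resolved, stored_at)

    def resolve(self, path, do_resolve):
        hit = self.entries.get(path)
        now = time.monotonic()
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]           # fresh entry: no syscalls needed
        value = do_resolve(path)    # stale or missing: hit the filesystem
        self.entries[path] = (value, now)
        return value

calls = []

def resolver(path):
    calls.append(path)              # stands in for the lstat()/readlink() work
    return "/real" + path

cache = TTLPathCache(ttl_seconds=120)
for _ in range(400):                # one page render's worth of includes
    cache.resolve("/wp-content/themes/x/icon.php", resolver)
print(len(calls))                   # resolved once; the other 399 are hits
```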
; /usr/local/etc/php/8.2/conf.d/10-opcache.ini
opcache.validate_timestamps=1
opcache.revalidate_freq=60
opcache.max_accelerated_files=30000
; /usr/local/etc/php/8.2/php.ini
realpath_cache_size=4096K
realpath_cache_ttl=120