Debugging CLOSE_WAIT socket leaks in PHP-FPM

Hardware and OS Initialization State

The underlying infrastructure consists of a single bare-metal server. CPU: AMD Ryzen 9 5950X (16 cores, 32 threads). RAM: 128GB ECC DDR4-3200. Storage: 2x 2TB NVMe SSDs configured in software RAID 1 (mdadm). Operating System: Ubuntu 22.04.3 LTS. Kernel: 5.15.0-87-generic.

The software stack utilizes Nginx 1.24.0 acting as a reverse proxy, passing requests via the FastCGI protocol to PHP 8.2.10 FPM. The database backend is PostgreSQL 15.4, and Redis 7.0.12 operates as the object cache.

The application layer hosts a digital asset platform operating on the GamePlex - eSports and Gaming NFT WordPress Theme. This specific theme implements backend logic that frequently communicates with external Web3 RPC endpoints (Ethereum and Polygon nodes) to verify non-fungible token ownership and query smart contract states during user authentication and profile rendering phases.

At 08:00 UTC, routine metric scraping via Prometheus and node_exporter revealed a slow, continuous upward trend in the node_netstat_Tcp_CurrEstab and related TCP socket state metrics. Available file descriptors for the www-data user were decreasing linearly over a 72-hour window. There were no spikes in incoming HTTP requests, no anomalies in CPU load, and memory utilization remained stable. The issue was a silent accumulation of orphaned network sockets.
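The per-process descriptor counts behind that trend can be read straight out of /proc. A minimal sketch of the check (count_fds is my own helper name, not a standard tool; during the incident it was pointed at the php-fpm worker PIDs from `pgrep -u www-data php-fpm` rather than the current shell):

```shell
# Count open file descriptors for a PID by listing /proc/<pid>/fd.
# count_fds is a hypothetical helper used for illustration.
count_fds() {
    ls "/proc/$1/fd" | wc -l
}

# Demonstrate on the current shell; any live process holds at least
# stdin, stdout, and stderr.
count_fds $$
```

Run periodically (or scraped into Prometheus), a linearly growing count per worker is the leak signature long before the ulimit is hit.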

Socket State Analysis via ss and lsof

I initiated the investigation by examining the current TCP socket states across the system. The netstat utility is deprecated and slower due to its reliance on /proc parsing, so I utilized ss (socket statistics), which queries the kernel directly via netlink sockets.

ss -tanp | awk '{print $1}' | sort | uniq -c | sort -n
      1 State
      4 SYN-RECV
     12 FIN-WAIT-2
     18 FIN-WAIT-1
     45 TIME-WAIT
    210 ESTAB
   8405 CLOSE-WAIT

The output indicated 8,405 sockets stagnating in the CLOSE_WAIT state. To understand why this is problematic, consider the TCP connection teardown sequence. When a remote server (in this case, the external Web3 RPC node) decides to terminate a connection, it sends a TCP FIN packet. The local kernel receives this FIN and immediately replies with an ACK, and at that moment the local socket transitions into the CLOSE_WAIT state.

The kernel is now waiting for the local application (PHP-FPM) to acknowledge the termination by executing the close() system call on the corresponding file descriptor. Until the application explicitly calls close(), the socket remains in CLOSE_WAIT. The kernel cannot force-close it because the application might still need to read remaining data from the receive buffer. If the application logic is flawed and never calls close(), the socket will remain in CLOSE_WAIT indefinitely, consuming a file descriptor until the process itself terminates.

To identify the specific processes holding these descriptors, I executed lsof.

lsof -n -P -i TCP -s TCP:CLOSE_WAIT | head -n 20
COMMAND    PID     USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
php-fpm 104202 www-data   14u  IPv4 9841021      0t0  TCP 192.168.1.50:48192->104.18.25.14:443 (CLOSE_WAIT)
php-fpm 104202 www-data   15u  IPv4 9841022      0t0  TCP 192.168.1.50:48194->104.18.25.14:443 (CLOSE_WAIT)
php-fpm 104202 www-data   17u  IPv4 9841028      0t0  TCP 192.168.1.50:48202->104.18.25.14:443 (CLOSE_WAIT)
php-fpm 104205 www-data   12u  IPv4 9841105      0t0  TCP 192.168.1.50:48312->104.18.25.14:443 (CLOSE_WAIT)
php-fpm 104205 www-data   18u  IPv4 9841112      0t0  TCP 192.168.1.50:48320->104.18.25.14:443 (CLOSE_WAIT)
php-fpm 104208 www-data   22u  IPv4 9841188      0t0  TCP 192.168.1.50:48418->104.18.25.14:443 (CLOSE_WAIT)

The output confirmed that the PHP-FPM pool worker processes were the culprits. The remote IP 104.18.25.14 belongs to the Cloudflare network, specifically routing to the external Web3 infrastructure provider configured in the application.
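A quick aggregation showed how the leaked sockets were distributed across the pool. A sketch of the pipeline (leaks_per_pid is my own name; here it is fed a stubbed three-line sample so the example is self-contained, while in production the input would be `ss -tnp state close-wait`):

```shell
# Extract the owning PID from each ss process annotation,
# e.g. users:(("php-fpm",pid=104202,fd=14)), and count sockets per PID.
leaks_per_pid() {
    awk 'match($0, /pid=[0-9]+/) {
        print substr($0, RSTART + 4, RLENGTH - 4)
    }' | sort | uniq -c | sort -rn
}

# Stubbed sample standing in for: ss -tnp state close-wait
leaks_per_pid <<'EOF'
CLOSE-WAIT 0 0 192.168.1.50:48192 104.18.25.14:443 users:(("php-fpm",pid=104202,fd=14))
CLOSE-WAIT 0 0 192.168.1.50:48194 104.18.25.14:443 users:(("php-fpm",pid=104202,fd=15))
CLOSE-WAIT 0 0 192.168.1.50:48312 104.18.25.14:443 users:(("php-fpm",pid=104205,fd=12))
EOF
# prints counts per PID: 2 for 104202, 1 for 104205
```

An even spread across workers (as seen here) points at shared application code rather than one wedged process.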

Deep Diving the Process File Descriptors

To verify the exact nature of the file descriptor leak, I isolated a single PHP-FPM worker PID (104202) and inspected its /proc filesystem directory. The Linux /proc/$PID/fd directory contains symbolic links to all files, sockets, and pipes opened by the process.

ls -l /proc/104202/fd
total 0
lrwx------ 1 www-data www-data 64 Nov 12 08:15 0 -> /dev/null
lrwx------ 1 www-data www-data 64 Nov 12 08:15 1 -> /dev/null
lrwx------ 1 www-data www-data 64 Nov 12 08:15 2 -> /var/log/php8.2-fpm.log
lrwx------ 1 www-data www-data 64 Nov 12 08:15 3 -> socket:[9840100]
lrwx------ 1 www-data www-data 64 Nov 12 08:15 4 -> anon_inode:[eventpoll]
lrwx------ 1 www-data www-data 64 Nov 12 08:15 5 -> socket:[9840102]
lrwx------ 1 www-data www-data 64 Nov 12 08:15 14 -> socket:[9841021]
lrwx------ 1 www-data www-data 64 Nov 12 08:15 15 -> socket:[9841022]
lrwx------ 1 www-data www-data 64 Nov 12 08:15 17 -> socket:[9841028]
...

The standard descriptors (0, 1, 2) mapped to /dev/null and the error log. Descriptor 3 was the listening FastCGI Unix domain socket. Descriptors 14, 15, and 17 mapped to network sockets.

I queried the kernel's socket tracking table directly to map the inode numbers from the ls output to the network connection details.

grep 9841021 /proc/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode                                                     
  84: 3201A8C0:BC40 0E191268:01BB 08 00000000:00000000 00:00000000 00000000    33        0 9841021 1 0000000000000000 20 4 30 10 -1

Decoding the hexadecimal values:

- local_address: 3201A8C0 is 192.168.1.50, and BC40 is port 48192.
- rem_address: 0E191268 is 104.18.25.14, and 01BB is port 443.
- st: 08 is the hexadecimal representation of the CLOSE_WAIT state (TCP_CLOSE_WAIT in the kernel source).
- tx_queue and rx_queue are 00000000, meaning no data is stuck in the kernel buffers. The connection is fully drained but simply unclosed.
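The decoding can be scripted rather than done by hand. A bash sketch (decode_addr is my own helper name; it assumes a little-endian host such as this x86-64 server, which is why the four IPv4 bytes are read back to front):

```shell
# Convert a /proc/net/tcp "address:port" hex pair into dotted-quad:port.
# The 32-bit IPv4 address is stored in host (little-endian) byte order,
# so the hex bytes are read from the end backwards; the port reads
# directly as a hex number.
decode_addr() {
    local hex=${1%:*} port=${1#*:}
    printf '%d.%d.%d.%d:%d\n' \
        "$((16#${hex:6:2}))" "$((16#${hex:4:2}))" \
        "$((16#${hex:2:2}))" "$((16#${hex:0:2}))" \
        "$((16#$port))"
}

decode_addr 3201A8C0:BC40   # 192.168.1.50:48192
decode_addr 0E191268:01BB   # 104.18.25.14:443
```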

Application Logic and cURL Handling

The evidence pointed directly to the application layer failing to issue a close() syscall after the external RPC server terminated the keep-alive connection.

WordPress themes that interface with external APIs typically utilize the WordPress HTTP API (wp_remote_get, wp_remote_post), which internally relies on the PHP cURL extension or streams. This theme, however, drives the cURL extension directly.

I examined the theme's core library responsible for communicating with the Web3 node. The specific file was /var/www/html/wp-content/themes/gameplex/inc/web3/class-rpc-client.php.

private function execute_rpc_call($method, $params) {
    $ch = curl_init();

    $payload = json_encode([
        'jsonrpc' => '2.0',
        'method'  => $method,
        'params'  => $params,
        'id'      => time()
    ]);

    curl_setopt($ch, CURLOPT_URL, $this->rpc_endpoint);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        'Content-Type: application/json',
        'Connection: keep-alive'
    ]);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $response = curl_exec($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($http_code !== 200 || $response === false) {
        // Error logging omitted for brevity
        return false;
    }

    // Logic parsing the response
    $decoded = json_decode($response, true);

    return $decoded;
}

The logical error was evident. The code instantiated a cURL handle $ch, executed the request, retrieved the data, and returned without ever calling curl_close($ch) or otherwise releasing the handle. The early-return error path at the HTTP status check leaked the handle in exactly the same way.

In standard PHP execution, when a variable falls out of scope and its reference count drops to zero, the engine destroys it. For a cURL handle (a resource in PHP 7, a CurlHandle object since PHP 8.0), that destruction triggers the underlying libcurl cleanup, which closes the file descriptor.

However, PHP-FPM operates a persistent process model. Worker processes remain active to handle multiple consecutive requests. If a resource is inadvertently stored in a static variable, a persistent object cache, or if an internal reference cycle prevents garbage collection, the cURL handle persists across HTTP requests.

Furthermore, the explicit Connection: keep-alive header reinforced libcurl's default behavior of holding the TCP socket open for reuse. When the remote RPC server eventually hit its own idle timeout and sent a FIN packet to close the connection, the local kernel ACKed it, entering CLOSE_WAIT. But because the cURL handle was never destroyed within the worker's lifecycle, the PHP process never executed the close() syscall. The socket was trapped.

Resolving the Application Layer

The primary fix required modifying the application code to enforce explicit resource cleanup. I patched class-rpc-client.php.

private function execute_rpc_call($method, $params) {
    $ch = curl_init();

    // ... curl_setopt logic remains identical ...

    $response = curl_exec($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Explicitly release the handle to free the file descriptor.
    // On PHP >= 8.0 curl_close() is kept only for backward
    // compatibility; dropping the last reference is what actually
    // frees the handle, so unset() as well.
    curl_close($ch);
    unset($ch);

    if ($http_code !== 200 || $response === false) {
        return false;
    }

    $decoded = json_decode($response, true);

    return $decoded;
}

By ensuring the handle is released before the method returns, libcurl tears down its cached connection and the close() syscall is issued, freeing the file descriptor. Note that on PHP 8, curl_close() is retained only for backward compatibility: the handle is actually freed when the last reference to the CurlHandle object is destroyed, so the real requirement is that no reference to $ch outlives the call.

FPM and Kernel Configuration Adjustments

While the code patch resolved the immediate leak, relying solely on application-level correctness in a complex environment is insufficient. I needed to configure system-level guardrails to ensure that orphaned sockets within persistent processes are forcefully terminated.

I adjusted the PHP-FPM pool configuration. The default www.conf lacked a hard execution limit for the worker processes themselves.

; /etc/php/8.2/fpm/pool.d/www.conf
pm = dynamic
pm.max_children = 200
pm.start_servers = 20
pm.min_spare_servers = 10
pm.max_spare_servers = 30
pm.max_requests = 500

The pm.max_requests = 500 directive is critical. It instructs the FPM process manager to automatically terminate and respawn any worker process after it has served 500 requests. When the worker process is killed by the master, the operating system kernel unconditionally reclaims all file descriptors and network sockets held by that PID, wiping out any accumulated CLOSE_WAIT sockets.
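The reclamation that pm.max_requests relies on is a kernel guarantee, not an FPM feature: when any process exits, every descriptor it held is closed unconditionally. A tiny sketch of that behaviour using a throwaway child process in place of an FPM worker:

```shell
# Start a child process holding an extra descriptor (fd 9) that it
# never closes, mimicking a leaky worker.
sleep 2 9</dev/null &
child=$!
sleep 0.3   # give the shell a moment to set up the child's redirection

ls "/proc/$child/fd/9"   # present while the child is alive

# After the child exits, the kernel has reclaimed the descriptor
# without the child ever calling close().
wait "$child"
ls "/proc/$child/fd/9" 2>/dev/null || echo "fd reclaimed by kernel"
```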

Next, I examined the kernel's TCP stack parameters via sysctl.

sysctl net.ipv4.tcp_keepalive_time
sysctl net.ipv4.tcp_fin_timeout
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_fin_timeout = 60

The tcp_keepalive_time was set to the Linux default of 7200 seconds (2 hours). If a connection was established but went completely idle, the kernel would wait two hours before sending keepalive probes. This allows stale connections to persist for too long.

I modified /etc/sysctl.d/99-custom-tcp.conf to implement more aggressive network tuning.

net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_fin_timeout = 15

These parameters instruct the kernel to start probing idle connections after 5 minutes (300 seconds), send a probe every 60 seconds, and drop the connection if 5 consecutive probes go unanswered; note that keepalive probes are only sent on sockets that have enabled SO_KEEPALIVE. The tcp_fin_timeout dictates how long orphaned connections remain in the FIN_WAIT2 state before being purged by the kernel. While neither setting directly clears the CLOSE_WAIT state (which requires application intervention), tuning these timers ensures that the rest of the TCP stack aggressively prunes dead connections.
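The practical effect of the keepalive triple is easiest to see as arithmetic: a dead idle peer is declared roughly after tcp_keepalive_time plus tcp_keepalive_probes times tcp_keepalive_intvl seconds. A sketch of the comparison (detect_secs is my own helper; the 9-probe/75-second figures are the stock Linux defaults):

```shell
# Approximate worst-case seconds to declare a dead idle peer:
# idle threshold plus all keepalive probes failing.
detect_secs() {
    echo $(( $1 + $2 * $3 ))
}

detect_secs 7200 9 75   # Linux defaults: 7875s (about 2.2 hours)
detect_secs 300 5 60    # tuned values:   600s (10 minutes)
```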

I applied the kernel parameters.

sysctl -p /etc/sysctl.d/99-custom-tcp.conf

Finally, I restarted the PHP-FPM service to apply the pm.max_requests configuration and immediately release the 8,405 stuck sockets held by the old workers.

systemctl restart php8.2-fpm

Post-Implementation Verification

Following the service restart and code deployment, I monitored the socket states.

watch -n 2 "ss -tanp | awk '{print \$1}' | sort | uniq -c | sort -n"

The metric for CLOSE_WAIT immediately dropped to zero. Over the next 4 hours of continuous operation, serving standard traffic and background Web3 API sync tasks, the CLOSE_WAIT count never exceeded 15, representing normal transient connection teardowns. The file descriptor count per PHP-FPM worker stabilized, fluctuating predictably within normal operational margins without linear degradation.
