High disk read wait in scheduled tasks
System Background
I manage a server. The server is a physical machine. The server sits in a data center. The data center is in Chicago. The server runs the Ubuntu operating system. The version is 22.04. The Linux kernel version is 5.15. The server has one processor. The processor has eight cores. The server has 32 gigabytes of memory. The server has one solid state drive. The drive uses the NVMe protocol. The drive has 500 gigabytes of space.
The server hosts a website. The website is for an insurance agency. The client bought a design. The client uses the Insurx – Insurance Agency WordPress Theme. The client installed the theme. The website looks good. The website works well during the day. Users browse the pages. Users fill out forms. The server sends emails. The server load is usually low.
I monitor the server every day. I use a monitoring tool. The tool is Zabbix. Zabbix collects data. Zabbix draws graphs. I look at the graphs. I saw an anomaly. The anomaly happened every hour. The anomaly happened at the top of the hour. The time was 10:00. The time was 11:00. The time was 12:00.
I checked the CPU graph. The CPU usage was normal. The CPU usage was 10 percent. I checked the memory graph. The memory usage was normal. The server used 8 gigabytes of memory. The server had 24 gigabytes of free memory. I checked the network graph. The network traffic was low. The traffic was 2 megabits per second.
I checked the disk graph. The disk graph showed a problem. The disk wait time went up. The metric is called iowait. The iowait measures time. The CPU waits for the disk. The CPU does nothing. The CPU just waits. The iowait reached 40 percent. This number is high. A normal number is 1 percent. The disk wait lasted for two minutes. Then the disk wait dropped to normal. The pattern repeated every hour. I needed to find the cause.
Initial Checks
I logged into the server. I used the SSH protocol. I opened my terminal. I typed the ssh command. I pressed the enter key. I typed my password. I saw the server welcome message.
I checked the system time. I typed the date command. I pressed enter. The terminal printed the date. The time was correct. The time zone was UTC. I checked the disk space. I typed the df -h command. I pressed enter. The terminal printed a table. The root partition had 300 gigabytes of free space. The disk was not full.
I checked the memory again. I typed the free -m command. I pressed enter. The output showed memory in megabytes. The total memory was 32000. The used memory was 8000. The free memory was 24000. The swap memory was zero. The system did not use swap memory. Swap memory can cause high disk wait. The system writes memory pages to the disk. The disk is slower than memory. But the swap was zero. The swap was not the problem.
I checked the system logs. I went to the log folder. I typed cd /var/log. I pressed enter. I checked the system message log. I typed cat syslog | grep error. I pressed enter. The output was empty. I checked the kernel log. I typed dmesg | tail -n 20. I pressed enter. The output showed normal boot messages. The kernel did not report hardware errors. The NVMe drive was healthy.
Disk Monitor
I needed to see the disk activity. I used the iostat command. I typed iostat -x 1 10. I pressed enter. The -x flag shows extended statistics. The 1 means one second interval. The 10 means ten repetitions. The tool printed data every second.
I looked at the output. I read the columns. I found the %util column. This column shows disk utilization. The utilization was 1 percent. I found the await column. This column shows the average wait time for disk operations. The wait time was 2 milliseconds. The disk was fast right now. The time was 10:15. The problem only happens at the top of the hour. I had to wait.
I prepared a better tool. The tool is iotop. The iotop program shows disk usage by process. It is like the top command for disks. I typed sudo iotop -o -d 2. I pressed enter. I typed my password. The -o flag hides idle processes. The -d 2 flag sets a two-second delay. The screen cleared. The screen showed a table. The table was empty. No processes used the disk heavily.
I waited. I watched the clock. The clock reached 10:59. I watched the terminal. The clock reached 11:00. The iotop table changed. A new process appeared. The process ID was 4502. The user was mysql. The command was mysqld. The mysqld is the database server.
I looked at the read column. The read speed was 0 bytes per second. I looked at the write column. The write speed was 150 megabytes per second. The MySQL database wrote 150 megabytes to the disk every second. This caused the disk queue to fill up. This caused the high iowait. The write operation stopped after 110 seconds. The mysqld process disappeared from the top of the list. The disk usage dropped. The iowait dropped. I found the process. The process was MySQL. I needed to know what MySQL was doing.
Database Investigation
I connected to the database. I used the MySQL client tool. I typed mysql -u root -p. I pressed enter. I typed the database password. I saw the mysql> prompt.
I checked the active queries. I typed SHOW FULL PROCESSLIST;. I pressed enter. The command lists all current connections. The command lists the queries running on those connections. The table had five rows. Four rows were in the Sleep state. Sleep means the connection is idle. One row was active.
I read the active row. The user was wordpress. The database was insurx_db. The command was Query. The time was 45 seconds. The query ran for 45 seconds. This is very slow. I read the state column. The state was Copying to tmp table on disk. This state is the core problem.
MySQL needs to sort data sometimes. MySQL uses memory to sort data. MySQL creates a temporary table in memory. Memory has limits. If the data is bigger than the memory limit, MySQL changes the plan. MySQL moves the temporary table to the disk. The disk is slow. The disk writing caused the high wait time.
I copied the query. I pasted the query into a text file. I typed quit to exit MySQL. I opened the text file. I used the nano editor. I typed nano query.txt. I read the query carefully.
SELECT
p.ID,
p.post_title,
m1.meta_value AS policy_type,
m2.meta_value AS expiration_date
FROM
wp_posts p
LEFT JOIN
wp_postmeta m1 ON p.ID = m1.post_id AND m1.meta_key = '_policy_type'
LEFT JOIN
wp_postmeta m2 ON p.ID = m2.post_id AND m2.meta_key = '_expiration_date'
WHERE
p.post_type = 'insurance_policy'
AND p.post_status = 'publish'
ORDER BY
m2.meta_value DESC;
I read line 1. The query selects columns. I read line 7. The query reads the wp_posts table. I read line 8. The query joins the wp_postmeta table. It gets the policy type. I read line 10. The query joins the wp_postmeta table again. It gets the expiration date. I read line 12. The query filters by post type. The type is insurance_policy. I read line 15. The query orders the results. The query sorts by the expiration date. The sort order is descending.
This query is common. The WordPress database uses the Entity-Attribute-Value model. The model is flexible. The model is slow for large datasets. The wp_postmeta table had many rows. The query fetched all policies. The query fetched all meta values. The query joined them. Then the query sorted them. The sorting required a big temporary table.
Theme Analysis
I needed to find the source of the query. Why did the query run every hour? I searched the website files. I went to the web root folder. I typed cd /var/www/html. I pressed enter.
I used the grep command. The grep command searches text in files. I typed grep -rn "insurance_policy" wp-content/. I pressed enter. The command searched the wp-content folder. The command searched all files recursively. The terminal printed a list of files.
I looked at the results. I found a file. The file path was wp-content/themes/insurx/inc/cron-jobs.php. I opened the file. I used the nano editor. I typed nano wp-content/themes/insurx/inc/cron-jobs.php. I pressed enter.
I read the PHP code. I looked at line 15. I saw a WordPress function. The function was wp_schedule_event. The schedule was hourly. The hook was insurx_check_expiring_policies.
I read line 25. I found the function definition. The function name was insurx_check_expiring_policies(). I read the function body. The function used the WP_Query class. The function arguments asked for the insurance_policy post type. The function asked for all posts. The function asked to sort by the _expiration_date meta key.
The developer wrote this code. The developer wanted to check for expired policies every hour. The developer wanted to send emails to clients. The intention was good. The execution was poor. Many people Download WordPress Themes and install them. They do not read the code. The code assumes a small database. The client had a big database. The client had 50,000 policies. Sorting 50,000 policies by a meta value takes memory.
MySQL Configuration Check
I checked the MySQL configuration file. I needed to see the memory limits. I typed cat /etc/mysql/mysql.conf.d/mysqld.cnf. I pressed enter. I read the output. I searched for temporary table settings.
I found line 45. The line was tmp_table_size = 16M. This means 16 megabytes.
I found line 46. The line was max_heap_table_size = 16M. This means 16 megabytes.
These values are small. The query created a data set. The data set size was larger than 16 megabytes. The MySQL engine saw the size. The MySQL engine stopped using memory. The MySQL engine created a file on the disk. The file went to the /tmp folder. The MySQL engine wrote the data to the disk. Then the MySQL engine sorted the data on the disk. Disk sorting is slow. It causes high iowait.
I needed to change the settings. I wanted to give MySQL more memory for temporary tables. I opened the file with an editor. I typed nano /etc/mysql/mysql.conf.d/mysqld.cnf. I pressed enter.
I moved the cursor to line 45. I deleted 16M. I typed 128M. The new limit was 128 megabytes.
I moved the cursor to line 46. I deleted 16M. I typed 128M. The new limit was 128 megabytes.
I saved the file. I pressed Ctrl+O. I pressed enter. I exited the editor. I pressed Ctrl+X.
I restarted the MySQL service. I typed systemctl restart mysql. I pressed enter. The command took three seconds. The service restarted. The new settings were active.
I waited for the next hour. The time became 12:00. I watched the iotop screen. The mysqld process appeared. The disk write speed was low. The disk read speed was low. The iowait did not spike. The CPU usage went up slightly. The query finished in 3 seconds. The disk wait time problem was gone. The memory fixed the symptom.
Index Optimization
Increasing memory fixed the symptom. Increasing memory did not fix the root cause. The root cause was the database structure. The query was still slow. Three seconds is slow for a background query. The query scanned many rows. I needed to analyze the query plan.
I connected to MySQL again. I typed mysql -u root -p. I typed the password. I selected the database. I typed USE insurx_db;.
I used the EXPLAIN command. The EXPLAIN command shows the execution plan. I typed EXPLAIN and pasted the query. I pressed enter.
The console printed a table. I read the table. I looked at the type column. The first row had the type ALL. ALL means a full table scan. MySQL read every row in the wp_posts table. MySQL read 50,000 rows. I looked at the Extra column. The text said Using where; Using temporary; Using filesort.
This confirmed the bad plan. MySQL used a temporary table. MySQL used a file sort. I needed to add indexes. An index makes lookups fast. It is like an index in a book.
I checked the existing indexes on wp_postmeta. I typed SHOW INDEX FROM wp_postmeta;. I pressed enter. The table showed the primary key. The primary key was on meta_id. The table showed an index on post_id. The table showed an index on meta_key.
There was no composite index. A composite index covers two columns. The query searched by post_id and meta_key together. I created a new index. I typed a command.
CREATE INDEX idx_post_meta_key ON wp_postmeta (post_id, meta_key);
I pressed enter. The command ran. The command took 5 seconds. The database created the index. I ran the EXPLAIN command again. I typed EXPLAIN and pasted the query. I pressed enter.
The console printed a new table. I read the new table. The type column for the wp_postmeta rows changed. The type was ref. Ref means index lookup. The database did not scan the whole table. The database used the new index.
I ran the query manually. I pasted the query. I pressed enter. The results appeared quickly. The execution time was 0.4 seconds. The time dropped from 45 seconds to 0.4 seconds. The index improved the speed.
PHP and Nginx Verification
I fixed the database side. I wanted to verify the web server side. The cron job runs via PHP. WordPress cron jobs run when a user visits the site. This is called a pseudo-cron.
I checked the PHP configuration. I typed cat /etc/php/8.1/fpm/pool.d/www.conf. I pressed enter. I read the file. I looked at line 115. The line was pm.max_children = 50. The server allows 50 PHP processes. I looked at line 372. The line was request_terminate_timeout = 60s.
This setting is dangerous. The timeout was 60 seconds. The bad database query took 45 seconds. The query almost reached the timeout limit. If the query took 61 seconds, PHP would kill the process. The cron job would fail. The emails would not send.
I opened the PHP configuration file. I typed nano /etc/php/8.1/fpm/php.ini. I pressed enter. I searched for memory_limit. I found the line. The value was 256M. This is a normal value. I searched for max_execution_time. I found the line. The value was 30.
The max_execution_time was 30. But the FPM pool timeout was 60. The settings were inconsistent. I fixed the php.ini file. I changed max_execution_time to 60. I saved the file. I closed the editor.
I restarted the PHP service. I typed systemctl restart php8.1-fpm. I pressed enter. The service restarted.
I checked the Nginx configuration. I typed cat /etc/nginx/nginx.conf. I pressed enter. I read the file. I checked the worker settings. The line was worker_processes auto;. This setting is correct. The Nginx server scales to the CPU cores.
I checked the virtual host file. I typed cat /etc/nginx/sites-available/insurx.conf. I pressed enter. I read the location ~ \.php$ block. I looked for the fastcgi timeout settings.
location ~ \.php$ {
include snippets/fastcgi-php.conf;
fastcgi_pass unix:/var/run/php/php8.1-fpm.sock;
fastcgi_read_timeout 60s;
}
The fastcgi_read_timeout was 60 seconds. Nginx waits for PHP. Nginx waits for 60 seconds. If PHP takes 61 seconds, Nginx drops the connection. Nginx returns a 504 Gateway Timeout error.
Before I fixed the database, the cron job took 45 seconds. The user who triggered the cron job waited for 45 seconds. The user saw a blank screen for 45 seconds. This is a bad user experience. Now, the query takes 0.4 seconds. The user does not wait. The page loads fast.
Real Cron Implementation
The WordPress pseudo-cron system is flawed. It relies on website traffic. If no user visits the site at 10:00, the 10:00 cron job does not run. It waits for the next visitor. This is not precise. Insurance policy emails need precise timing.
I disabled the pseudo-cron. I opened the WordPress configuration file. I typed nano /var/www/html/wp-config.php. I pressed enter. I went to the end of the file. I added a new line.
define( 'DISABLE_WP_CRON', true );
I saved the file. I closed the editor. The website no longer triggers cron jobs on page visits.
I created a real system cron job. I used the crontab tool. I typed crontab -e -u www-data. I pressed enter. The -u www-data flag edits the crontab for the web server user. The editor opened.
I added a line at the bottom.
*/15 * * * * wp --path=/var/www/html cron event run --due-now > /dev/null 2>&1
I used the WP-CLI tool. The tool runs WordPress commands from the terminal. The */15 part means every 15 minutes. The command checks for due events. The command runs them. The > /dev/null 2>&1 part hides the output.
I saved the crontab. I closed the editor. The terminal printed crontab: installing new crontab. The system cron now controls the tasks. The timing is exact. The web requests are completely separated from background tasks.
Network Observation
I wanted to make sure everything was clean. I used tcpdump. The tcpdump command captures network packets. I typed tcpdump -i lo port 3306. I pressed enter. The -i lo flag listens on the local loopback interface. The port 3306 flag listens to MySQL traffic.
I watched the screen. The screen showed packet data. I saw fast bursts of packets. The web server asked MySQL for data. MySQL returned the data immediately. I did not see long pauses between request packets and response packets. The communication was fast.
I pressed Ctrl+C. The capture stopped. The terminal printed a summary. The tool captured 450 packets. The tool dropped 0 packets. The network stack was healthy.
The disk wait problem was resolved. The iowait remained at 1 percent. The memory buffer held the temporary tables. The database index sped up the queries. The real cron system ran tasks precisely. The infrastructure was stable. I logged out of the server. I typed exit. I pressed enter. The SSH connection closed.
评论 0