First, list the running PHP processes:
ps -ef | grep php
www-data  3250 28792  0 17:02 ?        00:00:01 php-fpm: pool www
www-data  3252 28792  0 17:02 ?        00:00:04 php-fpm: pool www
root      3435   955  0 17:44 pts/0    00:00:00 grep php
root     28792     1  0 Dec08 ?        00:00:04 php-fpm: master process (/usr/local/php/etc/php-fpm.conf)
www-data 28794 28792  0 Dec08 ?        00:00:16 php-fpm: pool www
www-data 29499 28792  0 Dec08 ?        00:00:09 php-fpm: pool www
www-data 29699 28792  0 Dec08 ?        00:00:04 php-fpm: pool www
The system has five php-fpm worker processes (plus the master process).
Trace one of them:
strace -p 29499
[root@grande web]# strace -p 29499
Process 29499 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 1000) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 1000) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 1000) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 1000) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 1000) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 1000) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 1000) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 1000) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 1000) = 0 (Timeout)
poll([{fd=11, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
The process does nothing but call poll, over and over.
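What poll does: it checks whether any of the file descriptors in fds is readable. If one is, it returns immediately, and the return value is the number of ready descriptors. If none is, the process sleeps for up to timeout milliseconds and wakes as soon as a descriptor becomes ready. If the timeout expires with nothing ready, poll returns 0.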
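To make the return values concrete, here is a minimal Python sketch of the same behavior. It is not what php-fpm runs: the socket pair is only a stand-in for the connection on fd 11, and the 50 ms timeout is an arbitrary choice for the demo. Python's select.poll wraps the same system call and returns a list of ready descriptors where the C call returns a count.

```python
import select
import socket

# A connected socket pair stands in for the remote connection (assumption:
# in the trace above the real descriptor is fd 11 inside php-fpm).
reader, writer = socket.socketpair()
fd = reader.fileno()

poller = select.poll()
poller.register(fd, select.POLLIN | select.POLLPRI)

# Nothing has been written yet, so both calls time out and return an empty
# list, which corresponds to "= 0 (Timeout)" in the strace output.
idle = poller.poll(0)        # timeout 0 ms: check once, never block
idle_wait = poller.poll(50)  # timeout 50 ms: sleep, then give up

# Once data arrives, poll returns right away and reports the ready descriptor.
writer.send(b"x")
ready = poller.poll(1000)

print("no data, timeout 0:    ", idle)
print("no data, timeout 50 ms:", idle_wait)
print("after a write:         ", ready)

reader.close()
writer.close()
```

A worker that only ever sees the empty result, as in the trace, is waiting for a peer that never sends anything.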
In other words, the process is stuck waiting on fd 11.
The process we are tracing is 29499, so look up its file descriptors:
ll /proc/29499/fdinfo/
ll /proc/29499/fd/11
[root@grande web]# ll /proc/29499/fdinfo/
total 0
-r--------. 1 www-data www-data 0 Dec 9 17:54 0
-r--------. 1 www-data www-data 0 Dec 9 17:54 1
-r--------. 1 www-data www-data 0 Dec 9 17:54 10
-r--------. 1 www-data www-data 0 Dec 9 17:54 11
-r--------. 1 www-data www-data 0 Dec 9 17:54 2
-r--------. 1 www-data www-data 0 Dec 9 17:54 3
-r--------. 1 www-data www-data 0 Dec 9 17:54 4
-r--------. 1 www-data www-data 0 Dec 9 17:54 5
-r--------. 1 www-data www-data 0 Dec 9 17:54 6
-r--------. 1 www-data www-data 0 Dec 9 17:54 7
-r--------. 1 www-data www-data 0 Dec 9 17:54 8
-r--------. 1 www-data www-data 0 Dec 9 17:54 9
[root@grande web]# ll /proc/29499/fd/11
lrwx------. 1 www-data www-data 64 Dec 9 14:54 /proc/29499/fd/11 -> socket:[376937]
So fd 11 is a socket, and its inode is 376937.
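The lookup above can also be scripted. The following is a rough sketch, assuming a Linux /proc filesystem; parse_socket_link and socket_inode are hypothetical helper names, not part of any tool used here. It reads the same symlink that ll shows and extracts the inode number.

```python
import os
import re
import socket

def parse_socket_link(target):
    """Extract the inode from a link target such as 'socket:[376937]'.

    Returns None when the descriptor is not a socket.
    """
    match = re.fullmatch(r"socket:\[(\d+)\]", target)
    return int(match.group(1)) if match else None

def socket_inode(pid, fd):
    """Read /proc/<pid>/fd/<fd> (Linux only) and return its socket inode."""
    return parse_socket_link(os.readlink(f"/proc/{pid}/fd/{fd}"))

# The link target seen in the listing above.
print(parse_socket_link("socket:[376937]"))

# Demo against our own process, guarded so it only runs where /proc exists.
if os.path.isdir("/proc/self/fd"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(socket_inode(os.getpid(), s.fileno()))
```

Reading another user's process this way needs the same permissions as the ll command, which is why the session above runs as root.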
[root@grande web]# netstat -e
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address          Foreign Address           State       User     Inode
tcp        0      0 grande:27017           grande:33459              ESTABLISHED root     358776
tcp        9      0 localhost:cslistener   localhost:45635           CLOSE_WAIT  www-data 376904
tcp        0      0 grande:27017           grande:33794              ESTABLISHED root     363255
tcp        0      0 grande:ssh             10.10.10.132:55591        ESTABLISHED root     383850
tcp        0      0 grande:33459           grande:27017              ESTABLISHED www-data 358775
tcp        0      0 grande:ssh             10.10.10.132:61749        ESTABLISHED root     380111
tcp        0      0 grande:52228           54.183.84.53:https        ESTABLISHED www-data 376937
tcp        0      0 grande:microsoft-ds    10.10.10.132:49162        ESTABLISHED root     372651
tcp        0      0 grande:ssh             10.10.10.191:64207        ESTABLISHED root     241791
tcp        0      0 grande:27017           grande:35471              ESTABLISHED root     389527
tcp        0      0 grande:35471           grande:27017              ESTABLISHED www-data 389526
tcp        9      0 localhost:cslistener   localhost:EtherNet/IP-2   CLOSE_WAIT  www-data 364747
tcp        0      0 grande:npmp-local      10.10.10.191:50332        ESTABLISHED nginx    392610
tcp        0      0 grande:35466           grande:27017              ESTABLISHED www-data 389471
The command takes a while to run before the full output appears.
Find the line whose Inode is 376937:
tcp 0 0 grande:52228 54.183.84.53:https ESTABLISHED www-data 376937
The remote IP is 54.183.84.53, which is exactly the IP of merchant.wish.com. So we can conclude that the process is stuck on a call to that API.
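You can also look at /proc/net/tcp directly:
vim /proc/net/tcp
50: FC0A0A0A:EB41 0BC60834:01BB 01 00000000:00000000 00:00000000 00000000 501 0 364865 1 ffff880218516380 130 3 28 4 7
I did not fully understand this output, but at least it shows that this is a TCP application.
The check_antp that ships with nagios is too minimal: apart from printing state statistics, it offers no parameters at all. When you have to cover different application servers, alerting becomes a real problem, so I decided to write my own check script. One difference between running a script and typing a command by hand is that efficiency has to be considered. Running netstat -ant periodically to collect statistics on a high-concurrency machine is clearly not a good fit. Reading the data straight from the proc filesystem is much faster.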
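Reading /proc/net/tcp from a script could look like the sketch below. It assumes IPv4 and a little-endian machine such as x86, where the kernel prints each address as a byte-swapped hex word; decode_addr and parse_tcp_line are hypothetical helper names. Fields are split on whitespace: index 1 is the local address, 2 the remote address, 3 the state, 7 the uid and 9 the inode.

```python
import socket
import struct

# TCP state codes as printed in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def decode_addr(hex_addr):
    """Turn 'FC0A0A0A:EB41' into ('10.10.10.252', 60225).

    Assumes a little-endian host, so the 32-bit word is unpacked as '<I'.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)

def parse_tcp_line(line):
    """Parse one data row of /proc/net/tcp into a small dict."""
    fields = line.split()
    return {
        "local": decode_addr(fields[1]),
        "remote": decode_addr(fields[2]),
        "state": TCP_STATES.get(fields[3], fields[3]),
        "uid": int(fields[7]),
        "inode": int(fields[9]),
    }

# The row shown above.
SAMPLE = ("50: FC0A0A0A:EB41 0BC60834:01BB 01 00000000:00000000 "
          "00:00000000 00000000 501 0 364865 1 ffff880218516380 130 3 28 4 7")
entry = parse_tcp_line(SAMPLE)
print(entry)
```

Decoded this way, the sample row is an ESTABLISHED connection from 10.10.10.252:60225 to 52.8.198.11:443 with inode 364865, so it is not the row for inode 376937 found with netstat. To locate a specific socket, one would loop over the file (skipping the header line) and compare each row's inode field.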
52228:
This is the local port of the connection in the netstat line above (grande:52228), and it uses the Transmission Control Protocol. TCP is one of the main protocols in TCP/IP networks. It is connection-oriented: it requires a handshake to set up end-to-end communication, and only once the connection is established can user data be sent in both directions over it.
TCP guarantees that data sent on port 52228 is delivered in the same order in which it was sent. This guarantee is the main difference between TCP and UDP: a UDP port 52228 would offer no such guarantee.