Linux Swap Exhaustion

Today my service process was running very sluggishly, so I took a look at the output of the free command:

$ free -m
             total       used       free     shared    buffers     cached
Mem:          7972        727       7244          0          0        420
-/+ buffers/cache:        306       7666
Swap:         1913       1913          0

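As you can see, physical memory is almost entirely free, yet swap is exhausted. As we understand it, Linux prefers physical memory and only starts to use swap once physical memory runs short. So the natural question (relax, this is not the one about excavators) is: why did this happen? Broken down, there are three questions:

What makes a Linux system start using swap before physical memory is used up?
In this scenario, what on earth is actually sitting in swap?
When this happens, how do you free the swap and get the system back to normal?

As per international convention, the first step is to search the web. Here, http://tech.foolpig.com/2012/1…, someone has written a script that reports how much swap each process is using.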

#!/bin/bash

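##############################################################################
# Purpose: list the processes that are currently using swap.
##############################################################################

echo -e "PID\t\tSwap\t\tProc_Name"

# Take every directory under /proc whose name is all digits (a numeric name is
# a process; others such as sys and net hold other information).
for pid in $(ls -l /proc | grep ^d | awk '{ print $9 }' | grep -v '[^0-9]')
do
    # The only way to make a process give up its swap is to restart it, or to
    # wait for it to release the swap on its own. If it released swap by itself
    # we would not be hunting for it with a script, so what we want is the list
    # of processes that hold swap and need a restart. init is the ancestor of
    # every process and restarting it means rebooting, so skip it.
    if [ "$pid" -eq 1 ]; then continue; fi
    # Match "Swap:" exactly so that lines such as "SwapPss:" are not counted.
    grep -q "^Swap:" /proc/$pid/smaps 2>/dev/null
    if [ $? -eq 0 ]; then
        swap=$(grep "^Swap:" /proc/$pid/smaps \
            | gawk '{ sum+=$2;} END{ print sum }')
        # Ask ps for this PID only; grepping "ps aux" for the number can also
        # match unrelated lines that happen to contain it.
        proc_name=$(ps -p "$pid" -o args=)
        if [ "${swap:-0}" -gt 0 ]; then
            echo -e "${pid}\t${swap}\t${proc_name}"
        fi
    fi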
done | sort -k2 -n | awk -F'\t' '{
    pid[NR]=$1;
    size[NR]=$2;
    name[NR]=$3;
}
END{
    for(id=1;id<=length(pid);id++)
    {
        if(size[id]<1024)
            printf("%-10s\t%15sKB\t%s\n",pid[id],size[id],name[id]);
        else if(size[id]<1048576)
            printf("%-10s\t%15.2fMB\t%s\n",pid[id],size[id]/1024,name[id]);
        else
            printf("%-10s\t%15.2fGB\t%s\n",pid[id],size[id]/1048576,name[id]);
    }
}'

Sample output:

[root@slide Downloads]# ./check_swap_use2.sh
PID		             Swap      Proc_Name
4165      	             32KB	klogd -x
4179      	             56KB	irqbalance
4162      	             64KB	syslogd -m 0
4221      	             64KB	/sbin/mingetty tty2
4222      	             64KB	/sbin/mingetty tty3
4223      	             64KB	/sbin/mingetty tty4
4224      	             68KB	/sbin/mingetty tty5
4225      	             68KB	/sbin/mingetty tty6
14145     	            124KB	rsync
14166     	            124KB	rsync
14139     	            128KB	rsync
31109     	            244KB	/bin/sh /usr/local/mysql-5.5.8/bin/mysqld_safe
18417     	            280KB	/usr/local/nagios-3.2.3/bin/nrpe
1478      	            376KB	/sbin/udevd -d
27204     	            400KB	/usr/bin/perl
4197      	            452KB	/usr/sbin/sshd
4210      	            508KB	crond
25954     	            832KB	/usr/bin/redis-server
25927     	            988KB	/usr/bin/redis-server
25964     	           1.01MB	/usr/bin/redis-server
24197     	           1.37MB	/usr/bin/redis-server
25860     	           1.55MB	/usr/bin/redis-server
26746     	           2.02MB	/usr/local/net-snmp-5.6.1/sbin/snmpd
17008     	           2.09MB	/usr/bin/redis-server
4897      	           4.23MB	/usr/local/redis-2.2.4/bin/redis-server
25677     	          11.74MB	/usr/bin/redis-server
19694     	          21.12MB	/usr/local/redis-2.4.17/bin/redis-server
25873     	          22.85MB	/usr/bin/redis-server
14631     	          23.45MB	/usr/bin/redis-server
31361     	          47.79MB	/usr/local/mysql-5.5.8/bin/mysqld
26043     	         110.27MB	/usr/bin/redis-server
25808     	         127.93MB	/usr/bin/redis-server

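Writing your own script certainly works, but is there an easier way? Here, http://stackoverflow.com/quest…, someone else offers a method that relies only on the system's own top. In top, press f (for "field") to bring up the column options and enable the SWAP column, then press capital O (presumably for "order") and choose to sort by the SWAP column. The result looks like this: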

  PID USER      PR  NI  VIRT  SHR  RES S %CPU %MEM    TIME+  SWAP COMMAND
19899 root      22   0 4710m  288 227m S    0  2.9   0:02.72 4.4g java
 3633 mysql     17   0  752m  116 1484 S    0  0.0   0:00.50 750m mysqld

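So java, the big guy, appears to occupy about 4.4G of swap (but note that, as we saw earlier, the whole swap partition is only about 2G), followed by the MySQL daemon at 750M (note again: added up this way, the per-process swap figures far exceed the total size of swap).

Unfortunately, simple as this method is, it is not reliable. In the comments under that Stack Overflow question someone says: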

Your accepted answer is wrong. Consider changing it to lolotux’s answer, which is actually correct.

Why is it unreliable? Here is the reason:

It is not possible to get the exact size of used swap space of a process. Top fakes this information by making SWAP = VIRT – RES, but that is not a good metric, because other stuff such as video memory counts on VIRT as well (for example: top says my X process is using 81M of swap, but it also reports my system as a whole is using only 2M of swap. Therefore, I will not add a similar Swap column to htop because I don’t know a reliable way to get this information (actually, I don’t think it’s possible to get an exact number, because of shared pages).
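The SWAP = VIRT - RES claim in the quote can be checked against the top output shown earlier. The sketch below redoes that subtraction for the java and mysqld rows; `to_mb` and `top_swap_mb` are helper names made up for this example, and it assumes that a size without a suffix is in KB, as in this version of top.

```shell
# Convert a top-style size (no suffix = KB, "m" = MB, "g" = GB) to MB.
to_mb() {
    awk -v v="$1" 'BEGIN {
        n = v + 0                          # numeric prefix, "4710m" -> 4710
        if (v ~ /g$/) n = n * 1024
        else if (v !~ /m$/) n = n / 1024   # no suffix: assume KB
        printf "%.1f\n", n
    }'
}

# SWAP the way the quote says top derives it: VIRT - RES, in MB.
top_swap_mb() {
    awk -v virt="$(to_mb "$1")" -v res="$(to_mb "$2")" \
        'BEGIN { printf "%.1f\n", virt - res }'
}

top_swap_mb 4710m 227m   # java   -> 4483.0
top_swap_mb 752m 1484    # mysqld -> 750.6
```

4483 MB is about 4.4 GB and 750.6 MB rounds to 750m, which are the values top printed in its SWAP column. That is consistent with the column being derived from VIRT and RES rather than measured.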

Further down there is a more highly voted answer, which again goes the write-your-own-script route:

#!/bin/bash
# Get current swap usage for all running processes
# Erik Ljungstrom 27/05/2011
# Modified by Mikko Rantalainen 2012-08-09
# Pipe the output to "sort -nk3" to get sorted output
# Modified by Marc Methot 2014-09-18
# removed the need for sudo

SUM=0
OVERALL=0
for DIR in `find /proc/ -maxdepth 1 -type d -regex "^/proc/[0-9]+"`
do
    PID=`echo $DIR | cut -d / -f 3`
    PROGNAME=`ps -p $PID -o comm --no-headers`
    for SWAP in `grep VmSwap $DIR/status 2>/dev/null | awk '{ print $2 }'`
    do
        let SUM=$SUM+$SWAP
    done
    if (( $SUM > 0 )); then
        echo "PID=$PID swapped $SUM KB ($PROGNAME)"
    fi
    let OVERALL=$OVERALL+$SUM
    SUM=0
done
echo "Overall swap used: $OVERALL KB"

I ran both scripts (after switching to root); here are the results:

# ./swap_stat_1.sh
PID		Swap		Proc_Name
# ./swap_stat_2.sh
Overall swap used: 0 KB

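Uh, what is that supposed to mean?

Nothing for it but to roll up my sleeves, look at how these scripts actually gather their numbers, and run the steps by hand to see where things go wrong.

The first script gathers its numbers by searching the smaps file under /proc for the word Swap, but when I ran that by hand, the smaps file of the java process did not contain the word Swap at all.

The second script uses the status file under /proc and looks for the string VmSwap; unfortunately that string is not in the file either.

So neither of them can report how much swap each process uses.

But why do these files lack the very fields that are supposed to report swap usage?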
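Neither script can work if the kernel does not export these fields at all. A quick way to check is sketched below; `has_field` is a helper name made up for this example. According to proc(5), VmSwap has only been present in /proc/PID/status since Linux 2.6.34, so an older kernel is one possible explanation for the empty output here, although the kernel version of this machine is not stated anywhere in this write-up.

```shell
# Print "yes" if the text on stdin has a line starting with "<field>:",
# otherwise "no". $1 is the field name, e.g. VmSwap or Swap.
has_field() {
    if grep -q "^$1:"; then echo yes; else echo no; fi
}

echo "kernel:           $(uname -r)"
echo "VmSwap in status: $(has_field VmSwap < /proc/self/status)"
if [ -r /proc/self/smaps ]; then
    echo "Swap in smaps:    $(has_field Swap < /proc/self/smaps)"
else
    echo "Swap in smaps:    smaps not readable"
fi
```

If either line reports "no", scripts that grep for that field will print nothing no matter how much swap is in use.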

Just as I felt I was running out of tricks, I happened to notice the command I had typed earlier to look up the java process ID:

# ps -ef|grep java
root     19899 19897  0 Sep23 pts/2    00:00:02 /usr/java/jdk1.7.0_02/bin/java -server -Xms4g -Xmx4g -Xmn2g -XX:PermSize=128m -XX:MaxPermSize=320m -XX:+UseConcMarkSweepGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:SurvivorRatio=8 -XX:+DisableExplicitGC -verbose:gc -Xloggc:/root/rmq_srv_gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -Djava.ext.dirs=/usr/local/rocketmq/bin/../lib -cp .:/usr/local/rocketmq/bin/../conf:.:/usr/java/jdk1.7.0_02/lib/tools.jar com.alibaba.rocketmq.namesrv.NamesrvStartup
root     20186  1906  0 17:04 pts/1    00:00:00 grep java

Laying out the command-line arguments of the java process one per line:

/usr/java/jdk1.7.0_02/bin/java
-server
-Xms4g
-Xmx4g
-Xmn2g
-XX:PermSize=128m
-XX:MaxPermSize=320m
-XX:+UseConcMarkSweepGC
-XX:+UseCMSCompactAtFullCollection
-XX:CMSInitiatingOccupancyFraction=70
-XX:+CMSParallelRemarkEnabled
-XX:SoftRefLRUPolicyMSPerMB=0
-XX:+CMSClassUnloadingEnabled
-XX:SurvivorRatio=8
-XX:+DisableExplicitGC
-verbose:gc
-Xloggc:/root/rmq_srv_gc.log
-XX:+PrintGCDetails
-XX:-OmitStackTraceInFastThrow
-Djava.ext.dirs=/usr/local/rocketmq/bin/../lib
-cp .:/usr/local/rocketmq/bin/../conf:.:/usr/java/jdk1.7.0_02/lib/tools.jar com.alibaba.rocketmq.namesrv.NamesrvStartup

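Two things stand out here: -Xms4g and -Xmx4g, which as far as I remember set the memory size of the JVM. Here, http://www.cnblogs.com/redcree…, there is an explanation:

Parameter   Meaning   Default

-Xms   Initial heap size. Default: 1/64 of physical memory (<1GB). By default (adjustable via the MinHeapFreeRatio parameter), when free heap memory drops below 40% the JVM grows the heap, up to the -Xmx limit.

-Xmx   Maximum heap size. Default: 1/4 of physical memory (<1GB). By default (adjustable via the MaxHeapFreeRatio parameter), when free heap memory exceeds 70% the JVM shrinks the heap, down to the -Xms limit.

-Xmn   Young generation size (1.4 or later). Note: the size here is (eden + 2 survivor spaces), which differs from the New gen shown by jmap -heap.
Total heap size = young generation size + old generation size + permanent generation size.
Enlarging the young generation shrinks the old generation. This value has a large effect on system performance; Sun officially recommends 3/8 of the whole heap.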

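So the initial heap size and the maximum heap size were both set to 4G at startup, and that is what leads to the later problems with the java process.

At this point I can put forward a tentative scenario. In the beginning the java process starts; because both its initial and maximum memory are configured so large, it consumes a lot of physical memory. After startup, however, it is not busy (judging from its CPU usage), so it makes little use of the memory it grabbed, and for a while everyone lives in peace. Then another process starts (my service), and it is no lightweight either: it needs a lot of memory too. Since the java process is idling, Linux evicts it from physical memory and pushes it out to swap. When my service finishes and I kill it, the physical memory it used is released. If the java process were active at that point it could make a comeback, but it is far too idle, so it stays banished to swap and never sees the light of day.

Ah, what a neat hypothesis. But how do I prove it? After some thought, this should do: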

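First, kill the java process and watch whether swap usage drops; that should show whether java really is what occupies swap.
Second, if step one works, restart java with the same arguments. We should then see java occupying physical memory; with 8G of RAM there is plenty for this spendthrift to live on. At this point physical memory usage should go up while swap usage stays low.
Third, start my service. Because it also asks the system for memory, the system does heavy IO to move java out to swap, which would be why things felt sluggish. After this step my service should hold a lot of physical memory, and swap should be in use as well.
Finally, kill my service. Physical memory usage should fall while swap usage stays high, which is exactly the scene this article opened with.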

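In the LOL spirit of taking wins and losses lightly and just going for it once the plan is made, I got ready to act, first confirming the current state of the system once more with top. (This is admittedly inconsistent, since I said above that the swap figures in top are inaccurate, but the accurate method has broken down, so this will have to do for now.)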

Mem:   8163812k total,   752632k used,  7411180k free,      352k buffers
Swap:  1959920k total,  1959920k used,        0k free,   425144k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  SWAP COMMAND
19899 root      22   0 4710m 234m  264 S    0  2.9   0:03.44 4.4g java

Kill the java process:

# kill -9 19899

Check free:

# free -m
             total       used       free     shared    buffers     cached
Mem:          7972        512       7459          0          0        427
-/+ buffers/cache:         85       7887
Swap:         1913       1546        367

Only 367MB is free and about 1.5G is still in use. Back to top:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  SWAP COMMAND
 3633 mysql     17   0  752m 1484  116 S    0  0.0   0:00.83 750m mysqld
22046 adenzhan  17   0  583m  212   36 S    0  0.0   0:00.40 583m QQActSrv
 4680 root      15   0 32752  284   88 S    0  0.0   0:01.05  31m redis-server
 2627 root      19   0 31208   20   16 S    0  0.0   0:00.02  30m sshd
 1765 root      16   0 29884 1684  784 S    0  0.0   0:01.91  27m agent
 1905 root      17   0 25436  420   16 S    0  0.0   0:00.00  24m su
28422 root      19   0 24072   88    8 S    0  0.0   0:00.00  23m smbd
28421 root      16   0 24072  196   12 S    0  0.0   0:00.01  23m smbd
 1878 root      16   0 15556  468  184 S    0  0.0   0:01.26  14m hald
 1820 messageb  15   0 14192  120   12 S    0  0.0   0:00.09  13m dbus-daemon
21756 root      16   0 13288   20   16 S    0  0.0   0:00.06  12m bash
 3663 root      15   0 13288   32   16 S    0  0.0   0:00.26  12m bash
 3670 root      17   0 13296  124   64 S    0  0.0   0:00.09  12m bash
 3673 root      16   0 13284  660   16 S    0  0.0   0:00.13  12m bash
 3672 root      17   0 13292  704   16 S    0  0.0   0:00.06  12m bash
 4115 at        16   0 12984  560  192 S    0  0.0   0:02.04  12m atd
29432 adenzhan  15   0 13828 1440   16 S    0  0.0   0:00.07  12m bash
23065 adenzhan  15   0 13828 1576   24 S    0  0.0   0:00.25  11m bash
 2637 root      17   0 12340  188  124 S    0  0.0   0:02.09  11m cron
22910 root      17   0 13116 1156   24 S    0  0.0   0:00.12  11m bash
 1906 root      16   0 13292 1896  596 S    0  0.0   0:00.13  11m bash
 2658 root      25   0 11384   16   12 S    0  0.0   0:00.04  11m mysqld_safe
22045 adenzhan  16   0 10488  192   28 S    0  0.0   0:00.03  10m QQActSrv
 2338 root      15   0 10360  340  196 S    0  0.0   0:03.29 9.8m syslog-ng

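The first two take the lion's share; together with the small fry below them the total comes to roughly 1676MB, which is in the same ballpark as the 1546MB that free reports as used.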
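The 1676MB figure can be reproduced by adding up the SWAP column of the top listing above. A minimal sketch, where `sum_swap_mb` is a name chosen for this example and values without a suffix are assumed to be in KB:

```shell
# Sum top-style SWAP values (one per line on stdin) and print the total in MB.
sum_swap_mb() {
    awk '{
        n = $1 + 0                          # numeric prefix, "9.8m" -> 9.8
        if ($1 ~ /g$/) n = n * 1024
        else if ($1 !~ /m$/) n = n / 1024   # no suffix: assume KB
        total += n
    } END { printf "%.1f\n", total }'
}

# The 24 SWAP values from the listing, top to bottom.
printf '%s\n' 750m 583m 31m 30m 27m 24m 23m 23m 14m 13m \
    12m 12m 12m 12m 12m 12m 12m 11m 11m 11m 11m 11m 10m 9.8m | sum_swap_mb
# prints 1676.8
```

Since this column is VIRT minus RES rather than measured swap, the sum landing about 130MB above the real usage is not surprising.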

Next, restart the java process.

# /usr/java/jdk1.7.0_02/bin/java -server -Xms4g -Xmx4g -Xmn2g -XX:PermSize=128m -XX:MaxPermSize=320m -XX:+UseConcMarkSweepGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:SurvivorRatio=8 -XX:+DisableExplicitGC -verbose:gc -Xloggc:/root/rmq_srv_gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -Djava.ext.dirs=/usr/local/rocketmq/bin/../lib -cp .:/usr/local/rocketmq/bin/../conf:.:/usr/java/jdk1.7.0_02/lib/tools.jar com.alibaba.rocketmq.namesrv.NamesrvStartup
The Name Server boot success.

Check memory again:

Mem:   8163812k total,   629772k used,  7534040k free,     2524k buffers
Swap:  1959920k total,  1583124k used,   376796k free,   484348k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  SWAP COMMAND
31368 root      18   0 4657m  62m  10m S    0  0.8   0:00.83 4.5g java

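Disaster: it is not playing by the script at all. The java process has been restarted, yet physical memory is still nearly empty (RES is only 62m despite -Xms4g), and swap usage has barely changed either.

Dead end. It seems the earlier question cannot be dodged after all: what exactly is occupying swap?

To check whether that free physical memory is genuinely usable, I wrote a small test program: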

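#include <stdio.h>
#include <stdlib.h>   // malloc
#include <unistd.h>   // sleep

int main() {
    int i = 0;
    while (1) {
        // Allocate 100MB per round and never free it, so usage keeps growing.
        int n = 100 * 1024 * 1024;
        char* p = (char*)malloc(n);
        if (p == NULL) {
            return 1;
        }
        // Touch every byte so the pages are actually backed by memory.
        for (int j = 0; j < n; j++) {
            p[j] = 0;
        }
        printf("%d\n", i);
        i++;
        sleep(1);
    }
    return 0;
}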

Once it was running I watched it continuously with vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 1958772 7174560 222488 185668    0    0     0     0  264  270  6  0 93  0  0
 1  0 1958772 7095324 222488 185668    0    0     0     0  263  266  6  0 94  0  0
 0  0 1958772 7072012 222488 185668    0    0     0     0  258  254  2  0 98  0  0
 0  0 1958772 6969340 222488 185668    0    0     0     0  259  266  7  0 92  0  0
 1  0 1958772 6924700 222488 185668    0    0     0     0  261  259  3  0 97  0  0
 0  0 1958772 6866792 222496 185660    0    0     0    16  271  269  4  0 95  0  0
 0  0 1958772 6764740 222496 185660    0    0     0     0  260  261  7  0 92  0  0
 1  0 1958772 6754572 222496 185660    0    0     0     0  260  257  1  0 99  0  0
 0  0 1958772 6662068 222496 185660    0    0     0     0  262  262  6  0 93  0  0
 1  0 1958772 6584196 222496 185660    0    0     0     0  263  266  6  0 94  0  0
 0  0 1958772 6559520 222496 185660    0    0     0     0  266  262  1  0 98  0  0
 0  0 1958772 6456848 222496 185660    0    0     0     0  262  264  7  0 93  0  0
 1  0 1958772 6413572 222496 185660    0    0     0     0  255  257  3  0 97  0  0
 0  0 1958772 6354424 222496 185660    0    0     0     0  260  261  4  0 96  0  0
 0  1 1959920 6251876 222496 184512    0    0     0     0  264  267  7  0 90  2  0
 1  1 1959920 6245952 222496 182456    0    0     0     0  265  286  1  0 87 13  0
 0  1 1959920 6281508 134508 140916    0    0     0     0  257  276  7  1 80 12  0

Free physical memory does keep going down, which shows that this memory really is usable.

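As I kept searching I found this, http://www.cnblogs.com/kumulin…, which says:

Running short of memory will undoubtedly cause swapping, but sometimes swapping can happen even when memory looks plentiful; this phenomenon is known as SWAP Insanity.

It goes on to describe some MySQL cases, for example:

The curse of NUMA

NUMA has been discussed at length in the MySQL community, so I won't repeat that here and will go straight to the tangled relationship between NUMA and swap.

A quick look at numactl, the core NUMA command:

shell> numactl --hardware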
available: 2 nodes (0-1)
node 0 size: 16131 MB
node 0 free: 100 MB
node 1 size: 16160 MB
node 1 free: 10 MB
node distances:
node 0 1
0: 10 20
1: 20 10
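You can see the system has two nodes (in effect, two physical CPUs), each given 16G of memory; node 0 has 100M left and node 1 has 10M left. Suppose a process that needs 11M is started and the system assigns it to node 1: although the total free memory of the system exceeds what the process needs, node 1 itself does not have enough, so swapping may still be triggered.

One thing to note: the per-node free memory shown by numactl does not include cache memory. To see it, we can first release the cache with the drop_caches parameter:

shell> sysctl vm.drop_caches=1
Note: this step may cause the system load to fluctuate.

Also: how do you find out which node a process is on and how its memory is distributed? There are ready-made scripts online.

To avoid the effect of NUMA on swap, the simplest way is to disable it when starting the process:

shell> numactl --interleave=all …
In addition, the kernel parameter zone_reclaim_mode is usually important too. When a node runs short of free memory, a value of 0 makes the system tend to allocate memory from remote nodes, while a value of 1 makes it tend to reclaim cache memory on the local node. Cache matters for performance most of the time, so 0 is the better choice.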

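Desperate times call for desperate measures. I remembered seeing MySQL in the top ranking earlier, which made me start to get suspicious.

Following the same recipe, let's take a look: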

# numactl --hardware
libnuma: Warning: /sys not mounted or no numa system. Assuming one node: No such file or directory
available: 1 nodes (0-0)
node 0 size: <not available>
node 0 free: <not available>
libnuma: Warning: Cannot parse distance information in sysfs: No such file or directory
No distance information available.

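This is enough to make you cough up blood. Still, even though the output is full of "not available", we at least learn that there is 1 node, in which case NUMA should have little to do with it, because NUMA is an issue on multi-core, multi-CPU architectures.

At this point readers may ask: why not just kill MySQL too? Better to kill a thousand by mistake than let one slip through; maybe killing it fixes everything. Sigh, it is a sad story: this is a shared development machine, and neither the java process nor the MySQL process is mine. When I killed java earlier I did it furtively, like a thief, and restarted it right away. But things being what they are, in for a penny, in for a pound: let's take MySQL out as well and see.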
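When numactl cannot read its data, the node count can also be read straight from sysfs, provided /sys is mounted. The sketch below counts node directories under /sys/devices/system/node, the usual location on Linux; `count_numa_nodes` and its optional directory argument are made up for this example. A result of 0 only means the information is unavailable, as it apparently was on this machine, not that the system has no memory node.

```shell
# Count NUMA node directories. $1 optionally overrides the sysfs directory.
# The body runs in a subshell so its variables do not leak into the caller.
count_numa_nodes() (
    dir="${1:-/sys/devices/system/node}"
    count=0
    for d in "$dir"/node[0-9]*; do
        [ -d "$d" ] && count=$((count + 1))
    done
    echo "$count"
)

echo "NUMA node directories: $(count_numa_nodes)"
```

A count of 1 matches the "available: 1 nodes" line above and makes a per-node shortage an unlikely cause.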

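Before killing MySQL, to confirm the current state once more, I launched my service again and watched the machine with vmstat.

Look at it in two parts. This is the record during compilation: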

 2  0 1013384 6359004  88980 1306032    0    0   116     0  274  351 23  1 75  0  0
 2  1 1013384 6685740  12668 1124360    0    0  1464    32  326  316 24  2 68  7  0
 2  1 1013384 6774932  12696 1124616    0    0   476     0  307  340 23  1 63 13  0
 1  3 1269272 6905620  12124 821184    0    0    92 11668  468  569 21  4 61 14  0
 2  1 1348368 6863208  12208 742004    0    0  1728     0  362  424 16  2 64 18  0
 0  2 1600160 7064960  12296 478072    0 106720   212 106848  475  479  7  5 78 10  0
 0  3 1653436 7122228  12356 424544    0 62032   780 62048  457  454  0  2 79 19  0
 0  4 1686524 7159148  12364 390036    0 38312  1152 38348  434  455  0  1 76 22  0
 0  5 1697276 7162764  12372 380728    0 17100  2852 19584  382  408  0  1 80 18  0
 0  5 1736972 7206564  12308 341244    0 69936   564 70636  437  349  0  1 82 16  0
 0  6 1752928 7240816  12372 322480    0 22624  1192 22624  401  421  0  1 67 32  0
 0  7 1755268 7239972  12488 326592    0 3976  6204  3980  670 1423  2  1 57 40  0
 0  8 1758032 7286824  12368 325028    0 4048  1108  4088  474 1374  1  2 61 37  0
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  5 1760724 7283756  12164 325752    0 5076  2520  5076  449 1305  1  1 61 37  0
 0  3 1766428 7315792  12188 326272    0 10028  1632 10088  553 1290  1  1 75 23  0
 0  3 1773476 7332700  12244 319736    0 15408    52 15480  592  380  0  1 80 19  0
 0  2 1783144 7360904   4196 300976    0 21268    60 21272  538  389  0  1 88 11  0
 0  3 1790256 7381244   4264 294616    0 14516    68 14592  495  412  0  1 82 18  0
 0  4 1796104 7393724   4332 287880    0 10960    68 18608  464  421  0  1 75 24  0
 0  4 1802432 7406564   4412 281944    0 11920    64 12140  495  406  0  1 81 18  0
 0  2 1816432 7421220   4444 267332    0 19328    28 19332  511  355  0  0 81 19  0
 0  2 1843756 7457948   2416 232680    0 34324    80 34324  541  452  0  1 87 11  0
 0  2 1844524 7477880   2484 231908    0 1744    68  1744  444  399  0  0 87 12  0
 0  3 1868084 7503132   2568 208124    0 31556    84 31648  706  441  0  2 83 15  0
 0  2 1878660 7513636   2628 197612    0 11148    60 11148  487  397  0  1 81 18  0
 0  4 1887364 7526612   2692 189704    0 9716  1080  9716  500  464  0  1 83 16  0
 0  2 1892252 7527292   2740 190488    0 11036  2096 11036  497  762  2  1 66 31  0
 0  3 1905636 7542664   2760 176580    0 15116   104 15148  529  539  0  1 85 14  0
 0  4 1911460 7545984   2836 170656    0 6472   372  9288  507  507  0  1 80 19  0
 0  5 1919496 7556060   2896 162600    0 8696   128  8944  583  447  0  1 77 22  0
 0  4 1940944 7574196   3028 140100    0 26088  1268 26088  830 1343  0  2 77 20  0
 0  4 1940944 7562912   3096 143648    0 6492  3492  6492  592  874  3  1 80 16  0
 0  2 1945880 7542348   3156 138988    0 5752   484  5752  488  617  1  1 80 18  0
 0  5 1958360 7560328   3200 124944    0 18068   224 18092  563  609  0  1 73 26  0
 0  4 1958852 7571568   3208 120276    0 5344    44  5348  478  432  0  0 66 33  0
 0  3 1959920 7582708   3216 119404    0 20468   264 20568  628  655  0  1 69 29  0
 0  4 1959920 7593388   3244 119116    0 12704    88 12780  482  466  0  1 68 31  0
 1  2 1959920 7551580   3312 119048    0    0  1328     0  425 1336  2  1 71 26  0
 3  0 1959920 7352008   3364 120024    0    0  1492     0  437  754 26  3 64  7  0
 3  0 1959920 7090496   3380 118980    0    0    24     0  262  352 36  1 63  0  0
 3  0 1959920 7162064   3412 122032    0    0   488     0  291  416 33  2 63  2  0

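The block in/out and the user-mode CPU usage are easy to understand: the compiler has to read source files, or intermediate object files, from disk, and after analysis and code generation write the results back to disk, so a certain amount of disk IO accompanied by user-mode CPU usage is perfectly reasonable.

What is abnormal is the large amount of swap out during this process: a lot of data is moved from memory out to swap while plenty of physical memory is free. This is the crux of the problem.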
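The abnormal pattern, swap out while free memory is plentiful, can be picked out of vmstat output mechanically. A small sketch: `flag_odd_swapout` is a hypothetical helper, the 1GB threshold is an arbitrary choice, and the column positions (free in column 4, so in column 8) assume the vmstat layout shown above.

```shell
# Print vmstat samples where pages are swapped out although at least
# $1 KB of memory is free. Header lines are skipped.
flag_odd_swapout() {
    awk -v min_free="$1" '
        $1 ~ /^[0-9]+$/ && NF >= 8 && $8 > 0 && $4 >= min_free {
            printf "so=%d KB/s while free=%d KB\n", $8, $4
        }'
}

# Two samples from the compile run above; live use would be
#   vmstat 1 | flag_odd_swapout 1048576
flag_odd_swapout 1048576 <<'EOF'
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  1 1013384 6774932  12696 1124616    0    0   476     0  307  340 23  1 63 13  0
 0  2 1600160 7064960  12296 478072    0 106720   212 106848  475  479  7  5 78 10  0
EOF
# prints: so=106720 KB/s while free=7064960 KB
```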

Now the record after the service starts running.

 1  0 1959920 7310056   5308 361716    0    0     0     0  256  271 12  0 88  0  0
 1  0 1959920 7250908   5308 361716    0    0     0     0  263  278 12  0 87  0  0
 1  1 1959920 7238144   5320 361704    0    0     0 26556  549  284 12  1 77 10  0
 0  1 1959920 7397820   5624 370652    0    0   604     0  336 1094  8  2 86  4  0
 1  0 1959920 7266440   7416 380168    0    0 11808    24  721 1217  2  1 87 10  0
 0  1 1869272 7258256   7416 471052 71724    0 71724     0 2512 4778  2  5 87  5  0
 0  1 1801124 7200844   7416 539488 57280    0 57280     0 2048 3858  0  4 88  8  0
 0  3 1748096 7180076   7428 594000 21120    0 21164  8692 1890 3519  0  3 77 20  0
 0  4 1741260 7166096   7460 603176 6880    0 10720   356  960 2134  0  1 82 17  0
 0  4 1712208 7130728   7536 632928 30064    0 31244    68 3146 9281  1  5 71 22  0
 0  2 1683784 7100300   7660 660832 31236    0 32644    64 3231 7258  0  5 84 11  0
 0  1 1626832 7042160   7684 716336 63136    0 63204     0 2766 5449  0  5 87  8  0
 0  1 1570020 6978300   7684 773016 63868    0 63868     0 2284 4316  0  4 88  8  0
 0  1 1533192 6944836   7704 811304 33440    0 33444    96 1496 2651  0  3 87 10  0
 0  1 1402664 6879132   7704 941080 65516    0 65516     0 2329 4389  1  4 88  8  0
 0  1 1372672 6841724   7712 971796 37304    0 37304    56 1569 2799  0  3 87 10  0
 1  0 1330788 6797952   7712 1013560 44356    0 44356     0 1689 3114  0  2 88 10  0
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  1 1284572 6761992   7712 1059876 35848    0 35848     0 1421 2588  0  2 87 10  0
 0  1 1234892 6723428   7712 1109148 38444    0 38444     0 1488 2711  0  3 87 10  0
 1  1 1190096 6571444   7788 1179532 36016    0 36100     0 1466 5867  6  3 82  9  0
 0  1 1146008 6503228   7840 1256876 30428    0 33832    24 1909 7419  5  3 80 11  0
 0  1 1115240 6460324   7840 1287056 37596    0 37596     0 1443 8404  3  2 84 10  0
 2  0 1022508 6424984   7840 1380024 30636    0 30636     0 1225 7257  3  3 85 10  0
 0  0 1013384 6403092   7852 1393088 3212    0  3212     0  368 2449  9  0 88  3  0
 0  1 1013384 6401604   7852 1393088    0    0     0 72076  439 1327  0  1 95  4  0
 0  1 1013384 6403588   7872 1394096    0    0     0   328  317 1526  0  0 86 13  0
 0  0 1013384 6403588   7872 1394096    0    0     0     0  260 1568  0  0 100  0  0
 0  0 1013384 6403736   7872 1394096    0    0     0     0  261 1563  0  0 100  0  0
 0  0 1013384 6403736   7872 1394096    0    0     0     0  258 1570  0  0 100  0  0

As the service starts, a large amount of data is swapped back in from swap to memory; this is the point at which the service feels noticeably sluggish.
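To put a number on "a large amount", the si column can be summed and compared with the drop in swpd. A rough sketch, assuming one sample per second (so a rate in KB/s approximates KB per sample) and the same column layout as above; `swapin_summary` is a name invented here.

```shell
# Summarise vmstat output on stdin: total of the si column and the
# difference between the first and last swpd values.
swapin_summary() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 8 {
            if (!seen) { first = $3; seen = 1 }
            last = $3
            si_total += $7
        }
        END {
            printf "si total: %d KB, swpd drop: %d KB\n", si_total, first - last
        }'
}

# Three consecutive samples from the run above.
swapin_summary <<'EOF'
 1  0 1959920 7266440   7416 380168    0    0 11808    24  721 1217  2  1 87 10  0
 0  1 1869272 7258256   7416 471052 71724    0 71724     0 2512 4778  2  5 87  5  0
 0  1 1801124 7200844   7416 539488 57280    0 57280     0 2048 3858  0  4 88  8  0
EOF
# prints: si total: 129004 KB, swpd drop: 158796 KB
```

The two figures only agree approximately, since si is an averaged rate while swpd is an instantaneous reading, but both point to well over 100MB being read back from disk within a couple of seconds.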

Now kill MySQL.

# ps -ef|grep mysql
root      2658     1  0 Sep23 ?        00:00:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --datadir=/data/dbdata/data --pid-file=/data/dbdata/mysql.pid
mysql    17007  2658  0 Oct10 ?        00:00:00 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/data/dbdata/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=/data/dbdata/mysql_error.log --open-files-limit=10240 --pid-file=/data/dbdata/mysql.pid --socket=/data/dbdata/mysql.sock --port=3306
root     30613 30597  0 10:43 pts/6    00:00:00 grep mysql

# kill -9 17007
# kill -9 2658

It seems to restart automatically:

# ps -ef|grep mysql
mysql    31041     1  0 10:45 ?        00:00:00 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/data/dbdata/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=/data/dbdata/mysql_error.log --open-files-limit=10240 --pid-file=/data/dbdata/mysql.pid --socket=/data/dbdata/mysql.sock --port=3306
root     31361 30597  0 10:46 pts/6    00:00:00 grep mysql

Kill it again:

# kill -9 31041

That did it:

# ps -ef|grep mysql
root     31435 30597  0 10:47 pts/6    00:00:00 grep mysql

While we are at it, check free:

# free -m
             total       used       free     shared    buffers     cached
Mem:          7972       1557       6415          0          9       1377
-/+ buffers/cache:        169       7802
Swap:         1913        877       1036

Good, swap usage has come down. Now let's compile and run the service again while watching with vmstat.

Same as before: heavy swap out during compilation

 2  1 980432 6860160   1604 990932    0    0     0     0  261  303 24  1 62 13  0
 0  2 1017904 6974216   1696 963716    0    0  4652     8  447  520 10  1 76 13  0
 0  2 1127556 7007912   2252 843568    0 16240   768 16256  462  693  0  1 76 24  0
 0  6 1322852 7051004   2276 642040    0 28176  2888 40660  538  666  1  3 71 26  0
 0  5 1643840 7070812   2000 318864    0 55128  1528 55128  439  481  1  6 50 43  0
 0  3 1685084 7102164   1820 278164    0 35564   364 35568  504  508  0  1 73 25  0
 1  4 1744040 7126012   1864 218672    0 38428   148 38716  411  473  0  2 79 19  0
 0  5 1784564 7171124   1876 173576    0 52944    72 52956  416  357  0  2 71 27  0
 1  5 1795480 7174312   1968 163804    0 44836   640 46732  463  635  1  1 69 29  0
 0  6 1803120 7233380   2040 157464    0 24212   968 24220  420  451  2  1 80 17  0
 0  6 1812772 7273884   2100 147456    0 35412   184 35552  465  480  1  1 78 20  0
 0  5 1814572 7278328   2120 144904    0 6404   100  6404  482  398  0  1 76 23  0
 0  4 1815704 7295964   2152 143852    0 4148   104  4152  517  379  0  1 75 24  0
 0  3 1817916 7304692   2180 142188    0 11632   228 11640  559  546  0  1 76 23  0
 3  4 1820684 7311924   2176 137488    0 12860   180 12972  529  563  0  1 73 26  0
 0  6 1822180 7321416   2192 135088    0 6184    56  6260  461  345  0  0 76 24  0
 1  4 1829204 7345044   2208 128292    0 30316   208 30320  689  568  0  1 73 26  0
 1  3 1834580 7358464   2212 124096    0 15564   264 15564  530  570  0  1 74 25  0
 1  1 1837560 7375300   2220 120404    0 16584   312 16584  518  589  0  1 74 25  0
 1  4 1842588 7392688   2232 115984    0 16668   148 16684  555  544  0  1 75 24  0
 0  5 1846988 7384104   2248 111004    0 20672   184 20672  460  479  0  1 73 26  0
 0  5 1847640 7406224   2268 110520    0 24800   184 24824  475  500  0  1 76 23  0
 2  4 1856736 7309712   2304 101772    0 20768   564 20768  486  574 16  2 60 22  0
 1  4 1861856 7258344   2340  97444    0 5552   268  5552  551  676 11  2 57 30  0
 2  3 1864524 7121924   2364  93280    0 21484   200 21484  537  539 20  2 63 15  0
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  5 1864524 7325636   2404  96488    0 9548   288  9552  494  534 21  2 60 17  0
 1  5 1864676 7427988   2416  98824    0 7944   228  7956  480  366 12  1 64 23  0
 2  4 1864676 7191512   2444  98928    0 15584   308 19512  619  520 19  1 59 20  0
 1  5 1865156 7196544   2476  97364    0 19000    36 19164  504  351 10  2 59 30  0
 2  2 1865352 7401708   2488 101748    0 16784     4 16784  769  332 29  2 53 16  0
 3  1 1869596 7254136   2504  94632    0 31984    88 31984  600  558 27  3 55 15  0
 3  2 1872636 6989796   2524  92624    0 21340   260 21344  534  514 28  3 51 18  0
 3  3 1874220 6763748   2540  90472    0 12092     0 15864  618  292 36  2 43 20  0
 3  1 1878596 6885760   2552  89196    0 35932   720 35936  666  374 34  2 43 21  0
 4  0 1882364 6784592   2552  79056    0 14960     0 14960  610  284 36  2 50 12  0
 3  1 1886068 6999564   2428  75392    0 20904    16 20908  618  342 35  2 50 13  0
 3  1 1891016 7095012   2420  75320    0 24708   156 24708  607  394 33  3 51 14  0
 2  5 1897472 7224532   2472  83056    0 26692   300 40684  602  462 34  2 40 23  0
 1  4 1899236 7621272   2480  74012    0 10172   104 10188  645  429 20  2 29 48  0
 3  0 1899236 7153676   2508  73984    0    0    16   104  301  274 28  2 47 23  0
 3  0 1899236 7022244   2512  73980    0    0    20     0  261  298 35  2 63  0  0

and heavy swap in once the service starts running

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 1899236 7403648   4660 297992    0    0     0     0  259  223 12  0 88  0  0
 0  1 1899236 7565008   4976 306928    0    0   436     0  330 1062  9  1 88  3  0
 1  0 1899236 7436296   6792 316420    0    0 11856     0  725 1185  2  1 87 10  0
 0  1 1844588 7368352   6792 371932 64928    0 64928     0 2289 4303  2  4 88  6  0
 0  1 1792108 7365004   6800 424200 67024    0 67024    48 2922 5579  0  5 87  8  0
 0  3 1777704 7346276   6820 439524 18592    0 18616  8844 1469 2551  0  2 84 14  0
 1  2 1754960 7317704   6948 464576 24292    0 26588     0 2535 5185  0  3 84 12  0
 1  1 1732796 7292252   7020 486816 23500    0 24356     0 2402 4492  0  2 87 10  0
 1  0 1716448 7267476   7060 503880 28056    0 28448     0 2249 3774  0  3 87 10  0
 0  1 1673680 7222296   7060 546672 45176    0 45176     0 1681 3091  0  2 88 10  0
 0  1 1647400 7188072   7060 572704 34172    0 34172     0 1339 2382  0  2 87 10  0
 1  0 1612980 7144300   7060 606688 43644    0 43644     0 1631 2995  0  2 88 10  0
 0  1 1576360 7094080   7060 643304 50344    0 50344     0 1864 3426  0  4 87  9  0
 0  1 1540148 7044420   7060 680128 49760    0 49760     0 1863 3430  0  3 88  9  0
 0  1 1511728 7005112   7068 708428 39176    0 39176    28 1584 2837  0  3 87 10  0
 0  1 1475812 6960720   7068 744108 44440    0 44440     0 1650 3028  0  3 87  9  0
 0  1 1445400 6925256   7068 774252 35256    0 35256     0 1377 2470  0  2 88 10  0
 1  0 1416776 6895156   7068 803572 30140    0 30140     0 1255 2221  0  2 87 10  0
 1  0 1392580 6871472   7068 827836 23576    0 23576     0  998 1716  0  2 88 10  0
 0  1 1364792 6832536   7068 854968 38692    0 38692     0 1481 2677  0  3 87 10  0
 0  1 1336872 6794964   7068 883056 37836    0 37836     0 1446 2603  0  2 87 10  0
 0  1 1277036 6770660   7068 943312 24252    0 24252     0 1025 1774  0  2 88 11  0
 0  1 1248952 6740528   7076 971268 29932    0 29932    24 1306 2309  0  2 86 12  0
 0  1 1206072 6708412   7076 1014120 31896    0 31896     0 1270 2254  0  1 88 11  0
 0  2 1152052 6578112   7116 1068260 43016    0 43080     0 1623 4425  3  3 85  8  0
 0  2 1129556 6508608   7204 1136560 18872    0 18976     0 1370 3832  5  3 79 13  0
 2  0 1084612 6469112   7224 1199936 19616    0 22968     0 1187 6305  3  2 84 11  0
 0  1 1057784 6442944   7248 1226144 22152    0 22276     0 1048 5929  2  2 85 11  0
 0  1 967744 6424096   7248 1316964 15668    0 15696     0  892 5197  1  2 86 11  0
 1  1 898092 6394212   7260 1386324 14672    0 14672 37396 1257 5781  8  3 83  6  0
 0  1 898092 6387144   7288 1389380    0    0    12 31572  481 1601  3  1 84 13  0
 0  0 898092 6387144   7288 1389380    0    0     0     0  262 1582  0  0 93  7  0

So killing MySQL does not help either.

Searching further, I found this, http://askubuntu.com/questions…, which mentions something called swappiness:

What is swappiness and how do I change it?

The swappiness parameter controls the tendency of the kernel to move processes out of physical memory and onto the swap disk. Because disks are much slower than RAM, this can lead to slower response times for system and applications if processes are too aggressively moved out of memory.

swappiness can have a value of between 0 and 100

swappiness=0 tells the kernel to avoid swapping processes out of physical memory for as long as possible

swappiness=100 tells the kernel to aggressively swap processes out of physical memory and move them to swap cache

The default setting in Ubuntu is swappiness=60. Reducing the default value of swappiness will probably improve overall performance for a typical Ubuntu desktop installation. A value of swappiness=10 is recommended, but feel free to experiment. Note: Ubuntu server installations have different performance requirements to desktop systems, and the default value of 60 is likely more suitable.

To check the swappiness value

cat /proc/sys/vm/swappiness

To change the swappiness value

A temporary change (lost on reboot) with a swappiness value of 10 can be made with

sudo sysctl vm.swappiness=10

To make a change permanent, edit the configuration file with your favorite editor:

gksudo gedit /etc/sysctl.conf

Search for vm.swappiness and change its value as desired. If vm.swappiness does not exist, add it to the end of the file like so:

vm.swappiness=10

Save the file and reboot.

However, when I checked this variable in my environment, this is what I got:

$ cat /proc/sys/vm/swappiness
10

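Are you kidding me...

At this point we seemed to have hit a dead end, so let's step back and recap. We started from the observation that physical memory was nearly empty while swap was full, and went looking for whatever was occupying swap, using various direct and indirect methods, and then killed Java and MySQL one after another. Up to then we still believed that the kernel's memory management was working normally, and that some peculiarity of our workload had produced the empty-RAM, full-swap picture. But after Java and MySQL had both been killed, physical memory and swap were both freed up, and on the next run the system again hammered swap while physical memory sat idle. That made us suspect the kernel's memory management itself. Yet a quick check showed that the swappiness setting was fine too. So where is the problem?

Further down that same page, people mention using swapoff and swapon to turn swap off and on. That is about as drastic as it gets, so let's try it: run swapoff -a and watch with vmstat at the same time.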

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  2 420968 6591000    636 1468136    3    3     8     6    0    0  0  0 99  0  0
 0  2 400720 6570968    440 1488896 20248    0 20252     0 5327 10388  0 10 75 15  0
 0  2 377560 6549048    388 1510540 23156    0 23164     4 6055 12964  0 10 74 16  0
 1  1 350740 6522324    368 1537300 26812    0 26812     8 6966 14147  0 11 75 14  0
 1  1 321720 6493740    368 1565100 28984    0 28992    56 7523 16173  0 10 74 16  0
 1  1 295620 6468000    372 1590800 26096    0 26100     0 6788 13731  0 11 74 15  0
 0  2 269620 6449592    376 1611356 26000    0 26432     4 6778 14532  0 11 74 15  0
 1  1 242364 6422372    372 1638092 27252    0 27832     4 7099 15180  0 12 74 14  0
 0  4 222304 6400124    400 1658648 20032    0 24044     0 5331 11827  0  9 74 17  0
 1  1 199892 6378832    396 1680272 22384    0 22388    28 5871 12186  0 10 76 14  0
 1  1 178436 6360520    392 1697836 21372    0 22724     4 5627 11596  0 10 76 14  0
 1  1 150812 6334388    376 1724580 27624    0 28172     0 7181 14521  0 10 75 15  0
 1  1 120884 6304084    380 1753372 29916    0 31000     8 7767 17782  0 10 73 17  0
 1  1  82764 6267040    380 1791408 38120    0 38120     0 9790 19308  0 10 75 15  0
 1  1  48192 6232144    380 1826360 34572    0 35068     0 8912 18106  0 12 74 13  0
 1  1  15452 6198328    384 1859264 32728    0 34568     0 8487 19192  0 10 72 18  0
 0  1      0 6180212    416 1877740 15444    0 20384     0 4233 8065  0  5 79 16  0
 0  2      0 6177676    556 1879656    0    0  7076    24  531  509  0  1 81 18  0
 2  3      0 6168072    868 1884484    0    0  7232     0  486 1164  0  1 78 21  0
 1  0      0 6167860    956 1889536    0    0  4792     0  372 1166  0  1 81 18  0
 0  1      0 6176860    664 1880576    0    0   836     0  276  258  0  1 87 12  0

You can see heavy I/O as the system moved everything in swap back into physical memory. Once that finished, let's look at free:

$ free -m
             total       used       free     shared    buffers     cached
Mem:          7972       1936       6036          0          0       1829
-/+ buffers/cache:        106       7866
Swap:            0          0          0
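
Turning swap off forces everything held in swap back into RAM, so it only works when there is room for it. A minimal sketch of a pre-check, assuming a hypothetical helper named can_swapoff that reads the MemFree, Cached, SwapTotal and SwapFree fields from a /proc/meminfo-style file. Treating free plus cached memory as the available room is a rough assumption, not an exact rule.

```shell
#!/bin/bash
# Hypothetical helper (not from the original post): compare swap in use
# against free + cached RAM before attempting `swapoff -a`.
# Argument: path to a meminfo-format file (defaults to /proc/meminfo).
can_swapoff() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemFree:"   { free = $2 }
        $1 == "Cached:"    { cached = $2 }
        $1 == "SwapTotal:" { stotal = $2 }
        $1 == "SwapFree:"  { sfree = $2 }
        END {
            used = stotal - sfree
            printf "swap used: %d kB, free+cached: %d kB\n", used, free + cached
            # Exit status 0 means the swapped-out data should fit in RAM.
            rc = (free + cached > used) ? 0 : 1
            exit rc
        }
    ' "$meminfo"
}

# Only query the live system when /proc/meminfo is actually readable.
if [ -r /proc/meminfo ]; then
    if can_swapoff; then
        echo "swapoff -a looks feasible"
    else
        echo "not enough room in RAM to swapoff"
    fi
fi
```

The function only reports; it never calls swapoff itself, so running it has no side effects.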

OK, we still have about 6 GB of free physical memory, so let's rebuild and run the business process and see what happens:

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1      0 5468392    916 1901912    0    0   268     4  305  406 31  2 52 14  0
 3  1      0 5198028    924 1900876    0    0   124     0  276  267 36  1 50 13  0
 3  2      0 5613288    844 1897872    0    0   740 10320  431  521 31  3 49 17  0
 3  1      0 5308244    812 1898932    0    0   412     4  276  287 32  2 52 14  0
 3  1      0 5112304    816 1897900    0    0   504     0  318  278 36  1 50 13  0
 3  1      0 5824276    832 1906108    0    0  2312     8  390  461 29  4 52 15  0
 3  1      0 5386064    784 1904100    0    0   428     8  282  258 36  2 50 13  0
 2  2      0 5383212    724 1903132    0    0   220  7544  350  409 34  2 46 17  0
 3  1      0 5608280    712 1906228    0    0  1984     8  367  403 33  3 50 14  0
 3  1      0 5576976    768 1905144    0    0   464     8  309  364 35  1 49 14  0
 3  1      0 5271180    804 1907164    0    0    56     0  275  285 35  1 50 14  0
 0  4      0 6092084    828 1905084    0    0  2572     8  437  750 23  4 53 21  0
 2  3      0 5932212    844 1899928    0    0  1000 11032  444 1112  7  2 66 25  0
 0  3      0 6045028    980 1901848    0    0  2532     0  381  619  9  1 64 26  0
 4  0      0 5863828   1064 1898680    0    0  1124    44  393  770  7  1 65 27  0
 3  1      0 5564052   1032 1900768    0    0  3088     0  337  374 35  3 50 12  0
 3  1      0 5625928   1096 1903788    0    0   440     4  326  399 34  2 52 12  0
 3  1      0 5571116    808 1898936    0    0   716  3508  381  431 34  2 47 17  0

During compilation, memory usage fluctuates at around 500-800 MB, which is understandable and acceptable. Now let's look at startup:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  1      0 6180156    384 1875716    3    3     8     6    0    0  0  0 99  0  0
 0  1      0 6180336    404 1876724    0    0   788     0  309  292  0  0 87 12  0
 1  2      0 6178192    628 1876500    0    0  3988     0  449  521  0  0 86 14  0
 0  2      0 6171380    652 1882644    0    0 10388     0  576  621  0  1 83 15  0
 0  3      0 6096204    720 1893884    0    0 11724     0  395  480  5  2 80 13  0
 1  1      0 5972932    960 1929624    0    0  2836     4  411 2259  5  2 73 20  0
 0  3      0 5913788   1308 1971424    0    0  9456     0  496 3027 10  1 67 21  0
 0  3      0 5900968   1112 1969564    0    0  4260 42588  576 1657  8  1 73 18  0
 0  6      0 5903616   1028 1965536    0    0  2940 26108  512 1623  0  1 74 25  0
 0  8      0 5946120    800 1924644    0    0  2276     0  368 1297  0  1 81 18  0
 1  3      0 5966924    844 1904040    0    0  2208     0  353 1731  0  1 82 18  0
 0  2      0 5980372   1024 1895636    0    0  4060   388  352 1801  0  1 80 20  0
 0  1      0 5983440   1032 1894600    0    0  1404    44  333 1690  0  0 86 13  0
 0  1      0 5984280   1048 1892528    0    0  1136  3556  310 1663  0  0 86 13  0
 0  1      0 5984124   1036 1892540    0    0   920    36  287 1641  0  0 88 12  0
 0  1      0 5989944    920 1887516    0    0   408   136  304 1651  0  0 88 12  0
 0  1      0 5991924    908 1886500    0    0     8     4  262 1646  0  0 87 12  0
 0  1      0 5992576    756 1885624    0    0  1060    96  329 1635  0  0 87 13  0
 0  1      0 5996540    724 1881544    0    0     0    28  270 1553  0  0 88 12  0
 0  1      0 5997800    488 1880752    0    0     0    40  265 1528  0  0 87 12  0
 0  1      0 5998040    416 1879796    0    0     0    24  266 1542  0  0 87 12  0

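Once started, the process uses only about 200 MB of memory, and more importantly, the business process is no longer sluggish. That is a major breakthrough.

Then I noticed another problem: the load average stays at around 1 and will not come down.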

# uptime
  3:53pm  up 21 days  5:08,  11 users,  load average: 1.65, 1.20, 1.11

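top showed a process called kswapd0 burning CPU. I looked it up and found this page, http://my.oschina.net/davehe/b…?, which says it is something that manages virtual memory, so presumably it could be killed as well: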

# ps -ef|grep swap
root       368     1  0 Sep23 ?        00:18:43 [kswapd0]
root     28397 30597  0 16:03 pts/6    00:00:00 grep swap
# kill -9 368

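It could not be killed. The process name is wrapped in square brackets; see here, http://www.cnblogs.com/vamei/a…?:

Some entries in the third column are enclosed in square brackets []. They are parts of the kernel, dressed up as processes so that the operating system can manage them more easily. We need not concern ourselves with them.

What a tragedy.

Still, what we have done so far only treats the symptom, not the cause. In the spirit of "no zuo no die", let's turn swap back on.

# swapon -a

As soon as it was enabled, swap was in use again: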

$ free -m
             total       used       free     shared    buffers     cached
Mem:          7972        669       7303          0          1        137
-/+ buffers/cache:        530       7441
Swap:         1913       1721        192

At the same time, the system load came down:

# uptime
  4:26pm  up 21 days  5:42,  11 users,  load average: 0.06, 0.59, 0.88

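It feels as if kswapd0 was straining to move things from physical memory out to swap. With swap unavailable it just kept burning CPU; the moment the door opened it gleefully hauled everything out to swap, and then settled down, satisfied.

It just can't go without its medicine...

Anyway, with a new lead in hand, let's change direction and search again. The first hit, http://www.gossamer-threads.co…?, says: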

Q1: Why does the kswapd0 process from time to time take up 100% CPU?

I can’t answer this, other than say that something is using up your memory and the kernel is processing heavily (assessing) the number of free pages in the system to see if they are getting too low and therefore it should start swapping pages out into your swap partition. It runs on a timer, so it will check this every now and then and therefore you will find it fluctuates over time.

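Over here, http://forums.debian.net/viewt…?, a bunch of people are having a lively discussion. This guy had it far worse than I did: his freezes lasted for hours, he could not even log in on a terminal, and swapping could bring the whole system down. Poor fellow. Here is what he describes: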

– The system starts swapping highly erratically, no matter what swappiness value is set. On the one hand 99% of memory can be in use without swapping for a long time but on the other it suddenly swaps when only using a fraction of it.
– When it swaps it swaps almost everything. Sometimes really everything. The memory usage will drop to a few percent. After that the process is reversed and the swapped data is read into memory again (at least parts of it). Until there’s about the same percentage of ram and swap in use.
– It seems to swap the cached/buffered data too (but it might only seem that way)
– After standby it almost always starts to swap within the first minute. No matter how much memory is actually used. So before going into standby I have to kill most of the programs if I don’t want to wait hours (no kidding, it’s really hours. See next point) for it to finish. It does so even if standby mode lasted for only 1 second. Maybe the system gets confused and thinks the data is all very old?
– Swapping uses a lot, really a lot of resources and freezes the whole computer. Happened to only swap about 100 MB but the pc was frozen for an hour with the swap process using all cpu and all IO resources – even on new hardware. I don’t know what it does, but I can encrypt a whole harddisk in a fraction of this time. Even login on console isn’t possible anymore (timeout). After waking up from standby this results in a black screen until it’s finished (which may take half an hour, 1 hour or even forever, see below).
– The swap process starts without even using swap. I’ve currently no swap partition/file in use and even turned swappiness to 0. Still the process kicks in from time to time, using a lot of cpu and 99% of IO (according to iotop), slowing down the whole PC. Even happens with 2GB of 3,5GB Ram unused (except for cache). This especially happens when I’m copying some files. I guess the swap process kicks in since it thinks that all the cached/buffered data is actual memory usage and wants to swap it. But since there’s no swap the process randomly access all my harddisks. Don’t know why, maybe it’s looking for something swap-like.
– Sometimes the swap process seems to be caught in some form of loop. Even without swap being installed. The process is taking all of the resources and won’t stop. There’s no other way than a hardcore shutdown (unplugging power or pressing power button for 4 secs). Even the magic SysRq key fails.
– While it swaps parts of the system may crash, like the file browser or even kernel modules. Very often I have to manually modprobe psmouse to use the mouse pointer again since it stays frozen.

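That is truly miserable. The first reply suggests recompiling the kernel, which... well. While I'm at it, for the record: my environment is the 64-bit build of tlinux, kernel 2.6.16.60-0.21.

So this looks like a kernel problem. I therefore turned to the tlinux development support team for help.

At first the tlinux folks proposed the same idea I had started with: that under some special circumstances pages had been "squeezed" out to swap and then never touched again, so they simply stayed in swap. But the swapoff/swapon behaviour I described easily shows that this hypothesis does not hold. So they dug further, and eventually found that the min_free_kbytes parameter had been set to about 5 GB: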

# cat -n /etc/sysctl.conf | grep min_free_kbytes
    39	vm.min_free_kbytes=5000000
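
To get a feel for how large that setting is, here is a rough sketch. The helper name min_free_report is hypothetical, and the 8163328 kB total is an assumption derived from the 7972 MB shown by free -m (7972 * 1024). The watermark lines assume the classic rule used by kernels of that era (low = min + min/4, high = min + min/2) applied to the whole machine rather than per zone, so treat them as an approximation only.

```shell
#!/bin/bash
# Hypothetical helper (not from the original post): show how big
# vm.min_free_kbytes is relative to total RAM, plus approximate
# low/high reclaim watermarks under the assumed min*5/4 and min*3/2 rule.
min_free_report() {
    local min_kb="$1" memtot_kb="$2"
    local pct=$(( min_kb * 100 / memtot_kb ))
    local low_kb=$(( min_kb + min_kb / 4 ))
    local high_kb=$(( min_kb + min_kb / 2 ))
    echo "min_free_kbytes is ${pct}% of RAM"
    echo "approx. low watermark: $(( low_kb / 1024 )) MB"
    echo "approx. high watermark: $(( high_kb / 1024 )) MB"
}

# Values from this post: 5000000 kB configured, about 7972 MB of RAM.
min_free_report 5000000 8163328
# prints:
#   min_free_kbytes is 61% of RAM
#   approx. low watermark: 6103 MB
#   approx. high watermark: 7324 MB
```

If that approximation holds, the kernel was being asked to keep well over half of RAM free. That would be consistent with the 7303 MB "free" reading seen right after swapon -a, and with kswapd0 spinning while swap was off and it had nowhere to push anonymous pages.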

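The meaning of this parameter is fairly clear from its name. This page, https://my.vertica.com/docs/7….?, gives a reference method for choosing its value: multiply total memory (in kB) by 16 and take the square root.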

To manually set min_free_kbytes:

Determine the current/default setting with the following command:
/sbin/sysctl vm.min_free_kbytes
If the result of the previous command is No such file or directory or the default value is less than 4096, then run the command below:
memtot=`grep MemTotal /proc/meminfo | awk '{printf "%.0f",$2}'`
echo "scale=0;sqrt ($memtot*16)" | bc
Edit or add the current value of vm.min_free_kbytes in /etc/sysctl.conf with the value from the output of the previous command.

# The min_free_kbytes setting

vm.min_free_kbytes=5572

Run sysctl -p to apply the changes in sysctl.conf immediately.

Note: These steps will need to be replicated for each node in the cluster.
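
The formula in the steps above, sqrt(MemTotal * 16) with MemTotal in kB, can be sketched as a small shell function. The function name is hypothetical, and the 8163328 kB input is an assumption derived from the 7972 MB total shown by free -m (7972 * 1024); the real MemTotal in /proc/meminfo may differ slightly.

```shell
#!/bin/bash
# Hypothetical helper: compute sqrt(MemTotal_kB * 16), truncated to an
# integer, mirroring the `bc` one-liner with scale=0 quoted above.
recommended_min_free_kbytes() {
    local memtot_kb="$1"
    awk -v m="$memtot_kb" 'BEGIN { printf "%d\n", sqrt(m * 16) }'
}

# About 7972 MB of RAM, as on the machine in this post.
recommended_min_free_kbytes 8163328   # prints 11428
```

On a live system the input could instead be taken from the MemTotal line of /proc/meminfo, as the quoted steps do.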

Following this method, I edited the configuration file, changed the value to 11428, and applied the change:

# sysctl -p

Swap usage dropped immediately:

# free -m
             total       used       free     shared    buffers     cached
Mem:          7972       1970       6001          0          0       1836
-/+ buffers/cache:        133       7839
Swap:         1913          0       1913

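I compiled and ran the business process again: no stalls, and the system load was normal. With that, the problem was finally solved.

Having learned this the hard way: what kernel parameters for virtual memory are there, anyway? This page, https://access.redhat.com/docu…?, lists quite a few:

5.5. Tuning Virtual Memory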

Virtual memory is typically consumed by processes, file system caches, and the kernel. Virtual memory utilization depends on a number of factors, which can be affected by the following parameters.
swappiness
A value from 0 to 100 which controls the degree to which the system swaps. A high value prioritizes system performance, aggressively swapping processes out of physical memory when they are not active. A low value prioritizes interactivity and avoids swapping processes out of physical memory for as long as possible, which decreases response latency. The default value is 60.

A high swappiness value is not recommended for database workloads. For example, for Oracle databases, Red Hat recommends a swappiness value of 10.
vm.swappiness=10
min_free_kbytes

The minimum number of kilobytes to keep free across the system. This value is used to compute a watermark value for each low memory zone, which are then assigned a number of reserved free pages proportional to their size.

Extreme values can break your system

Be cautious when setting this parameter, as both too-low and too-high values can be damaging.

Setting min_free_kbytes too low prevents the system from reclaiming memory. This can result in system hangs and OOM-killing multiple processes.

However, setting this parameter to a value that is too high (5-10% of total system memory) will cause your system to become out-of-memory immediately. Linux is designed to use all available RAM to cache file system data. Setting a high min_free_kbytes value results in the system spending too much time reclaiming memory.

dirty_ratio

Defines a percentage value. Writeout of dirty data begins (via pdflush) when dirty data comprises this percentage of total system memory. The default value is 20.

Red Hat recommends a slightly lower value of 15 for database workloads.

dirty_background_ratio

Defines a percentage value. Writeout of dirty data begins in the background (via pdflush) when dirty data comprises this percentage of total memory. The default value is 10. For database workloads, Red Hat recommends a lower value of 3.

dirty_expire_centisecs

Specifies the number of centiseconds (hundredths of a second) dirty data remains in the page cache before it is eligible to be written back to disk. Red Hat does not recommend tuning this parameter.

dirty_writeback_centisecs

Specifies the length of the interval between kernel flusher threads waking and writing eligible data to disk, in centiseconds (hundredths of a second). Setting this to 0 disables periodic write behavior. Red Hat does not recommend tuning this parameter.

drop_caches

Setting this value to 1, 2, or 3 causes the kernel to drop various combinations of page cache and slab cache.

1

The system invalidates and frees all page cache memory.

2

The system frees all unused slab cache memory.

3

The system frees all page cache and slab cache memory.

This is a non-destructive operation. Since dirty objects cannot be freed, running sync before setting this parameter’s value is recommended.

Important

Using the drop_caches to free memory is not recommended in a production environment.
To set these values temporarily during tuning, echo the desired value to the appropriate file in the proc file system. For example, to set swappiness temporarily to 50, run:
# echo 50 > /proc/sys/vm/swappiness
To set this value persistently, you will need to use the sysctl command.
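
Before changing any of the parameters described above, it helps to see what they are currently set to. A minimal sketch, assuming a hypothetical helper named show_vm_params that reads the files under /proc/sys/vm; the directory can be overridden, which is only there to make the function easy to try against a scratch directory.

```shell
#!/bin/bash
# Hypothetical helper: print the current values of the virtual memory
# tunables discussed above, in sysctl.conf-style "vm.name=value" form.
# Argument: directory holding the parameter files (default /proc/sys/vm).
show_vm_params() {
    local dir="${1:-/proc/sys/vm}"
    local p
    for p in swappiness min_free_kbytes dirty_ratio dirty_background_ratio \
             dirty_expire_centisecs dirty_writeback_centisecs; do
        if [ -r "$dir/$p" ]; then
            printf 'vm.%s=%s\n' "$p" "$(cat "$dir/$p")"
        else
            # Older or stripped-down kernels may lack some of these files.
            printf 'vm.%s=<not available>\n' "$p"
        fi
    done
}

show_vm_params
```

A value far outside the usual range, such as the min_free_kbytes setting in this post, stands out quickly in this kind of listing.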

Note that when tuning min_free_kbytes, a value that is either too large or too small will cause problems.

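9 thoughts on “Linux swap partition exhausted”

wanax on 2014-10-10 at 13:55 said:

This post is really well written.

- ZRJ on 2014-10-10 at 14:21 said:

  It isn't finished yet. I got this far and ran out of ideas, so I posted a half-finished version first so that another friend could take a look.

  - pishuang on 2016-04-19 at 11:50 said:

    Very nice post! Here is my understanding. After setting vm.swappiness=0 and running sysctl -p, restarting the non-system processes listed by the first script will actually free the memory. This is especially visible when restarting MySQL: in free -m you can clearly see swap usage dropping, because MySQL flushes its in-memory data to disk and releases memory during the restart. As I understand it, vm.swappiness is the percentage of total memory below which available memory triggers use of the swap partition. The default is 60, meaning the system starts using swap once available memory falls below 60%. So your Java process had already taken 50% of memory as soon as it started, which led to the later problems.

    - ZRJ on 2016-04-19 at 18:56 said:

      You actually read an article this long. That takes some effort, haha.

      - pishuang on 2016-04-19 at 19:40 said:

        One problem leads to more problems, each with an explanation, so I could not help reading to the end.
      - ZRJ on 2016-04-20 at 08:23 said:

        Haha, I could only write things up like this back when I wasn't busy. These days I just fix problems as fast as possible.
CZL on 2019-01-10 at 10:22 said:

Very detailed, and it helped me solve my problem. Thanks.

- ZRJ on 2019-01-10 at 15:38 said:

  Hehe.

jcren on 2019-04-29 at 09:00 said:

It didn't solve my problem, but the post is great. Thanks~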