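Running into trouble on the way to production
For the qualifying round of a certain contest I needed about 50 hosts on which Linux commands could be run, so I decided to provide them with LXD/LXC.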
This post is for anyone stuck on an error like the following.
Failed to allocate directory watch: Too many open files
Experiment
Environment
- Sakura Cloud (CPU 1 core, memory 1 GB, SSD 20 GB)
- Ubuntu 16.04
- LXD 2.8
Setup
$ sudo apt -y install software-properties-common
$ sudo add-apt-repository ppa:ubuntu-lxc/lxd-stable
$ sudo apt update && sudo apt dist-upgrade
$ sudo apt -y install lxd zfs
$ newgrp lxd
$ sudo lxd init
Launching 20 LXD containers
Typing the command by hand 20 times is tedious, so writing a script would probably be faster.
ubuntu@lxd01:~$ lxc launch ubuntu:16.04 c01
Creating c01
Starting c01
ubuntu@lxd01:~$ lxc launch ubuntu:16.04 c02
Creating c02
Starting c02
...
ubuntu@lxd01:~$ lxc launch ubuntu:16.04 c20
Creating c20
Starting c20
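The repeated launches above could be scripted along these lines. This is only a minimal sketch: the function names `container_names` and `launch_all` are made up for illustration, and the naming scheme (c01, c02, ...) and the `ubuntu:16.04` image are simply taken from the transcript.

```shell
# Print zero-padded container names c01..cNN, one per line.
container_names() {
    i=1
    while [ "$i" -le "$1" ]; do
        printf 'c%02d\n' "$i"
        i=$((i + 1))
    done
}

# Launch the containers one after another; stop at the first failure.
launch_all() {
    for name in $(container_names "$1"); do
        lxc launch ubuntu:16.04 "$name" || return 1
    done
}
```

After loading these definitions, `launch_all 20` would issue the same 20 `lxc launch` commands shown above.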
Checking the launched containers
ubuntu@lxd01:~$ lxc list
+------+---------+----------------------+------+------------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+---------+----------------------+------+------------+-----------+
| c01 | RUNNING | 10.58.243.4 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c02 | RUNNING | 10.58.243.10 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c03 | RUNNING | 10.58.243.124 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c04 | RUNNING | 10.58.243.145 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c05 | RUNNING | 10.58.243.2 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c06 | RUNNING | 10.58.243.174 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c07 | RUNNING | 10.58.243.252 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c08 | RUNNING | 10.58.243.218 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c09 | RUNNING | 10.58.243.247 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c10 | RUNNING | 10.58.243.93 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c11 | RUNNING | 10.58.243.189 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c12 | RUNNING | 10.58.243.13 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c13 | RUNNING | 10.58.243.90 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c14 | RUNNING | 10.58.243.177 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c15 | RUNNING | 10.58.243.71 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c16 | RUNNING | 10.58.243.248 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c17 | RUNNING | | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c18 | RUNNING | | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c19 | RUNNING | | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c20 | RUNNING | | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
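Something is off
After about 16 containers had been launched, the remaining ones (c17 through c20) were not assigned an IP address.
Let's compare the process state of c16 and c17.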
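With more containers, spotting the empty IPV4 cells by eye gets harder. The following is a rough sketch that filters the table printed by `lxc list`; the function name `running_without_ipv4` is hypothetical, and it assumes the column order NAME, STATE, IPV4 shown in the listing above.

```shell
# Read `lxc list` table output on stdin and print the names of containers
# that are RUNNING but have an empty IPV4 column.
# Assumes the columns are ordered NAME | STATE | IPV4 | ... as in the listing.
running_without_ipv4() {
    awk -F'|' '
        NF >= 4 {
            name = $2; state = $3; ipv4 = $4
            gsub(/[ \t]/, "", name)
            gsub(/[ \t]/, "", state)
            gsub(/[ \t]/, "", ipv4)
            # Border lines contain no "|" and the header has STATE, not RUNNING.
            if (state == "RUNNING" && ipv4 == "") print name
        }'
}
```

It would be used as `lxc list | running_without_ipv4`.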
ubuntu@lxd01:~$ lxc exec c16 -- ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.2 37396 2552 ? Ss 04:07 0:01 /sbin/init
root 46 0.0 0.1 33436 1652 ? Ss 04:07 0:00 /lib/systemd/systemd-journald
root 214 0.0 0.0 16128 80 ? Ss 04:09 0:00 /sbin/dhclient -1 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df
root 306 0.0 0.1 26068 1096 ? Ss 04:09 0:00 /usr/sbin/cron -f
root 308 0.0 0.1 20100 1268 ? Ss 04:09 0:00 /lib/systemd/systemd-logind
root 310 0.0 0.3 436792 3176 ? Ssl 04:09 0:00 /usr/lib/accountsservice/accounts-daemon
daemon 311 0.0 0.0 26044 880 ? Ss 04:09 0:00 /usr/sbin/atd -f
syslog 313 0.0 0.1 186900 1428 ? Ssl 04:09 0:00 /usr/sbin/rsyslogd -n
message+ 314 0.0 0.1 42896 1552 ? Ss 04:09 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root 324 0.0 0.2 65520 2724 ? Ss 04:09 0:00 /usr/sbin/sshd -D
root 325 0.0 0.7 263660 7796 ? Ssl 04:09 0:00 /usr/lib/snapd/snapd
root 340 0.0 0.1 12844 1056 console Ss+ 04:09 0:00 /sbin/agetty --noclear --keep-baud console 115200 38400 9600 linux
root 363 0.0 0.3 277080 3292 ? Ssl 04:09 0:00 /usr/lib/policykit-1/polkitd --no-debug
root 666 0.0 0.1 34424 1892 ? Rs+ 04:17 0:00 ps aux
ubuntu@lxd01:~$ lxc exec c17 -- ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 37264 808 ? Ss 04:07 0:00 /sbin/init
root 203 0.0 0.1 34424 1596 ? Rs+ 04:17 0:00 ps aux
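From c17 onward, only /sbin/init is running; systemd has not brought up any services.
As a test, let's try starting sshd in one of the affected containers (c18 in the log below).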
root@c18:~# systemctl start ssh
Failed to allocate directory watch: Too many open files
Job for ssh.service canceled.
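The containers themselves have hardly any files open, so the host as a whole appears to have hit a resource limit. The "directory watch" in the message is an inotify watch, so the limit in question is most likely one of the host's inotify limits.
If you see
Failed to allocate directory watch: Too many open files
you may be hitting the same problem.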
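To check whether the host is close to its inotify limits, something like the following sketch could be run on the host. The function names are made up for illustration, the directory arguments exist only so the functions can be pointed at a test directory, and `find -lname` is a GNU find extension (available on Ubuntu). Note that `max_user_instances` is counted per user on the host, not per container.

```shell
# Print the three inotify limits of the host.
# The optional argument overrides the directory that is read.
show_inotify_limits() {
    dir=${1:-/proc/sys/fs/inotify}
    for key in max_queued_events max_user_instances max_user_watches; do
        printf 'fs.inotify.%s = %s\n' "$key" "$(cat "$dir/$key")"
    done
}

# Count the inotify instances currently open in all processes the caller
# is allowed to inspect. Each instance appears as a file descriptor that
# is a symlink to anon_inode:inotify.
count_inotify_instances() {
    find "${1:-/proc}"/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l
}
```

Running `show_inotify_limits` and `sudo`-invoking `count_inotify_instances` before and after launching containers gives a rough idea of how quickly the instances are used up.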
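Reading the documentation carefully
Looking through the official LXD repository on GitHub, I found that production-setup.md spells out exactly what needs to be considered for a production deployment.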
Fix
Append the following to /etc/security/limits.conf:
* soft nofile 1048576
* hard nofile 1048576
root soft nofile 1048576
root hard nofile 1048576
* soft memlock unlimited
* hard memlock unlimited
Append the following to /etc/sysctl.conf:
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576
vm.max_map_count = 262144
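If these settings are added from a provisioning script rather than by hand, it helps to avoid duplicating lines when the script runs twice. A minimal sketch, assuming a hypothetical helper named `append_once` and a script run as root:

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For example, `append_once /etc/sysctl.conf 'fs.inotify.max_user_instances = 1048576'` adds the setting on the first run and does nothing on later runs.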
After making both changes, reboot.
Verification
Confirm that processes are running properly in every container.
ubuntu@lxd01:~$ lxc list
+------+---------+----------------------+------+------------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+---------+----------------------+------+------------+-----------+
| c01 | RUNNING | 10.58.243.4 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c02 | RUNNING | 10.58.243.10 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c03 | RUNNING | 10.58.243.124 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c04 | RUNNING | 10.58.243.145 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c05 | RUNNING | 10.58.243.2 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c06 | RUNNING | 10.58.243.174 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c07 | RUNNING | 10.58.243.252 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c08 | RUNNING | 10.58.243.218 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c09 | RUNNING | 10.58.243.247 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c10 | RUNNING | 10.58.243.93 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c11 | RUNNING | 10.58.243.189 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c12 | RUNNING | 10.58.243.13 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c13 | RUNNING | 10.58.243.90 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c14 | RUNNING | 10.58.243.177 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c15 | RUNNING | 10.58.243.71 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c16 | RUNNING | 10.58.243.248 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c17 | RUNNING | 10.58.243.36 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c18 | RUNNING | 10.58.243.184 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c19 | RUNNING | 10.58.243.211 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
| c20 | RUNNING | 10.58.243.23 (eth0) | | PERSISTENT | 0 |
+------+---------+----------------------+------+------------+-----------+
ubuntu@lxd01:~$ lxc exec c18 -- ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.3 0.1 37396 1184 ? Ss 04:39 0:00 /sbin/init
root 45 0.1 0.1 33436 1172 ? Ss 04:39 0:00 /lib/systemd/systemd-journald
root 50 0.0 0.0 41724 208 ? Ss 04:39 0:00 /lib/systemd/systemd-udevd
root 240 0.0 0.0 16120 52 ? Ss 04:40 0:00 /sbin/dhclient -1 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df
message+ 312 0.0 0.0 42896 616 ? Ss 04:40 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root 317 0.0 0.0 20100 392 ? Ss 04:40 0:00 /lib/systemd/systemd-logind
root 319 0.1 0.0 198124 0 ? Ssl 04:40 0:00 /usr/lib/snapd/snapd
root 322 0.0 0.0 27728 148 ? Ss 04:40 0:00 /usr/sbin/cron -f
daemon 323 0.0 0.0 26044 264 ? Ss 04:40 0:00 /usr/sbin/atd -f
root 324 0.1 0.0 634952 404 ? Ssl 04:40 0:00 /usr/lib/accountsservice/accounts-daemon
syslog 325 0.0 0.0 186900 580 ? Ssl 04:40 0:00 /usr/sbin/rsyslogd -n
root 326 0.0 0.0 65520 212 ? Ss 04:40 0:00 /usr/sbin/sshd -D
root 348 0.0 0.0 14476 312 console Ss+ 04:40 0:00 /sbin/agetty --noclear --keep-baud console 115200 38400 9600 linux
root 376 0.0 0.0 277180 368 ? Ssl 04:41 0:00 /usr/lib/policykit-1/polkitd --no-debug
root 466 0.0 0.1 34424 1648 ? Rs+ 04:42 0:00 ps aux
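Checking every container by hand does not scale to 50 hosts, so the same check could be looped. This is a sketch under assumptions: the function name `find_stalled_containers` is made up, and the threshold of three processes is an arbitrary cut-off chosen because the broken containers above showed only /sbin/init and ps itself.

```shell
# Print the names of containers whose process list is suspiciously short,
# i.e. little more than /sbin/init and the ps command itself.
find_stalled_containers() {
    for name in "$@"; do
        # Count the lines of ps output, not counting the header line.
        procs=$(lxc exec "$name" -- ps aux | tail -n +2 | wc -l | tr -d ' ')
        if [ "$procs" -lt 3 ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

Calling `find_stalled_containers c01 c02 c03` prints nothing when all three containers have brought up their services.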
And they all lived happily ever after.