Kubernetes监控插件-NPD
Node Problem Detector(NPD)是Kubernetes集群中一个重要的监控插件,主要作用是监控节点的健康状况并检测可能出现的问题。如NTP、文件系统损坏、内核死锁、CPU、内存或磁盘损坏、运行时守护进程无响应等。
部署
使用chart安装
# helm -n monitoring install node-problem-detector https://dl.vqiu.cn/helm_chart/node-problem-detector/node-problem-detector-2.3.14.tgz
NOTES:
To verify that the node-problem-detector pods have started, run:
kubectl --namespace=monitoring get pods -l "app.kubernetes.io/name=node-problem-detector,app.kubernetes.io/instance=node-problem-detector"
monitoring命名空间下面将会多一个名为`node-problem-detector`的控制器
# kubectl --namespace=monitoring get pods -l "app.kubernetes.io/name=node-problem-detector,app.kubernetes.io/instance=node-problem-detector"
NAME READY STATUS RESTARTS AGE
node-problem-detector-2shts 1/1 Running 0 45h
node-problem-detector-8h5zn 1/1 Running 0 45h
node-problem-detector-ftkwq 1/1 Running 0 45h
node-problem-detector-k5xvd 1/1 Running 0 45h
node-problem-detector-lh8df 1/1 Running 0 45h
node-problem-detector-lqpxm 1/1 Running 0 45h
node-problem-detector-lwlcc 1/1 Running 0 45h
node-problem-detector-m8nn9 1/1 Running 0 45h
node-problem-detector-mdlpg 1/1 Running 0 45h
node-problem-detector-nzbz4 1/1 Running 0 45h
node-problem-detector-pt6jz 1/1 Running 0 45h
node-problem-detector-sxfsc 1/1 Running 0 45h
node-problem-detector-tgjn8 1/1 Running 0 45h
node-problem-detector-tvhn6 1/1 Running 0 45h
node-problem-detector-vz4s4 1/1 Running 0 45h
node-problem-detector-xcw8x 1/1 Running 0 45h
node-problem-detector-z8glj 1/1 Running 0 45h
模拟内核事件触发
# echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg
此时使用kubectl可以看到对应的事件信息
# kubectl describe node xxx
...省略若干行...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...省略若干行...
Warning KernelOops 15m kernel-monitor kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING