Kubernetes监控插件-NPD

Node Problem Detector(NPD)是Kubernetes集群中一个重要的监控插件,主要作用是监控节点的健康状况并检测可能出现的问题。如NTP、文件系统损坏、内核死锁、CPU、内存或磁盘损坏、运行时守护进程无响应等。

部署

使用chart安装

# helm -n monitoring install node-problem-detector https://dl.vqiu.cn/helm_chart/node-problem-detector/node-problem-detector-2.3.14.tgz
NOTES:
To verify that the node-problem-detector pods have started, run:

  kubectl --namespace=monitoring get pods -l "app.kubernetes.io/name=node-problem-detector,app.kubernetes.io/instance=node-problem-detector"

monitoring命名空间下面将会多一个名为`node-problem-detector`的控制器

# kubectl --namespace=monitoring get pods -l "app.kubernetes.io/name=node-problem-detector,app.kubernetes.io/instance=node-problem-detector"
NAME                          READY   STATUS    RESTARTS   AGE
node-problem-detector-2shts   1/1     Running   0          45h
node-problem-detector-8h5zn   1/1     Running   0          45h
node-problem-detector-ftkwq   1/1     Running   0          45h
node-problem-detector-k5xvd   1/1     Running   0          45h
node-problem-detector-lh8df   1/1     Running   0          45h
node-problem-detector-lqpxm   1/1     Running   0          45h
node-problem-detector-lwlcc   1/1     Running   0          45h
node-problem-detector-m8nn9   1/1     Running   0          45h
node-problem-detector-mdlpg   1/1     Running   0          45h
node-problem-detector-nzbz4   1/1     Running   0          45h
node-problem-detector-pt6jz   1/1     Running   0          45h
node-problem-detector-sxfsc   1/1     Running   0          45h
node-problem-detector-tgjn8   1/1     Running   0          45h
node-problem-detector-tvhn6   1/1     Running   0          45h
node-problem-detector-vz4s4   1/1     Running   0          45h
node-problem-detector-xcw8x   1/1     Running   0          45h
node-problem-detector-z8glj   1/1     Running   0          45h

模拟内核事件触发

# echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg

此时使用kubectl可以看到对应的事件信息

 # kubectl describe node xxx
 ...省略若干行...
Events:
  Type     Reason                   Age   From             Message
  ----     ------                   ----  ----             -------
 ...省略若干行...
  Warning  KernelOops               15m   kernel-monitor   kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING

参考引用