Fixing K8s cluster nodes stuck in NotReady after a forced VM power-off corrupted the disk
admin
2024-05-14 03:16:54

Preface


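  • I ran into this in my own lab environment and am sharing how I solved it
  • If I have misunderstood something, corrections are welcome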

I wanted only to try to live in accord with the promptings which came from my true self. Why was that so very difficult? ------ Hermann Hesse, Demian


What problem did I run into

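Ha. When I left at noon I locked my keys inside, so I was in a hurry to get home and find a locksmith. I had to take my work NUC with me, so I force-powered off my own NUC. When I came back, the K8s cluster deployed on its VMs would not come up. It was not a total disaster, though: at least the master survived. The previous time I forced a shutdown, the cluster also failed to start, the etcd pod was dead as well, there was no backup, and in the end I had to reset the cluster with kubeadm.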

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    NotReady   <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

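Ha, some of the cluster nodes are NotReady, and the corresponding VMs will not boot either. Below is what the console prints when the VM drops straight into emergency mode at boot.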

[    9.800336] XFS (sda1): Metadata corruption detected at xfs_agf_read_verify+0x78/0x120 [xfs], xfs_agf block 0x4b00001
[    9.008356] XFS (sda1): Unmount and run xfs_repair
[    9.008376] XFS (sda1): First 64 bytes of corrupted metadata buffer:
[    9.808395] ffff88803610a400: 58 41 47 46 80 08 80 01 80 08 80 81 80 96 88 88  XAGF....
[    9.888415] ffff88803610a410: 80 88 80 81 80 88 88 82 80 88 80 88 80 88 80 81
[    9.808435] ffff88803610a420: 80 88 80 81 80 88 80 88 80 88 88 88 80 88 80 83
[    9.080454] ffff88003610a430: 00 80 00 84 00 8d d1 2d 00 77 c3 a3 00 80 08 88  ....-.w....
[    9.080515] XFS (sda1): metadata I/O error: block 0x4b00001 ("xfs_trans_read_buf_map") error 117 numblks 1
Generating "/run/initramfs/rdsosreport.txt"
Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot after mounting them and attach it to a bug report.
:/#

The disk is corrupted and needs to be repaired. Ha, what a trap.

How I fixed it

I looked up how to recover the disk. The steps:

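  1. Boot the VM and press e at the GRUB menu to edit the boot entry.
  2. Append rd.break to the end of the line that starts with linux16.
  3. Press Ctrl+X to boot into the emergency shell, then run xfs_repair -L /dev/sda1. Here sda1 is the corrupted partition named in the emergency-mode output above. Note that -L zeroes the XFS log, so metadata changes that existed only in the log are lost.
  4. Run reboot.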
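
The steps above go straight to xfs_repair -L. A slightly more cautious order is sketched below: pull the device name out of the console text, then print a read-only check and a normal repair before the log-zeroing run. The function names are hypothetical helpers of mine, not part of the original fix, and the second function only prints commands so that nothing touches the disk by accident.

```shell
# Hypothetical helpers, not from the original post.

# Read console or journal text on stdin and print each XFS device that the
# kernel asked to be repaired, for example "sda1".
xfs_devices_needing_repair() {
    sed -n 's/.*XFS *(\([^)]*\)): Unmount and run xfs_repair.*/\1/p' | sort -u
}

# Print (never run) a cautious repair sequence for one device.
xfs_repair_plan() {
    if [ -z "$1" ]; then
        echo "usage: xfs_repair_plan /dev/<partition>" >&2
        return 2
    fi
    echo "xfs_repair -n $1   # read-only check, modifies nothing"
    echo "xfs_repair $1      # normal repair; refuses to run on a dirty log"
    echo "xfs_repair -L $1   # last resort: zero the log, then repair"
}
```

For example, xfs_repair_plan /dev/sda1 prints the three commands in the order to try them. If the plain xfs_repair run complains about a dirty log and the filesystem cannot be mounted to replay it, the -L run from the original steps is what remains.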

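OK, I repaired the disks one by one, booted the VMs, and checked the nodes again. One node had recovered, but two nodes were still NotReady.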

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready      <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    NotReady   <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$
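
When this check has to be repeated after every repair, it can help to reduce the table to just the node names that still need attention. A minimal sketch, assuming the default kubectl get nodes column layout (NAME first, STATUS second); not_ready_nodes is a name I made up:

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the names of nodes whose STATUS does not start with "Ready".
# "Ready,SchedulingDisabled" (a cordoned but healthy node) counts as ready.
not_ready_nodes() {
    awk '$2 !~ /^Ready(,|$)/ { print $1 }'
}
```

Usage: kubectl get nodes --no-headers | not_ready_nodes. To block until everything is back instead, kubectl wait --for=condition=Ready node --all --timeout=120s does the polling for you.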

At first I suspected kubelet, but its service status and logs showed nothing wrong.

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since 二 2023-01-17 20:53:02 CST; 1min 18s ago
....

Then, in the cluster events, I found messages such as Is the docker daemon running? and Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get events | grep -i error
54m         Warning   Unhealthy                pod/calico-node-nfkzd                                 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
54m         Warning   Unhealthy                pod/calico-node-nfkzd                                 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
44m         Warning   FailedCreatePodSandBox   pod/calico-node-vxpxt                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "calico-node-vxpxt": Error response from daemon: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
44m         Warning   FailedCreatePodSandBox   pod/calico-node-vxpxt                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "calico-node-vxpxt": Error response from daemon: transport is closing: unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "kube-proxy-htg7t": Error response from daemon: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "kube-proxy-htg7t": Error response from daemon: transport is closing: unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "kube-proxy-htg7t": error during connect: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.41/containers/create?name=k8s_POD_kube-proxy-htg7t_kube-system_85fe510d-d713-4fe6-b852-dd1655d37fff_15": EOF
44m         Warning   FailedKillPod            pod/skooner-5b65f884f8-9cs4k                          error killing pod: failed to "KillPodSandbox" for "eb888be0-5f30-4620-a4a2-111f14bb092d" with KillPodSandboxError: "rpc error: code = Unknown desc = [networkPlugin cni failed to teardown pod \"skooner-5b65f884f8-9cs4k_kube-system\" network: error getting ClusterInformation: Get \"https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default\": dial tcp 10.96.0.1:443: connect: connection refused, Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?]"
┌──[root@vms81.liruilongs.github.io]-[~]
└─$
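
Grepping for "error" returns long lines that are hard to scan. One way to get an overview first is to count Warning events by reason. The sketch below assumes the default kubectl get events columns without -A (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); warning_reasons is a hypothetical name:

```shell
# Hypothetical helper: read `kubectl get events --no-headers` output on stdin
# and print "<count> <reason>" for Warning events, most frequent first.
warning_reasons() {
    awk '$2 == "Warning" { count[$3]++ }
         END { for (r in count) print count[r], r }' | sort -rn
}
```

Usage: kubectl get events --no-headers | warning_reasons. In the output above, FailedCreatePodSandBox dominating the list is what points away from kubelet and toward the container runtime on the node.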

Docker may not have started on some nodes, so I checked the docker status on a NotReady node.

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl  status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://docs.docker.com

1月 17 21:08:19 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
1月 17 21:08:19 vms82.liruilongs.github.io systemd[1]: Job docker.service/start failed with result 'dependency'.
1月 17 21:08:25 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
1月 17 21:08:25 vms82.liruilongs.github.io systemd[1]: Job docker.service/start failed with result 'dependency'.
1月 17 21:08:30 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
.......

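Docker indeed failed to start, and the message says one of its dependencies failed. Let's look at what docker.service depends on, that is, the services that have to start before docker.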

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl list-dependencies docker.service
docker.service
● ├─containerd.service
● ├─docker.socket
● ├─system.slice
● ├─basic.target
● │ ├─microcode.service
● │ ├─rhel-autorelabel-mark.service
● │ ├─rhel-autorelabel.service
● │ ├─rhel-configure.service
● │ ├─rhel-dmesg.service
● │ ├─rhel-loadmodules.service
● │ ├─selinux-policy-migrate-local-changes@targeted.service
● │ ├─paths.target
● │ ├─slices.target
● │ │ ├─-.slice
● │ │ └─system.slice
● │ ├─sockets.target
............................
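
Rather than reading the whole tree, it is often enough to ask systemd for the state of the few units that sit between the node and Ready. A minimal sketch; unit_states is a hypothetical helper, and the unit list in the usage line is my assumption for a docker-based node like the ones in this post:

```shell
# Hypothetical helper: print "<unit> <state>" for each unit given, in order.
# `systemctl is-active` prints the state (active, inactive, failed,
# activating, ...) and exits non-zero when the unit is not active.
unit_states() {
    local unit
    for unit in "$@"; do
        printf '%s %s\n' "$unit" "$(systemctl is-active "$unit" 2>/dev/null)"
    done
}
```

Usage: unit_states containerd docker kubelet. The first unit in the list that is not active is the one to inspect with systemctl status and journalctl -u.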

Next, look at the first dependency, containerd.service. It had not started successfully either.

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since 二 2023-01-17 21:14:58 CST; 4s ago
     Docs: https://containerd.io
  Process: 6494 ExecStart=/usr/bin/containerd (code=exited, status=2)
  Process: 6491 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 6494 (code=exited, status=2)

1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: Failed to start containerd container runtime.
1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: Unit containerd.service entered failed state.
1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: containerd.service failed.
┌──[root@vms82.liruilongs.github.io]-[~]
└─$

There is no further detail, only that the start failed. Let's try restarting it.

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl restart containerd.service
Job for containerd.service failed because the control process exited with error code. See "systemctl status containerd.service" and "journalctl -xe" for details.

Check the containerd service logs, starting with the lines that mention error.

┌──[root@vms82.liruilongs.github.io]-[~]
└─$journalctl -u  containerd | grep -i error -m 3
1月 17 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.203387028+08:00" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found.\\n\"): skip plugin" type=io.containerd.snapshotter.v1
1月 17 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.203699262+08:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
1月 17 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.204050775+08:00" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
┌──[root@vms82.liruilongs.github.io]-[~]
└─$
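
All three lines matched by grep -i error above are level=info or level=warning entries that merely carry an error= field. A narrower filter, sketched below, keeps only lines whose log level is error or worse. It assumes containerd's usual level=<name> log format as seen in the output above; containerd_errors is a hypothetical name:

```shell
# Hypothetical helper: read containerd journal lines on stdin and keep only
# those logged at level error, fatal or panic.
containerd_errors() {
    awk '/level=(error|fatal|panic)/'
}
```

Usage: journalctl -u containerd -b --no-pager | containerd_errors | tail -n 20. If this prints nothing, the reason for the exit may be in the last plain lines of journalctl -u containerd instead.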

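From the logs we get the messages below. I guessed they might be a result of the disk corruption, so here I back up the /var/lib/containerd/ directory and then try deleting its contents. Note that these lines are logged at info or warning level, so this was a hunch rather than proof of the cause.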

aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found.
path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: sk

Delete everything under the directory

┌──[root@vms82.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
io.containerd.content.v1.content/       io.containerd.runtime.v1.linux/         io.containerd.snapshotter.v1.native/    tmpmounts/
io.containerd.metadata.v1.bolt/         io.containerd.runtime.v2.task/          io.containerd.snapshotter.v1.overlayfs/
┌──[root@vms82.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$rm -rf *
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$ls
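
The transcript above removes the files outright. A variant that keeps the old state around until the node is confirmed healthy is sketched below: it renames the directory and recreates it empty. set_aside_dir and the .bak.<timestamp> suffix are my own choices, not something containerd defines.

```shell
# Hypothetical helper: move a data directory aside and recreate it empty.
# Prints the path of the backup on success.
set_aside_dir() {
    local dir backup
    dir="${1%/}"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
    # Rename is atomic on the same filesystem; the new directory starts empty.
    mv "$dir" "$backup" && mkdir -p "$dir" && echo "$backup"
}
```

Usage, with containerd stopped first: systemctl stop containerd, then set_aside_dir /var/lib/containerd. As the transcript shows, containerd recreates its subdirectories when it starts on an empty directory, and the backup can be deleted once the node is back to Ready.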

After deleting, try starting containerd again

┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl start containerd
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl status containerd
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since 二 2023-01-17 21:25:13 CST; 51s ago
     Docs: https://containerd.io
  Process: 8180 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 8182 (containerd)
   Memory: 146.8M
...........

OK, it started successfully, and at this point the node is back to Ready as well.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready      <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

Repeat the same steps on the remaining nodes.

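On 192.168.26.155, docker had also failed to start, but the problem was slightly different: after the same operation the service did not restart on its own, and its log contains error-level entries.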

┌──[root@vms81.liruilongs.github.io]-[~]
└─$ssh root@192.168.26.155
Last login: Mon Jan 16 02:26:43 2023 from 192.168.26.81
┌──[root@vms155.liruilongs.github.io]-[~]
└─$systemctl is-active  docker
failed
┌──[root@vms155.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$rm -rf *
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl start containerd
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl is-active  docker
failed
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since 二 2023-01-17 20:20:03 CST; 1h 31min ago
     Docs: https://docs.docker.com
  Process: 2030 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
 Main PID: 2030 (code=exited, status=0/SUCCESS)

1月 17 20:20:02 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:02.796621853+08:00" level=error msg="712fd90a1962d0f546eaf6c9db05c2577ac9855b38f9f41e37724402f10d3045 cleanup: failed to de...
1月 17 20:20:02 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:02.796669296+08:00" level=error msg="Handler for POST /v1.41/containers/712fd90a1962d0f546eaf6c9db05c2577ac9855b38f9f41e377...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.285529266+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containe...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.783878143+08:00" level=info msg="Processing signal 'terminated'"
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Stopping Docker Application Container Engine...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.784550238+08:00" level=info msg="Daemon shutdown complete"
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: start request repeated too quickly for docker.service
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Failed to start Docker Application Container Engine.
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Unit docker.service entered failed state.
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: docker.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

The service did not restart automatically, so try restarting it manually.

┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl restart docker
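
The status above shows Result: start-limit, meaning systemd gave up after too many quick restart attempts. Here a plain restart was enough, presumably because the failure was already an hour and a half old. If a start is refused right after such a failure, clearing the failed state first may help. A minimal sketch under that assumption; restart_failed_unit is a hypothetical name:

```shell
# Hypothetical helper: clear a unit's failed state (which, per the systemctl
# documentation, also resets its start rate-limit counter), start it again,
# and print the resulting state.
restart_failed_unit() {
    local unit
    unit="$1"
    systemctl reset-failed "$unit" &&
        systemctl start "$unit" &&
        systemctl is-active "$unit"
}
```

Usage: restart_failed_unit docker, after containerd is confirmed active, since docker.service depends on it.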

Check the node status: all nodes are Ready.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS   ROLES                  AGE    VERSION
vms155.liruilongs.github.io   Ready    <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready    <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready    control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    Ready    <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready    <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

References


https://blog.csdn.net/qq_35022803/article/details/109287086
