
Commit 2edb43a

feat: update docs content
1 parent 7dc02ea commit 2edb43a

File tree

1 file changed (+27, -39 lines)
  • i18n/zh/docusaurus-plugin-content-docs/current/03-getting-started/02-install-lower-layer-system


i18n/zh/docusaurus-plugin-content-docs/current/03-getting-started/02-install-lower-layer-system/08-faqs.md

Lines changed: 27 additions & 39 deletions
@@ -8,21 +8,20 @@ custom_edit_url: null

## Issue 1: kube-proxy reports an iptables problem

-```
+```bash
E0627 09:28:54.054930 1 proxier.go:1598] Failed to execute iptables-restore: exit status 1 (iptables-restore: line 86 failed ) I0627 09:28:54.054962 1 proxier.go:879] Sync failed; retrying in 30s
```

-**Fix:** just flush iptables
-
-```
+**Fix:** just flush iptables:
+```bash
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
```

## Issue 2: calico and coredns stay stuck in initializing

-`kubectl describe <podname>` should show a failed error, roughly related to the network and the sandbox
+`kubectl describe <podname>` should show a failed error, roughly related to the network and the sandbox

-```
+```bash
Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "7f5b66ebecdfc2c206027a2afcb9d1a58ec5db1a6a10a91d4d60c0079236e401" network for pod "calico-kube-controllers-577f77cb5c-99t8z": networkPlugin cni failed to set up pod "calico-kube-controllers-577f77cb5c-99t8z_kube-system" network: error getting ClusterInformation: Get "https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0. 1:443: i/o timeout, failed to clean up sandbox container "7f5b66ebecdfc2c206027a2afcb9d1a58ec5db1a6a10a91d4d60c0079236e401" network for pod "calico-kube-controllers-577f77cb5c-99t8z": networkPlugin cni failed to teardown pod "calico-kube-controllers-577f77cb5c-99t8z_kube-system" network: error getting ClusterInformation: Get "https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0. 1:443: i/o timeout]
```

@@ -37,21 +36,20 @@ rm -rf /etc/cni/net.d/

## Issue 3: metrics-server never comes up successfully

-**Cause:** the taint on the master has not been removed
+**Cause:** the taint on the master has not been removed

**Fix:**
```bash
kubectl taint nodes --all node-role.kubernetes.io/master node-role.kubernetes.io/master-
```

-
## Issue 4: 10002 already in use

-`journalctl -u cloudcore.service -xe` shows xxx already in use
+`journalctl -u cloudcore.service -xe` shows `xxx already in use`

-**Cause:** most likely leftovers from a previous deployment were not cleaned up
+**Cause:** most likely leftovers from a previous deployment were not cleaned up

-**Fix:** find the process occupying the port and kill it
+**Fix:** find the process occupying the port and kill it
```bash
lsof -i:xxxx
kill xxxxx
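# For the "10002 already in use" case in this entry, for example (PID is illustrative):
#   lsof -i:10002      # note the PID of the process listening on 10002
#   kill <PID>         # <PID> is the value printed by lsof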
@@ -64,7 +62,7 @@ execute keadm command failed: failed to exec 'bash -c sudo ln /etc/kubeedge/edg
, err: exit status 1
```

-The symbolic link cannot be created because the target path already exists; this usually means `edgecore.service` is already present in `/etc/systemd/system/`
+The symbolic link cannot be created because the target path already exists; this usually means `edgecore.service` is already present in `/etc/systemd/system/`

**Fix:**
```bash
@@ -78,12 +76,11 @@ sudo rm /etc/systemd/system/edgecore.service
12月 14 23:02:23 cloud.kubeedge cloudcore[196229]: TLSStreamCertFile: Invalid value: "/etc/kubeedge/certs/stream.crt": TLSStreamCertFile not exist
12月 14 23:02:23 cloud.kubeedge cloudcore[196229]: TLSStreamCAFile: Invalid value: "/etc/kubeedge/ca/streamCA.crt": TLSStreamCAFile not exist
12月 14 23:02:23 cloud.kubeedge cloudcore[196229]: ]
-
```

**Fix:**

-Check whether `certgen.sh` exists under `/etc/kubeedge`, then run `bash certgen.sh stream`
+Check whether `certgen.sh` exists under `/etc/kubeedge`, then run `bash certgen.sh stream`
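
A minimal sketch of that certificate fix, assuming the default KubeEdge layout; the `CLOUDCOREIPS` value is an assumption and must be set to your own cloudcore node IP:

```bash
cd /etc/kubeedge
ls certgen.sh                          # confirm the script is present
export CLOUDCOREIPS="<cloudcore-ip>"   # assumed: the advertise IP(s) of cloudcore
bash certgen.sh stream                 # regenerates stream.crt / streamCA.crt
systemctl restart cloudcore            # restart so cloudcore picks up the new certs
```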

## Issue 7: the edgemesh logs show edge-to-edge connections succeed, but cloud-to-edge cannot connect

@@ -97,7 +94,7 @@ sudo rm /etc/systemd/system/edgecore.service

1. First of all, the edgemesh-agent on every node has a peer ID, for example

-```bash
+```
edge2:
I'm {12D3KooWPpY4GqqNF3sLC397fMz5ZZfxmtMTNa1gLYFopWbHxZDt: [/ip4/127.0.0.1/tcp/20006 /ip4/192.168.1.4/tcp/20006]}
@@ -109,35 +106,28 @@ a. peer ID是根据节点名称哈希出来的,相同的节点名称会哈希
b. Also, the node name is not the server hostname but the k8s node name; check it with kubectl get nodes
```

-2.If the accessing node and the accessed node are inside the same LAN (**all nodes should have intranet IPs (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)**), see **Issue 12** of [全网最全EdgeMesh Q&A手册 - 知乎 (zhihu.com)](https://zhuanlan.zhihu.com/p/585749690). When edgemesh-agents inside the same LAN discover each other, the log line is `[MDNS] Discovery found peer: <peer ID of the accessed side: [IP list of the accessed side (may include relay node IPs)]>` ^d3939d
+2. If the accessing node and the accessed node are inside the same LAN (**all nodes should have intranet IPs (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)**), see **Issue 12** of [全网最全EdgeMesh Q&A手册 - 知乎 (zhihu.com)](https://zhuanlan.zhihu.com/p/585749690). When edgemesh-agents inside the same LAN discover each other, the log line is `[MDNS] Discovery found peer: <peer ID of the accessed side: [IP list of the accessed side (may include relay node IPs)]>`

-3.If the accessing node and the accessed node are in different subnets, check whether relayNodes is configured correctly and why the relay node cannot help the two nodes exchange peer information. For details, read [KubeEdge EdgeMesh 高可用架构详解](https://link.zhihu.com/?target=https%3A//mp.weixin.qq.com/s/4whnkMM9oOaWRsI1ICsvSA). When edgemesh-agents across subnets discover each other, the log line is `[DHT] Discovery found peer: <peer ID of the accessed side: [IP list of the accessed side (may include relay node IPs)]>` (this was my case)
+3. If the accessing node and the accessed node are in different subnets, check whether relayNodes is configured correctly and why the relay node cannot help the two nodes exchange peer information. For details, read [KubeEdge EdgeMesh 高可用架构详解](https://link.zhihu.com/?target=https%3A//mp.weixin.qq.com/s/4whnkMM9oOaWRsI1ICsvSA). When edgemesh-agents across subnets discover each other, the log line is `[DHT] Discovery found peer: <peer ID of the accessed side: [IP list of the accessed side (may include relay node IPs)]>`

**Fix:**

-When deploying edgemesh with `kubectl apply -f build/agent/resources/`, edit 04-configmap and add a relayNode (the root cause is that the "accessing node and accessed node in the same LAN" condition is not met, so a relayNode has to be added)
+When deploying edgemesh with `kubectl apply -f build/agent/resources/`, edit 04-configmap and add a relayNode (the root cause is that the "accessing node and accessed node in the same LAN" condition is not met, so a relayNode has to be added)

![Q7](/img/FAQs/Q7.png)
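
A minimal sketch of that relayNode change, for illustration only: the ConfigMap name, namespace, and field layout follow the EdgeMesh high-availability docs and may differ by EdgeMesh version, and the relay node name/IP are placeholders.

```bash
# Edit build/agent/resources/04-configmap.yaml before applying (or the live ConfigMap)
# and add a relayNodes entry along these lines:
#
#   relayNodes:
#   - nodeName: <node-reachable-from-both-subnets>   # placeholder relay node name
#     advertiseAddress:
#     - <ip-reachable-by-all-nodes>                  # placeholder relay node IP
#
kubectl -n kubeedge edit configmap edgemesh-agent-cfg
# restart the agents so the new relay configuration is loaded
kubectl -n kubeedge rollout restart daemonset edgemesh-agent
```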

## Issue 8: the master has a GPU but no GPU resources are found

-This mainly targets the server case; you can use `nvidia-smi` to check the graphics card
+This mainly targets GPU usage on a server; you can use `nvidia-smi` to check the server's graphics card

-GPU support needs to be configured
+GPU support for containers needs to be configured

->[!quote]
->[Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit 1.14.3 documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuration)
+> See the following link:
+> [Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit 1.14.3 documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuration)

-Configure as described above; then you also need to `vim /etc/docker/daemon.json` and add default-runtime. Following only the quoted steps sets "runtimes" but not default-runtime, which may lead to GPU resources not being found
-
-```
+Configure as described above; then you also need to `vim /etc/docker/daemon.json` and add default-runtime:
+```json
{
-    "exec-opts": [
-        "native.cgroupdriver=systemd"
-    ],
-    "registry-mirrors": [
-        "https://b9pmyelo.mirror.aliyuncs.com"
-    ],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
@@ -146,7 +136,6 @@ b. 另外,节点名称不是服务器名称,是k8s node name,请用kubectl
        }
    }
}
-
```
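
After changing `/etc/docker/daemon.json`, Docker has to be restarted for the new default runtime to take effect; a minimal sketch, assuming Docker runs under systemd:

```bash
sudo systemctl restart docker
# "Default Runtime: nvidia" should now show up here
docker info | grep -i "default runtime"
```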

## Issue 9: the Jetson has a GPU but no GPU resources are found
@@ -163,25 +152,24 @@ b. 另外,节点名称不是服务器名称,是k8s node name,请用kubectl
2024/01/04 07:43:58 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2024/01/04 07:43:58 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2024/01/04 07:43:58 No devices found. Waiting indefinitely.
-
```

![Q9-1](/img/FAQs/Q9-1.png)

```bash
-$ dpkg -l '*nvidia*'
+dpkg -l '*nvidia*'
```

![Q9-2](/img/FAQs/Q9-2.png)

![Q9-3](/img/FAQs/Q9-3.png)

->[!quote]
+>
>[Plug in does not detect Tegra device Jetson Nano · Issue #377 · NVIDIA/k8s-device-plugin (github.com)](https://github.com/NVIDIA/k8s-device-plugin/issues/377)
>
->Note that looking at the initial logs that you provided you may have been using `v1.7.0` of the NVIDIA Container Toolkit. This is quite an old version and we greatly improved our support for Tegra-based systems with the `v1.10.0` release. It should also be noted that in order to use the GPU Device Plugin on Tegra-based systems (specifically targetting the integrated GPUs) at least `v1.11.0` of the NVIDIA Container Toolkit is required.
+> Note that looking at the initial logs that you provided you may have been using `v1.7.0` of the NVIDIA Container Toolkit. This is quite an old version and we greatly improved our support for Tegra-based systems with the `v1.10.0` release. It should also be noted that in order to use the GPU Device Plugin on Tegra-based systems (specifically targetting the integrated GPUs) at least `v1.11.0` of the NVIDIA Container Toolkit is required.
>
->There are no Tegra-specific changes in the `v1.12.0` release, so using the `v1.11.0` release should be sufficient in this case.
+> There are no Tegra-specific changes in the `v1.12.0` release, so using the `v1.11.0` release should be sufficient in this case.

So the **NVIDIA Container Toolkit** should be upgraded
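
A minimal sketch of that upgrade on a Debian/Ubuntu-based Jetson system, assuming the NVIDIA apt repository is already set up as described in the install guide linked under Issue 8:

```bash
# Upgrade the NVIDIA Container Toolkit (>= v1.11.0 is needed for Tegra integrated GPUs)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Check the installed version
dpkg -l '*nvidia-container*'
# Restart the container runtime so the new toolkit is used
sudo systemctl restart docker
```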

@@ -573,7 +561,7 @@ kill xxxxx
When deploying KubeEdge, the port exposed in the metrics-service parameters is automatically overridden to port 10250,
matching the port the actual service later listens on in the components.yaml file. Alternatively, manually changing the port in the parameters to 10250 also works.

-### Issue 24: 169.254.96.16:53: i/o timeout
+## Issue 24: 169.254.96.16:53: i/o timeout

When a new node joins the cluster, KubeEdge components such as edgemesh and sedna are deployed automatically. Checking the lc (local controller) log shows the error

@@ -598,7 +586,7 @@ client tries to connect global manager(address: gm.sedna:9000) failed, error: di

**Fix:** after completing the configuration, just restart docker and edgecore.
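
A minimal sketch of that restart on the edge node, assuming both docker and edgecore run as systemd services:

```bash
sudo systemctl restart docker
sudo systemctl restart edgecore
# confirm edgecore came back up cleanly
systemctl status edgecore --no-pager
```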

-### Issue 25: keadm join fails on the edge side
+## Issue 25: keadm join fails on the edge side

Running keadm join on the edge side reports an error.