How to fix the k8s PGO operator network issue and CrashLoopBackOff?

I am trying to install the PGO operator by following this documentation. When I run this command:

kubectl apply --server-side -k kustomize/install/default

my Pod runs, but it soon hits a CrashLoopBackOff.

What I have done
I checked the Pod's logs with this command:

kubectl logs pgo-98c6b8888-fz8zj -n postgres-operator

Result

time="2023-01-09T07:50:56Z" level=debug msg="debug flag set to true" version=5.3.0-0
time="2023-01-09T07:51:26Z" level=error msg="Failed to get API Group-Resources" error="Get \"https://10.96.0.1:443/api?timeout=32s\": dial tcp 10.96.0.1:443: i/o timeout" version=5.3.0-0
panic: Get "https://10.96.0.1:443/api?timeout=32s": dial tcp 10.96.0.1:443: i/o timeout

goroutine 1 [running]:
main.assertNoError(...)
        github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:42
main.main()
        github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:84 +0x465
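
These standard kubectl commands can also surface more detail about a crash-looping Pod (same Pod name as above; --previous prints the logs of the last terminated container):

# show events, restart reasons, and container state transitions
kubectl describe pod pgo-98c6b8888-fz8zj -n postgres-operator

# logs from the previous (crashed) container instance
kubectl logs --previous pgo-98c6b8888-fz8zj -n postgres-operator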

To check the network connection to the host, I ran this command:

wget https://10.96.0.1:443/api

The result is:

--2023-01-09 09:49:30--  https://10.96.0.1/api
Connecting to 10.96.0.1:443... connected.
ERROR: cannot verify 10.96.0.1's certificate, issued by ‘CN=kubernetes’:
  Unable to locally verify the issuer's authority.
To connect to 10.96.0.1 insecurely, use `--no-check-certificate'.

As you can see, it connected to the API.
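
To separate a TLS trust problem from a connectivity problem, the API can also be queried with the Pod's service-account credentials (a minimal sketch, assuming it runs inside a Pod with the default service-account mount and that curl is available in the container):

# default service-account mount paths in Kubernetes Pods
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
     -H "Authorization: Bearer $TOKEN" \
     https://10.96.0.1:443/api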

A strange issue that might be relevant

I ran kubectl get pods --all-namespaces and saw this output:

NAMESPACE           NAME                                                READY   STATUS             RESTARTS         AGE
kube-flannel        kube-flannel-ds-9gmmq                               1/1     Running            0                3d16h
kube-flannel        kube-flannel-ds-rcq8l                               0/1     CrashLoopBackOff   10 (3m15s ago)   34m
kube-flannel        kube-flannel-ds-rqwtj                               0/1     CrashLoopBackOff   10 (2m53s ago)   34m
kube-system         etcd-masterk8s-virtual-machine                      1/1     Running            1 (5d ago)       3d17h
kube-system         kube-apiserver-masterk8s-virtual-machine            1/1     Running            2 (5d ago)       3d17h
kube-system         kube-controller-manager-masterk8s-virtual-machine   1/1     Running            8 (2d ago)       3d17h
kube-system         kube-scheduler-masterk8s-virtual-machine            1/1     Running            7 (5d ago)       3d17h
postgres-operator   pgo-98c6b8888-fz8zj                                 0/1     CrashLoopBackOff   7 (4m59s ago)    20m

As you can see, two of my kube-flannel Pods are also in CrashLoopBackOff while one is running. I am not sure whether this is the main cause of the problem.
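
The flannel Pods' own logs may show why they crash (standard kubectl, using the Pod names from the listing above):

kubectl logs -n kube-flannel kube-flannel-ds-rcq8l

kubectl describe pod -n kube-flannel kube-flannel-ds-rcq8l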

What I want
I want the PGO Pod to run successfully with no errors.

How you can help me
Please help me find the issue, or suggest any other way to get detailed logs. I am not able to find the root cause of this problem because:
If it was a network issue, then why did the connection succeed?
If it is something else, how can I find more information?

Update and new errors after applying the fixes:

time="2023-01-09T11:57:47Z" level=debug msg="debug flag set to true" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Metrics server is starting to listen" addr=":8080" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="upgrade checking enabled" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="starting controller runtime manager and will wait for signal to exit" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting server" addr="[::]:8080" kind=metrics path=/metrics version=5.3.0-0
time="2023-01-09T11:57:47Z" level=debug msg="upgrade check issue: namespace not set" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1beta1.PostgresCluster" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.ConfigMap" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Endpoints" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.PersistentVolumeClaim" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Secret" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Service" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.ServiceAccount" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Deployment" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.StatefulSet" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Job" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Role" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.RoleBinding" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.CronJob" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.PodDisruptionBudget" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Pod" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.StatefulSet" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting Controller" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster version=5.3.0-0
W0109 11:57:48.006419       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:48.006642       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
time="2023-01-09T11:57:49Z" level=info msg="{\"pgo_versions\":[{\"tag\":\"v5.1.0\"},{\"tag\":\"v5.0.5\"},{\"tag\":\"v5.0.4\"},{\"tag\":\"v5.0.3\"},{\"tag\":\"v5.0.2\"},{\"tag\":\"v5.0.1\"},{\"tag\":\"v5.0.0\"}]}" X-Crunchy-Client-Metadata="{\"deployment_id\":\"288f4766-8617-479b-837f-2ee59ce2049a\",\"kubernetes_env\":\"v1.26.0\",\"pgo_clusters_total\":0,\"pgo_version\":\"5.3.0-0\",\"is_open_shift\":false}" version=5.3.0-0
W0109 11:57:49.163062       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:49.163119       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:57:51.404639       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:51.404811       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:57:54.749751       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:54.750068       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:58:06.015650       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:58:06.015710       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:58:25.355009       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:58:25.355391       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:59:10.447123       1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:59:10.447490       1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
time="2023-01-09T11:59:47Z" level=error msg="Could not wait for Cache to sync" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster error="failed to wait for postgrescluster caches to sync: timed out waiting for cache to be synced" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for non leader election runnables" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for leader election runnables" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for caches" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=error msg="failed to get informer from cache" error="Timeout: failed waiting for *v1.PodDisruptionBudget Informer to sync" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=error msg="error received after stop sequence was engaged" error="context canceled" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for webhooks" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Wait completed, proceeding to shutdown the manager" version=5.3.0-0
panic: failed to wait for postgrescluster caches to sync: timed out waiting for cache to be synced
goroutine 1 [running]:
main.assertNoError(...)
        github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:42
main.main()
        github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:118 +0x434
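
The forbidden errors above can be confirmed with an RBAC impersonation check (standard kubectl auth can-i; the service account and resource names are taken from the log lines):

kubectl auth can-i list poddisruptionbudgets.policy \
    --as=system:serviceaccount:postgres-operator:pgo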

Answer 1

Score: 1

If this is a new deployment, I suggest using v5.

That said, since PGO manages the networking for Postgres clusters (and, as such, manages listen_addresses), there's no reason to modify the listen_addresses configuration parameter. If you need to manage networking or network access, you can do that by setting the pg_hba configuration or by using NetworkPolicies; a sketch follows below.
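
A minimal sketch of the NetworkPolicy approach (the cluster name hippo and the app label are hypothetical placeholders; postgres-operator.crunchydata.com/cluster is the label PGO applies to cluster Pods — adjust the selectors to your environment):

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-to-postgres
  namespace: postgres-operator
spec:
  podSelector:
    matchLabels:
      # "hippo" is a hypothetical PostgresCluster name -- adjust
      postgres-operator.crunchydata.com/cluster: hippo
  ingress:
    - from:
        - podSelector:
            matchLabels:
              # hypothetical client application label -- adjust
              app: my-app
      ports:
        - protocol: TCP
          port: 5432
EOF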

Please go through Custom 'listen_addresses' not applied #2904 for more information.

CrashLoopBackOff: check the Pod logs for configuration or deployment issues such as missing dependencies (for example, Kubernetes doesn't support docker-compose's depends_on, so we now run Kubernetes + Docker without nginx), and also check for Pods being OOM-killed or using excessive resources; see the checks below.
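
A few standard checks for that (kubectl top needs metrics-server installed; the jsonpath pulls the last terminated container state, which shows OOMKilled if that was the reason):

kubectl get pod pgo-98c6b8888-fz8zj -n postgres-operator \
    -o jsonpath='{.status.containerStatuses[0].lastState}'

kubectl top pod -n postgres-operator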

Also check for timeout issues; this lab on a timeout problem may help.

ERROR: cannot verify 10.96.0.1's certificate, issued by ‘CN=kubernetes’:
Unable to locally verify the issuer's authority.
To connect to 10.96.0.1 insecurely, use `--no-check-certificate'.

To try to fix the error above (a command sketch follows below):
First, remove the flannel.1 IP link on every host that has this problem.

Second, delete the kube-flannel-ds DaemonSet from the cluster.

Last, recreate kube-flannel-ds; the flannel.1 IP link will be recreated and things should return to normal.

(For flannel to work correctly, you must pass --pod-network-cidr=10.244.0.0/16 to kubeadm init, i.e. change the pod CIDR.)
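
A sketch of those steps (the manifest URL is flannel's upstream release manifest; substitute whatever manifest you originally deployed):

# run as root on every affected node
sudo ip link delete flannel.1

# delete and recreate the flannel DaemonSet; the flannel.1 link
# is recreated when the Pods come back up
kubectl delete -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml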

Edit:

Please check this similar issue and its solution, which may help resolve your issue.
