How to fix k8s PGO operator network issue and crash loop-back?
I am trying to install the PGO operator by following the Docs. When I run this command:
kubectl apply --server-side -k kustomize/install/default
my Pod runs, and soon it goes into CrashLoopBackOff.
What I have done
I checked the Pod's logs with this command:
kubectl logs pgo-98c6b8888-fz8zj -n postgres-operator
Result
time="2023-01-09T07:50:56Z" level=debug msg="debug flag set to true" version=5.3.0-0
time="2023-01-09T07:51:26Z" level=error msg="Failed to get API Group-Resources" error="Get \"https://10.96.0.1:443/api?timeout=32s\": dial tcp 10.96.0.1:443: i/o timeout" version=5.3.0-0
panic: Get "https://10.96.0.1:443/api?timeout=32s": dial tcp 10.96.0.1:443: i/o timeout
goroutine 1 [running]:
main.assertNoError(...)
github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:42
main.main()
github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:84 +0x465
To check the network connection to the host, I ran this command:
wget https://10.96.0.1:443/api
The result is:
--2023-01-09 09:49:30-- https://10.96.0.1/api
Connecting to 10.96.0.1:443... connected.
ERROR: cannot verify 10.96.0.1's certificate, issued by ‘CN=kubernetes’:
Unable to locally verify the issuer's authority.
To connect to 10.96.0.1 insecurely, use `--no-check-certificate'.
As you can see, it connected to the API.
A strange issue that might be useful
I ran kubectl get pods --all-namespaces
and saw this output:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-flannel kube-flannel-ds-9gmmq 1/1 Running 0 3d16h
kube-flannel kube-flannel-ds-rcq8l 0/1 CrashLoopBackOff 10 (3m15s ago) 34m
kube-flannel kube-flannel-ds-rqwtj 0/1 CrashLoopBackOff 10 (2m53s ago) 34m
kube-system etcd-masterk8s-virtual-machine 1/1 Running 1 (5d ago) 3d17h
kube-system kube-apiserver-masterk8s-virtual-machine 1/1 Running 2 (5d ago) 3d17h
kube-system kube-controller-manager-masterk8s-virtual-machine 1/1 Running 8 (2d ago) 3d17h
kube-system kube-scheduler-masterk8s-virtual-machine 1/1 Running 7 (5d ago) 3d17h
postgres-operator pgo-98c6b8888-fz8zj 0/1 CrashLoopBackOff 7 (4m59s ago) 20m
As you can see, two of my kube-flannel Pods are also in CrashLoopBackOff while one is running. I am not sure whether this is the main cause of the problem.
What do I want?
I want to run the PGO Pod successfully, with no errors.
How can you help me?
Please help me find the issue, or suggest any other way to get detailed logs. I am not able to find the root cause of this problem because:
If it was a network issue, then why did the connection succeed?
If it is something else, then how can I find more information?
Update and new errors after applying the fixes:
time="2023-01-09T11:57:47Z" level=debug msg="debug flag set to true" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Metrics server is starting to listen" addr=":8080" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="upgrade checking enabled" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="starting controller runtime manager and will wait for signal to exit" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting server" addr="[::]:8080" kind=metrics path=/metrics version=5.3.0-0
time="2023-01-09T11:57:47Z" level=debug msg="upgrade check issue: namespace not set" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1beta1.PostgresCluster" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.ConfigMap" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Endpoints" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.PersistentVolumeClaim" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Secret" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Service" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.ServiceAccount" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Deployment" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.StatefulSet" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Job" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Role" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.RoleBinding" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.CronJob" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.PodDisruptionBudget" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.Pod" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting EventSource" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster source="kind source: *v1.StatefulSet" version=5.3.0-0
time="2023-01-09T11:57:47Z" level=info msg="Starting Controller" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster version=5.3.0-0
W0109 11:57:48.006419 1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:48.006642 1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
time="2023-01-09T11:57:49Z" level=info msg="{\"pgo_versions\":[{\"tag\":\"v5.1.0\"},{\"tag\":\"v5.0.5\"},{\"tag\":\"v5.0.4\"},{\"tag\":\"v5.0.3\"},{\"tag\":\"v5.0.2\"},{\"tag\":\"v5.0.1\"},{\"tag\":\"v5.0.0\"}]}" X-Crunchy-Client-Metadata="{\"deployment_id\":\"288f4766-8617-479b-837f-2ee59ce2049a\",\"kubernetes_env\":\"v1.26.0\",\"pgo_clusters_total\":0,\"pgo_version\":\"5.3.0-0\",\"is_open_shift\":false}" version=5.3.0-0
W0109 11:57:49.163062 1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:49.163119 1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:57:51.404639 1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:51.404811 1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:57:54.749751 1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:57:54.750068 1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:58:06.015650 1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:58:06.015710 1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:58:25.355009 1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:58:25.355391 1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
W0109 11:59:10.447123 1 reflector.go:324] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0109 11:59:10.447490 1 reflector.go:138] k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:postgres-operator:pgo" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
time="2023-01-09T11:59:47Z" level=error msg="Could not wait for Cache to sync" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster error="failed to wait for postgrescluster caches to sync: timed out waiting for cache to be synced" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for non leader election runnables" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for leader election runnables" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for caches" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=error msg="failed to get informer from cache" error="Timeout: failed waiting for *v1.PodDisruptionBudget Informer to sync" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=error msg="error received after stop sequence was engaged" error="context canceled" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Stopping and waiting for webhooks" version=5.3.0-0
time="2023-01-09T11:59:47Z" level=info msg="Wait completed, proceeding to shutdown the manager" version=5.3.0-0
panic: failed to wait for postgrescluster caches to sync: timed out waiting for cache to be synced
goroutine 1 [running]:
main.assertNoError(...)
github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:42
main.main()
github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:118 +0x434
Answer 1
Score: 1
If this is a new deployment, I suggest using v5.
That said, as PGO manages the networking for Postgres clusters (and as such, manages listen_addresses), there is no reason to modify the listen_addresses configuration parameter. If you need to manage networking or network access, you can do that by setting the pg_hba config or using NetworkPolicies.
Please go through Custom 'listen_addresses' not applied #2904 for more information.
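For illustration, a minimal NetworkPolicy sketch that only allows application pods to reach Postgres on port 5432; the policy name, namespace, and label selectors below are assumptions for the example, not values from the PGO docs, so adjust them to your cluster:
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-to-postgres          # hypothetical policy name
  namespace: postgres-operator         # adjust to the namespace of your PostgresCluster
spec:
  podSelector:
    matchLabels:
      postgres-operator.crunchydata.com/cluster: hippo   # assumed cluster name/label
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: my-app                  # hypothetical client application label
    ports:
    - protocol: TCP
      port: 5432
EOF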
CrashLoopBackOff: check the pod logs for configuration or deployment issues, such as missing dependencies (for example, Kubernetes does not support docker-compose's depends_on, so we now use Kubernetes + Docker without nginx), and also check for pods being OOM-killed and for excessive resource usage.
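A minimal sketch of commands that usually surface this kind of detail (pod name and namespace are taken from your question; adjust to your cluster):
kubectl describe pod pgo-98c6b8888-fz8zj -n postgres-operator      # check Last State / Reason (e.g. OOMKilled) and Events
kubectl logs pgo-98c6b8888-fz8zj -n postgres-operator --previous   # logs from the previous, crashed container
kubectl get events -n postgres-operator --sort-by=.lastTimestamp   # recent events in the namespace
kubectl top pod -n postgres-operator                               # resource usage (requires metrics-server)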
Also check for timeout issues and see the lab on the timeout problem.
ERROR: cannot verify 10.96.0.1's certificate, issued by ‘CN=kubernetes’:
Unable to locally verify the issuer's authority.
To connect to 10.96.0.1 insecurely, use `--no-check-certificate'.
Try this solution for the error above:
First, remove the flannel.1 ip link on every host that has this problem.
Second, delete kube-flannel-ds from k8s.
Last, recreate kube-flannel-ds in k8s; the flannel.1 ip link will be recreated and come back healthy (a command-level sketch follows the note below).
(For flannel to work correctly, you must pass --pod-network-cidr=10.244.0.0/16
to kubeadm init. In other words, change the CIDR.)
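A rough command-level sketch of those steps; the DaemonSet name and namespace are taken from your pod listing, while the manifest URL is the upstream flannel default and is an assumption, so adjust it if flannel was installed differently:
sudo ip link delete flannel.1                                      # run on every node that has the problem
kubectl delete daemonset kube-flannel-ds -n kube-flannel           # remove the existing flannel DaemonSet
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml   # recreate it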
Edit:
Please check this similar issue and solution, which may help to resolve your issue.
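Regarding the PodDisruptionBudget "forbidden" errors in your update, a quick way to confirm what the operator's service account is actually allowed to do (the service account name is taken from your log lines; these are standard kubectl commands, offered as a diagnostic sketch):
kubectl auth can-i list poddisruptionbudgets.policy --as=system:serviceaccount:postgres-operator:pgo   # should print "yes" once RBAC is correct
kubectl get clusterrolebindings -o wide | grep postgres-operator:pgo                                   # find which ClusterRole (if any) is bound to the pgo service account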