Running spark-client snap, executor pod won't start up on specific node


Question

I'm running microk8s (installed via snap install microk8s --classic) across a local multi-node cluster, with spark-client (installed via snap install spark-client --edge). Two of the nodes are WSL2 (Ubuntu) on Windows 11. I'm now adding a third node, a laptop running Ubuntu. When I try to run spark-client.spark-shell ..., it spins up executors successfully on the two WSL nodes, but fails on the new laptop node. I know the laptop node is capable of running pods, because an hdfs pod runs there successfully.

When an executor pod fails, spark-shell immediately deletes it and creates a new one, so it's hard to see the error information. I managed to capture a log, and it contained only one line:
error: unknown command "executor", see 'pebble help'.
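Since the failing pod is replaced almost immediately, a small polling loop can grab its logs before it disappears. This is a rough sketch, assuming kubectl is available (with microk8s, `microk8s kubectl` works too) and that the executors carry Spark's standard spark-role=executor label; both are assumptions, and it exits quietly when no cluster is reachable:

```shell
#!/bin/sh
# Sketch: repeatedly dump logs from short-lived Spark executor pods before
# spark-shell deletes them. NS and SELECTOR are assumptions -- adjust them
# to whatever `kubectl get pods` shows for your run.
kubectl version --request-timeout=2s >/dev/null 2>&1 \
  || { echo "no cluster reachable; skipping"; exit 0; }

NS=default
SELECTOR=spark-role=executor

for attempt in 1 2 3 4 5 6 7 8 9 10; do
  for pod in $(kubectl -n "$NS" get pods -l "$SELECTOR" \
                 -o jsonpath='{.items[*].metadata.name}'); do
    echo "=== $pod ==="
    # --previous shows the last terminated container if it already crashed
    kubectl -n "$NS" logs "$pod" --previous 2>/dev/null \
      || kubectl -n "$NS" logs "$pod"
  done
  sleep 1
done
```

Running this in a second terminal while spark-shell starts up should catch the one-line error from each doomed executor.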

I notice that in the configuration for those pods there is an argument: executor, so that might be where that's coming from. But why would one node start up differently?
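To see exactly what command and arguments Kubernetes hands the container, you can dump the executor pod specs. A sketch, again assuming kubectl and the standard spark-role=executor label:

```shell
# Sketch: print each executor pod's container command and args, to confirm
# where the bare "executor" argument that pebble rejects comes from.
# Exits quietly when no cluster is reachable.
kubectl version --request-timeout=2s >/dev/null 2>&1 \
  || { echo "no cluster reachable; skipping"; exit 0; }

kubectl get pods -l spark-role=executor -o \
  jsonpath='{range .items[*]}{.metadata.name}{"\tcommand="}{.spec.containers[0].command}{"\targs="}{.spec.containers[0].args}{"\n"}{end}'
```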

The image is ghcr.io/canonical/charmed-spark:3.4.0-22.04_edge. I was able to run it directly.

Any ideas on how to resolve or further troubleshoot this?

Note: I did see two other similar questions here, but they don't have this particular error message, so I think this question is distinct.

Update: I just noticed in the node details that the sha256 digest for that image differs across the nodes.
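One way to confirm the digest mismatch is to read the image list each node reports in its status. A sketch, assuming kubectl and jq are installed (with microk8s, `microk8s kubectl` works too); it exits quietly when either is missing:

```shell
# Sketch: show which charmed-spark image names/digests each node has cached,
# to spot a node that resolved the same _edge tag to a different build.
kubectl version --request-timeout=2s >/dev/null 2>&1 \
  || { echo "no cluster reachable; skipping"; exit 0; }
command -v jq >/dev/null 2>&1 || { echo "jq not found; skipping"; exit 0; }

kubectl get nodes -o json | jq -r '
  .items[]
  | .metadata.name as $node
  | .status.images[]?
  | select(any(.names[]?; test("charmed-spark")))
  | "\($node)\t\(.names | join("  "))"'
```

If two nodes print different sha256 values for the same tag, they pulled different builds of the image.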

Answer 1

Score: 1

Good news and bad news...

I deleted the images on all the nodes:

microk8s.ctr images delete ghcr.io/canonical/charmed-spark:3.4.0-22.04_edge

to force it to pull the latest.

Now all the nodes behave the same: they all fail. I assume some bug was introduced last week. The latest edge release was last week (6/8), after I pulled the image for the old nodes but before I pulled it for the new node.

Mystery solved, though with no fix: I can't pull an older version via snap, because it's published on the same channel. I'll find something else to use.

Filed bug: https://github.com/canonical/spark-client-snap/issues/68
There's an immediate workaround in the bug conversation, though it sounds like they'll have a fix out soon.
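Until a fixed build lands, one general way to keep nodes consistent (not necessarily the workaround from the bug thread) is to pin the executor image by digest instead of the moving _edge tag, so every node resolves the identical build. A sketch using the standard Spark-on-Kubernetes spark.kubernetes.container.image property; the digest below is a placeholder (substitute a real one from `microk8s.ctr images ls`), and whether the snap's wrapper forwards --conf this way is an assumption:

```shell
# Sketch: pin the container image by digest so all nodes run the same build.
# The digest is a placeholder -- substitute one from `microk8s.ctr images ls`.
command -v spark-client.spark-shell >/dev/null 2>&1 \
  || { echo "spark-client snap not installed; skipping"; exit 0; }

DIGEST="sha256:0000000000000000000000000000000000000000000000000000000000000000"
spark-client.spark-shell \
  --conf spark.kubernetes.container.image="ghcr.io/canonical/charmed-spark@${DIGEST}"
```

Pulling by digest bypasses tag resolution entirely, so a later push to the _edge tag can no longer split the cluster.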

huangapple
  • Published on 2023-06-15 03:35:55
  • Please retain this link when republishing: https://go.coder-hub.com/76477006.html