AWS EMR in private subnets
Question
I'm creating an EMR cluster in a private subnet and am struggling to get the cluster to come up properly.
I have NAT gateways in all of my public subnets, and each private subnet's route table has a route to the NAT gateway in its AZ. In the EMR cluster configuration, I leave every optional setting blank for now to get the simplest starting configuration.
I am using a VPC with 2 private subnets and one public subnet, with the AWS-created security groups for primary/core/task. My instance type is m1.small. The behavior I am observing is as follows:
The instance creation process starts, then hangs for around 1 hour before finally failing with a cryptic error:
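The routing described here can be checked against the output of the EC2 DescribeRouteTables API. The following is a minimal sketch, assuming the response has already been fetched (for example with boto3's `describe_route_tables`). The helper name `subnets_without_nat_default_route` and all IDs in the sample data are hypothetical.

```python
def subnets_without_nat_default_route(route_tables, private_subnet_ids):
    """Return private subnets that lack an active 0.0.0.0/0 route to a NAT gateway.

    `route_tables` is the "RouteTables" list from a DescribeRouteTables response.
    Only explicit subnet associations are considered; a subnet that silently
    falls back to the VPC's main route table is reported as uncovered.
    """
    covered = set()
    for table in route_tables:
        has_nat_default = any(
            route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and route.get("NatGatewayId")
            and route.get("State") == "active"
            for route in table.get("Routes", [])
        )
        if not has_nat_default:
            continue
        for assoc in table.get("Associations", []):
            if assoc.get("SubnetId"):
                covered.add(assoc["SubnetId"])
    return sorted(set(private_subnet_ids) - covered)


# Hypothetical sample data shaped like a DescribeRouteTables response.
SAMPLE_ROUTE_TABLES = [
    {
        "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
            {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0example", "State": "active"},
        ],
        "Associations": [{"SubnetId": "subnet-private-a"}],
    },
    {
        "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
        ],
        "Associations": [{"SubnetId": "subnet-private-b"}],
    },
]

missing = subnets_without_nat_default_route(
    SAMPLE_ROUTE_TABLES, ["subnet-private-a", "subnet-private-b"]
)
print(missing)
```

An empty result would mean every listed private subnet has an explicit route table with a default route through a NAT gateway. It says nothing about security groups or network ACLs.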
On the master instance (i-01191fd75d02d1257), application provisioning failed
I'm not sure what this indicates beyond the obvious, which is that provisioning failed. I don't want to start a job; I just want to get the primary/core nodes up and running, and this error doesn't give me much to work with in finding the root cause.
I see the following error in the instance-controller.log file:
AppPoller-Bg-Thread-2: Delay for 2 seconds before retry attempt 1/5 on http://10.0.152.217:8088/ws/v1/cluster/nodes
java.net.ConnectException: Connection refused (Connection refused)
But I'm not sure what this means and can't find any information about it on Google.
The bootstrap/master.log file has the following:
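For context, 8088 is the default port of the YARN ResourceManager web interface and REST API, so "Connection refused" suggests that nothing was listening on that port when the instance controller polled it. A minimal sketch for probing a port from the node itself follows. The helper name `port_is_open` is hypothetical, and the address in the usage comment is the one from the log above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


# Example usage on the primary node (address taken from the log above):
# port_is_open("10.0.152.217", 8088)
```

If this returns False on the primary node, the ResourceManager most likely never started. That points at application startup on the node, not at the network path between nodes.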
2023-06-27 12:26:24,662 INFO i-01191fd75d02d1257: new instance started
2023-06-27 12:26:24,918 ERROR i-01191fd75d02d1257: failed to start. bootstrap action 1 failed with non-zero exit code.
2023-06-27 12:27:15,008 INFO i-093e926ba8e684d7c: new instance started
2023-06-27 12:27:20,833 INFO i-093e926ba8e684d7c: all bootstrap actions complete and instance ready
which would indicate that it has finished bootstrapping, but the EMR console still shows this primary node as Bootstrapping, so there seems to be a disconnect here.
I see this error in the /emr/instance-controller/log/hadoop-commands/ directory:
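Note that the four log lines above refer to two different instance IDs, and the one that failed its bootstrap action (i-01191fd75d02d1257) is the same instance named in the provisioning error. Grouping the lines by instance makes this easier to see. A minimal sketch follows, assuming every line has the timestamp, level, and instance ID layout shown above. The names `LINE_RE` and `group_by_instance` are hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 2023-06-27 12:26:24,662 INFO i-01191fd75d02d1257: new instance started
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<instance>i-[0-9a-f]+): (?P<message>.*)$"
)


def group_by_instance(log_text):
    """Map each instance ID to its (level, message) pairs, in log order."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events[match["instance"]].append((match["level"], match["message"]))
    return dict(events)


LOG = """\
2023-06-27 12:26:24,662 INFO i-01191fd75d02d1257: new instance started
2023-06-27 12:26:24,918 ERROR i-01191fd75d02d1257: failed to start. bootstrap action 1 failed with non-zero exit code.
2023-06-27 12:27:15,008 INFO i-093e926ba8e684d7c: new instance started
2023-06-27 12:27:20,833 INFO i-093e926ba8e684d7c: all bootstrap actions complete and instance ready
"""

events = group_by_instance(LOG)
for instance_id, entries in events.items():
    levels = [level for level, _ in entries]
    print(instance_id, levels)
```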
report: FileSystem file:/// is not an HDFS file system. The fs class is: org.apache.hadoop.fs.LocalFileSystem
Usage: hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning] [-enteringmaintenance] [-inmaintenance]
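The `dfsadmin` message means the Hadoop client resolved the default filesystem to `file:///`, which is Hadoop's built-in default when `fs.defaultFS` is not set. Reading that property from `core-site.xml` shows whether the Hadoop configuration was ever written. A minimal sketch follows. The path in the usage comment is an assumption about where the config lives on the node, and the sample XML (including the hostname) is made up for illustration.

```python
import xml.etree.ElementTree as ET


def get_hadoop_property(site_xml, name):
    """Return the value of property `name` from a Hadoop *-site.xml string, or None."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Hypothetical sample; a real file would be read from disk, e.g.
# open("/etc/hadoop/conf/core-site.xml").read()  (path is an assumption)
SAMPLE_CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ip-10-0-0-1.ec2.internal:8020</value>
  </property>
</configuration>
"""

default_fs = get_hadoop_property(SAMPLE_CORE_SITE, "fs.defaultFS")
print(default_fs)
```

A result of None, or a value starting with `file:`, would be consistent with the error above and with Hadoop never having been configured on the node.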
The EMR instance state log has the following message:
Sleeping for a random period of time up to 10 minutes
Answer 1
Score: 0
Try m6a.xlarge (3rd-generation AMD EPYC processors) if you need software or libraries that require the x86 architecture, or m7g.xlarge (AWS Graviton3 processors), instead of the M1 type, which is a previous generation.
https://aws.amazon.com/ec2/instance-types/