Question

I am trying to use Python Multiprocessing to process data in parallel within the same AWS Glue 4.0 job. I know that I could use Glue Workflows with multiple jobs to achieve parallel data processing, but for reasons that are irrelevant here, it is something that I don't want to do.

This is my Python code:

from multiprocessing import Pool
import sys
import time
import random

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'TempDir'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
print(f"{args['JOB_NAME']} STARTED")

def worker(table_name, tmp_dir):
    print(f"STARTED WORKER: {table_name}")
    data = load_data(table_name, tmp_dir)
    process_data(table_name, data)
    print(f"FINISHED WORKER: {table_name}")

def load_data(table_name, tmp_dir):
    print(f"LOADING: {table_name}")
    data = glueContext.create_dynamic_frame.from_catalog(database="my_database",
                                                         table_name=table_name,
                                                         redshift_tmp_dir=f"{tmp_dir}/{table_name}",
                                                         transformation_ctx=f"data_source_{table_name}")
    time.sleep(random.randint(1, 5))  # added here to simulate different loading times
    print(f"LOADED: {table_name} has {data.count()} rows")
    return data

def process_data(table_name, data):
    print(f"PROCESSING: {table_name}")
    # do something
    time.sleep(random.randint(1, 5))  # added here to simulate different processing times
    print(f"DONE: {table_name}")

pool = Pool(4)
tables = ['TABLE1', 'TABLE2', 'TABLE3', 'TABLE4', 'TABLE5', 'TABLE6', 'TABLE7', 'TABLE8', 'TABLE9']
for table in tables:
    pool.apply_async(worker, args=(table, args['TempDir']))
pool.close()
pool.join()

print(f"{args['JOB_NAME']} COMPLETED")
job.commit()

Unfortunately, while it seems to start multiple workers correctly, the job then hangs and never completes; it only ends when the Glue job finally times out.

This is what I see in the CloudWatch output log. There are no errors in the error log.

2023-04-19T12:01:49.566+02:00	STARTED WORKER: TABLE1 LOADING: TABLE1
2023-04-19T12:01:49.566+02:00	STARTED WORKER: TABLE2 LOADING: TABLE2
2023-04-19T12:01:49.566+02:00	STARTED WORKER: TABLE3 LOADING: TABLE3 
2023-04-19T12:01:49.566+02:00   STARTED WORKER: TABLE4 LOADING: TABLE4
2023-04-19T12:01:49.603+02:00	STARTED WORKER: TABLE5 LOADING: TABLE5
2023-04-19T12:01:49.604+02:00	STARTED WORKER: TABLE6 LOADING: TABLE6
2023-04-19T12:01:49.607+02:00	STARTED WORKER: TABLE7 LOADING: TABLE7
2023-04-19T12:01:49.608+02:00	STARTED WORKER: TABLE8 LOADING: TABLE8
2023-04-19T12:01:49.609+02:00	STARTED WORKER: TABLE9 LOADING: TABLE9

I have tried several things, but I cannot understand exactly what the problem is, except that it seems to be hanging on create_dynamic_frame.from_catalog().

Has anybody attempted to do the same and solved it?
Why doesn't it work?

Thank you in advance!

Answer 1

Score: 1

After several attempts, and after adding extra debugging output and exception handling, I found out that Python's multiprocessing doesn't work with AWS Glue. The error I got from create_dynamic_frame.from_catalog() was "JsonOptions does not exist in the JVM", and I couldn't get any further than that.

However, replacing multiprocessing.Pool() with concurrent.futures.ThreadPoolExecutor() worked, and I can now process the tables in parallel within the same Glue job.
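
For illustration, here is a minimal sketch of the thread-based pattern. It is not the exact code from my job: load_table is a hypothetical stand-in for the real worker (which would call glueContext.create_dynamic_frame.from_catalog() and then process the data), and the "BROKEN" table name exists only to show how a failure is reported. A likely reason threads work where processes do not is that threads stay inside the driver process and reuse its existing connection to the JVM, but that is my assumption rather than something I verified.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def load_table(table_name):
    # Hypothetical stand-in for the real worker. In a Glue job this is where
    # glueContext.create_dynamic_frame.from_catalog(...) would be called.
    time.sleep(0.01)  # simulate some loading time
    if table_name == "BROKEN":
        raise ValueError(f"cannot load {table_name}")
    return f"{table_name} loaded"


def run_all(tables, max_workers=4):
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its table so failures can be attributed.
        futures = {executor.submit(load_table, t): t for t in tables}
        for future in as_completed(futures):
            table = futures[future]
            try:
                # result() re-raises any exception raised inside the worker,
                # so errors are not silently lost.
                results[table] = future.result()
            except Exception as exc:
                errors[table] = exc
    return results, errors


results, errors = run_all(["TABLE1", "TABLE2", "BROKEN"])
print(results)
print(errors)
```

Calling future.result() matters here: with pool.apply_async() in the original code, nothing ever asked for the results, so an exception raised in a worker would not have been reported.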
