PySpark executing queries from different processes
Question
Is there any way to have two separate processes executing queries on Spark? Something like:

```python
from multiprocessing import Process

from pyspark.sql import SparkSession


def process_1():
    spark_context = SparkSession.builder.getOrCreate()
    data1 = spark_context.sql("SELECT * FROM table_1").toPandas()
    do_processing(data1)


def process_2():
    spark_context = SparkSession.builder.getOrCreate()
    data2 = spark_context.sql("SELECT * FROM table_2").toPandas()
    do_processing(data2)


p1 = Process(target=process_1)
p1.start()
p2 = Process(target=process_2)
p2.start()
p1.join()
p2.join()
```

The question is how to create a separate `SparkContext` for each process, or how to share one context between the processes.
Answer 1
Score: 1
PySpark holds its context as a singleton object:

> Only one `SparkContext` should be active per JVM. You must `stop()` the
> active `SparkContext` before creating a new one.
>
> `SparkContext` instance is not supported to share across multiple
> processes out of the box, and PySpark does not guarantee multi-processing
> execution. Use threads instead for concurrent processing purpose.
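A minimal sketch of the thread-based approach, sharing the singleton session across threads; `table_1`, `table_2`, and `do_processing` are placeholders carried over from the question:

```python
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession


def run_query(query):
    # getOrCreate() returns the same singleton SparkSession in every
    # thread, so all queries share one SparkContext.
    spark = SparkSession.builder.getOrCreate()
    return spark.sql(query).toPandas()


def do_processing(df):
    # Placeholder for the asker's processing logic.
    print(df.head())


with ThreadPoolExecutor(max_workers=2) as executor:
    results = executor.map(run_query, ["SELECT * FROM table_1",
                                       "SELECT * FROM table_2"])
    for df in results:
        do_processing(df)
```

Spark's scheduler is thread-safe, so the two `sql()` calls simply become concurrent jobs within the one application.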
As for the "out of memory" problem in your code: it could be caused by `DF.toPandas()`, which collects the entire result to the driver and significantly increases memory usage. Consider writing the loaded data to Parquet files instead and optimizing the computation with pyarrow.
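A rough sketch of that suggestion, assuming a hypothetical table `table_1` and output path `/tmp/table_1.parquet`:

```python
import pyarrow.parquet as pq

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write the query result straight to Parquet instead of collecting it with
# toPandas(), so the full result set is never materialized on the driver.
# The table name and output path are placeholders.
spark.sql("SELECT * FROM table_1") \
    .write.mode("overwrite") \
    .parquet("/tmp/table_1.parquet")

# Read the Parquet directory back with pyarrow; Arrow's columnar layout
# keeps the conversion to pandas comparatively cheap.
df = pq.read_table("/tmp/table_1.parquet").to_pandas()
```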