PySpark executing queries from different processes

Question

Is there any way to have two separate processes executing queries on Spark? Something like:

from multiprocessing import Process

from pyspark.sql import SparkSession


def process_1():
    spark_context = SparkSession.builder.getOrCreate()
    data1 = spark_context.sql("SELECT * FROM table1").toPandas()
    do_processing(data1)


def process_2():
    spark_context = SparkSession.builder.getOrCreate()
    data2 = spark_context.sql("SELECT * FROM table2").toPandas()
    do_processing(data2)


p1 = Process(target=process_1)
p1.start()
p2 = Process(target=process_2)
p2.start()

p1.join()
p2.join()

The problem is how to create a separate SparkContext for each process, or how to pass a single context between processes.

Answer 1

Score: 1

PySpark holds its Context as a singleton object.
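
This is easy to verify in a single process: repeated calls to the builder hand back the already-running session instead of creating a new one.

from pyspark.sql import SparkSession

s1 = SparkSession.builder.getOrCreate()
s2 = SparkSession.builder.getOrCreate()
assert s1 is s2  # getOrCreate() returned the already-active session

The documentation is explicit about this: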

> Only one SparkContext should be active per JVM. You must stop()
> the active SparkContext before creating a new one.
>
> SparkContext instance is not supported to share across multiple
> processes out of the box, and PySpark does not guarantee
> multi-processing execution. Use threads instead for concurrent
> processing purpose.
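
A minimal sketch of the thread-based alternative, reusing the table names and the do_processing placeholder from your question: one shared session on the driver, with each query submitted from its own thread.

from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def fetch(query):
    # All threads share the single driver-side session; the resulting
    # Spark jobs still run concurrently on the cluster.
    return spark.sql(query).toPandas()

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(fetch, "SELECT * FROM table1")
    f2 = pool.submit(fetch, "SELECT * FROM table2")
    do_processing(f1.result())
    do_processing(f2.result())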

As for your "out of memory" problem (in your code): that could be caused by DF.toPandas(), which significantly increases memory usage.
Consider writing the loaded data into Parquet files and optimizing computations with pyarrow functionality.
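
A sketch of that route, under two assumptions: /tmp/table1_snapshot is a writable path, and col_a/col_b stand in for whichever columns you actually need. Spark writes the result set as Parquet, and pyarrow reads back only the required columns instead of pulling everything through toPandas().

import pyarrow.parquet as pq

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let Spark write the result to disk instead of collecting it on the driver.
spark.sql("SELECT * FROM table1").write.mode("overwrite").parquet("/tmp/table1_snapshot")

# Load only the needed columns back into pandas via pyarrow.
table = pq.read_table("/tmp/table1_snapshot", columns=["col_a", "col_b"])
pdf = table.to_pandas()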

huangapple
  • Posted on 2023-01-09 01:51:05
  • When reprinting, please keep this link: https://go.coder-hub.com/75050090.html