PySpark:从现有数据框架中查询特定一组ID的最有效方法

huangapple go评论100阅读模式
英文:

PySpark: Most efficient way to query from DB for a specific set of ids from an existing data frame

问题

我有一个PySpark dataframe,其中包含这些字段:ID1、ID2和DATE。对于每个ID1、ID2和DATE,我需要编写一个SQL语句来提取一个新字段,其中包括ID1、ID2和DATE在WHERE子句中。

在编写这个代码的时候,最高效的方式是什么?

我目前的想法是从第一个PySpark dataframe中提取ID1、ID2和DATE字段,并将它们存储在某种列表中,然后循环遍历每个3个字段的组合(ID1、ID2、DATE),在循环内部使用SQL语句。我想知道是否有一种方法可以只执行一个SQL语句,以处理原始PySpark dataframe中的所有ID1、ID2和DATE字段。

英文:

I have a PySpark dataframe with these fields: ID1, ID2 and DATE. For each ID1, ID2, and DATE I need to write an SQL statement to extract a new field where I include ID1, ID2 and DATE in the WHERE clause.

What's the most efficient way to code this?

My idea at the moment is to extract ID1, ID2 and DATE fields from the first PySpark dataframe in a list of some sort, and then loop through each 3-pair (ID1, ID2, DATE) and have a SQL statement within the loop. I'm curious if there is a way to only execute 1 SQL statement that does it for all ID1, ID2 and DATE fields from the original PySpark dataframe.

答案1

得分: 1

不必为数据帧的每一行生成一个SQL查询,而是为了优化Spark作业,建议使用Spark之间的连接查询,连接的目标表也可以读取为数据帧,基于你提到的3个字段(ID1、ID2和DATE)进行连接查询。

使用这种方法会使Spark作业变得更加高效,因为:

  • 无需收集(将所有数据以列表形式发送到Spark驱动程序),因此所有转换都将由Spark执行器以并行方式执行。
  • 你只需运行一个连接查询,而不是与你的第一个数据帧包含的行数一样多。

示例代码如下:

  1. joined_df = df1.join(df2, on=["ID1", "ID2", "DATE"], how="inner")
英文:

Instead of generating an SQL query for each of your dataframe lines, why don't you just do a join query using Spark between your dataframe and the target table (that you can read as a dataframe too) based on the 3 fields you mentionned (ID1, ID2 and DATE).

Using such method will make Spark job optimal since:

  • You won't need to collect (send all your data into the Spark driver in form of a list) so all your transformations will be executed by the Spark executors in a parallel manner
  • You'll run only one join query instead of as much rows your first dataframe contains

Example here:

  1. joined_df = df1.join(df2, on=["ID1", "ID2", "DATE"], how="inner")

huangapple
  • 本文由 发表于 2023年5月25日 14:50:27
  • 转载请务必保留本文链接:https://go.coder-hub.com/76329584.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定