Selecting all columns from a specific dataset after Join

Question

I have the following code. How can I select all the columns from df3 only, after the join?

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

# Keep only the most recent row per (agency, agencyid), ranked by load_date
df_dayOne = spark.read.option("header", True).csv("dbfs:/FileStore/tables/file1.csv")
windowSpec = Window.partitionBy("agency", "agencyid").orderBy(col("load_date").desc())
df2 = df_dayOne.withColumn("row_number", row_number().over(windowSpec))
df3 = df2.filter(df2.row_number == 1).drop("row_number")

Now I want to join the datasets as:

df_join = df3.join(df0, df3.accountid == df0.accountid, "left")

Now I want to select all the columns from df3:

.select("df3.*")

But this gives me an error saying "A column or function parameter with name `df3`.`agencyid` cannot be resolved".

Please advise.

Answer 1

Score: 0

I would recommend working with aliases. You'll likely run into select issues if you have duplicate column names - which appears to be the case here. With an alias, you can select all the desired columns just as you described.
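
For illustration (a hypothetical repro, assuming both frames carry an accountid column as in your snippet): after the un-aliased join, referencing the duplicated key by its bare name is ambiguous and Spark rejects it:

df_join = df3.join(df0, df3.accountid == df0.accountid, "left")
# Fails with an "ambiguous reference" error: accountid exists in both df3 and df0
df_join.select("accountid")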

Over time I've taken to setting aliases on all my joins in PySpark by default - it just makes things easier.

So your code would look something like this:

from pyspark.sql.functions import col

df_join = (
  df3.alias('a')
  .join(
    df0.alias('b'),
    col('a.accountid') == col('b.accountid'),
    'left'  # keeping the left join from your question
  )
  .select('a.*')  # all columns from df3 only
)
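
If you'd rather skip aliases, PySpark also accepts a star qualified by the DataFrame object itself; a minimal sketch of that alternative:

# Qualify the star-select with the DataFrame object: df3["*"] expands to
# every column that originated from df3, even after the join.
df_join = (
  df3.join(df0, df3.accountid == df0.accountid, 'left')
  .select(df3["*"])
)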
