Selecting all columns from a specific dataset after Join
Question
I have the following code. How can I select all the columns from df3 only after the join?
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

df_dayOne = spark.read.option("header", True).csv("dbfs:/FileStore/tables/file1.csv")
# Keep only the most recent load per (agency, agencyid)
windowSpec = Window.partitionBy("agency", "agencyid").orderBy(col("load_date").desc())
df2 = df_dayOne.withColumn("row_number", row_number().over(windowSpec))
df3 = df2.filter(df2.row_number == 1).drop("row_number")
Now I want to join the datasets as:
df_join = df3.join(df0, df3.accountid == df0.accountid, "left")
Now I want to select all the columns from df3:
.select("df3.*")
But this gives me an error saying "A column or function parameter with name `df3`.`agencyid`".
Please advise.
Answer 1
Score: 0
I would recommend working with aliases. You'll likely run into select issues should you have duplicate column names - which appears to be the case here. Using an alias, you can select all the desired columns just as you described.
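For context on why your original call fails: select("df3.*") is resolved against aliases registered with Spark, not against Python variable names, so the bare string "df3" means nothing to the analyzer. A minimal sketch, reusing df3 and df0 from the question - registering the alias explicitly also makes your original select work verbatim:

df_join = (
    df3.alias("df3")
    .join(df0.alias("df0"), col("df3.accountid") == col("df0.accountid"), "left")
    # "df3.*" now resolves because the alias "df3" is registered with Spark
    .select("df3.*")
)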
Over time I've made it my default to set aliases on all my joins in pyspark - it just makes things easier.
So your code would look something like this:
df_join = (
    df3.alias('a')
    .join(
        df0.alias('b'),
        # With both sides aliased, the duplicate accountid columns stay unambiguous
        col('a.accountid') == col('b.accountid'),
        'inner'
    )
    # 'a.*' selects every column from df3 only
    .select('a.*')
)
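Alternatively, if you'd rather skip aliases, PySpark also lets you select through the DataFrame reference itself - a minimal sketch, again assuming df3 and df0 from the question:

df_join = (
    df3.join(df0, df3.accountid == df0.accountid, "left")
    # df3["*"] expands to all of df3's columns, even though df_join holds both sides
    .select(df3["*"])
)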
Comments