DataFrame: how to look up values in another table and return blank if not found (Databricks, PySpark)
Question

I'm working in Databricks to create a data frame based on two existing tables. I need to look up values in the 2nd table and return the value if found; if not found, return blank.

For example:
df1:

Material  KG
1         10
2         50
3         100
4         80
5         60

df2:

Material  Conversion
1         1.5
3         5
5         12

I'd like to have the below df3:

Material  KG   Conversion
1         10   1.5
2         50
3         100  5
4         80
5         60   12

I used the code below to add the column, but it only joins rows found in df2, so it shows only Materials 1, 3, and 5 and drops Materials 2 and 4. I still want to keep 1-5:

df3 = df1.join(df2, ["Material"]).withColumnRenamed("Material", "Material_new").drop("Material_new")
display(df3)
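For illustration, the behavior being described can be sketched in plain Python without Spark (the variable names here are hypothetical): Spark's default inner join keeps a row only when its Material also appears in df2, which is why Materials 2 and 4 disappear.

```python
# Inner-join semantics sketched with plain Python structures.
df1_rows = [(1, 10), (2, 50), (3, 100), (4, 80), (5, 60)]   # (Material, KG)
df2_map = {1: 1.5, 3: 5.0, 5: 12.0}                          # Material -> Conversion

# A row survives only if its Material has a match in df2.
inner = [(m, kg, df2_map[m]) for m, kg in df1_rows if m in df2_map]
print(inner)  # [(1, 10, 1.5), (3, 100, 5.0), (5, 60, 12.0)]
```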


Answer 1

Score: 0

Try a left join for this case.

Example:

df = spark.createDataFrame([(1,10),(2,50),(3,100),(4,80),(5,60)], ['Material','KG'])
df1 = spark.createDataFrame([(1,1.5),(3,5.0),(5,12.0)], ['Material','Conversion'])
# 'left' keeps every row of df, filling Conversion with null where unmatched
df.join(df1, ['Material'], 'left').show(10, False)
#+--------+---+----------+
#|Material|KG |Conversion|
#+--------+---+----------+
#|1       |10 |1.5       |
#|2       |50 |null      |
#|3       |100|5.0       |
#|4       |80 |null      |
#|5       |60 |12.0      |
#+--------+---+----------+
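The left-join behavior shown above can be sketched without Spark as a dict lookup with a None fallback (a plain-Python illustration; the names are hypothetical):

```python
df1_rows = [(1, 10), (2, 50), (3, 100), (4, 80), (5, 60)]   # (Material, KG)
conv_map = {1: 1.5, 3: 5.0, 5: 12.0}                         # Material -> Conversion

# Left-join semantics: keep every row; unmatched keys get None (Spark's null).
left = [(m, kg, conv_map.get(m)) for m, kg in df1_rows]
print(left)  # [(1, 10, 1.5), (2, 50, None), (3, 100, 5.0), (4, 80, None), (5, 60, 12.0)]
```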

To replace null with ' ' (a blank):

from pyspark.sql.functions import col, lit, when

df = spark.createDataFrame([(1,10),(2,50),(3,100),(4,80),(5,60)], ['Material','KG'])
df1 = spark.createDataFrame([(1,1.5),(3,5.0),(5,12.0)], ['Material','Conversion'])
(df.join(df1, ['Material'], 'left')
   .withColumn("conversion", when(col("conversion").isNull(), lit(" "))
                             .otherwise(col("conversion")))
   .show(10, False))
#+--------+---+----------+
#|Material|KG |conversion|
#+--------+---+----------+
#|1       |10 |1.5       |
#|2       |50 |          |
#|3       |100|5.0       |
#|4       |80 |          |
#|5       |60 |12.0      |
#+--------+---+----------+
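The when/otherwise step maps null to a single space; the same substitution in plain Python (hypothetical names). Note that in Spark this also turns Conversion into a string column, since a double column cannot hold ' '.

```python
joined = [(1, 10, 1.5), (2, 50, None), (3, 100, 5.0), (4, 80, None), (5, 60, 12.0)]

# Mirror when(isNull, lit(' ')).otherwise(col): substitute ' ' for None.
blanked = [(m, kg, ' ' if conv is None else conv) for m, kg, conv in joined]
print(blanked)  # [(1, 10, 1.5), (2, 50, ' '), (3, 100, 5.0), (4, 80, ' '), (5, 60, 12.0)]
```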

Posted by huangapple on 2023-06-08 11:03:29. Original link: https://go.coder-hub.com/76428340.html