Dataframe: how to lookup values in another table, if not found, return blank/Databricks, pyspark
Question
I'm working in Databricks to create a data frame based on two existing tables. I need to look up values in the 2nd table and return the value; if not found, return blank.
For example:
df1:
Material | KG |
---|---|
1 | 10 |
2 | 50 |
3 | 100 |
4 | 80 |
5 | 60 |
df2:
Material | Conversion |
---|---|
1 | 1.5 |
3 | 5 |
5 | 12 |
I'd like to have the below df3:
Material | KG | Conversion |
---|---|---|
1 | 10 | 1.5 |
2 | 50 | |
3 | 100 | 5 |
4 | 80 | |
5 | 60 | 12 |
I used the below to add the column, but it joins only the rows found in df2, so it only shows Materials 1, 3, 5 and drops Materials 2 and 4; I still want to keep 1-5:
df3 = df1.join(df2,["Material"]).withColumnRenamed("Material", "Material_new").drop("Material_new")
display(df3)
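For reference, `DataFrame.join` defaults to an inner join when no join type is given, which is why the unmatched Materials disappear. A plain-Python sketch of that inner-join behavior (simple lists and dicts, not Spark):

```python
# Plain-Python model of the inner join: only Materials present in BOTH
# tables survive, which is why 2 and 4 vanish from df3.
df1 = [(1, 10), (2, 50), (3, 100), (4, 80), (5, 60)]   # (Material, KG)
df2 = {1: 1.5, 3: 5.0, 5: 12.0}                        # Material -> Conversion
df3 = [(m, kg, df2[m]) for m, kg in df1 if m in df2]
# df3 keeps only Materials 1, 3 and 5
```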
Answer 1
Score: 0
Try with a **left** join for this case.
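In plain-Python terms (a sketch, not Spark), a left join keeps every row of the left table and fills the lookup column with None where no match exists:

```python
df1 = [(1, 10), (2, 50), (3, 100), (4, 80), (5, 60)]   # (Material, KG)
df2 = {1: 1.5, 3: 5.0, 5: 12.0}                        # Material -> Conversion
# dict.get returns None when the Material has no conversion, mirroring
# the null values a Spark left join produces.
df3 = [(m, kg, df2.get(m)) for m, kg in df1]
# -> [(1, 10, 1.5), (2, 50, None), (3, 100, 5.0), (4, 80, None), (5, 60, 12.0)]
```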
Example:
df = spark.createDataFrame([(1,10),(2,50),(3,100),(4,80),(5,60)],['Material','KG'])
df1 = spark.createDataFrame([(1,1.5),(3,5.0),(5,12.0)],['Material','Conversion'])
df.join(df1,['Material'],'left').show(10,False)
#+--------+---+----------+
#|Material|KG |Conversion|
#+--------+---+----------+
#|1 |10 |1.5 |
#|2 |50 |null |
#|3 |100|5.0 |
#|4 |80 |null |
#|5 |60 |12.0 |
#+--------+---+----------+
Replace null with ' ':
from pyspark.sql.functions import *
df = spark.createDataFrame([(1,10),(2,50),(3,100),(4,80),(5,60)],['Material','KG'])
df1 = spark.createDataFrame([(1,1.5),(3,5.0),(5,12.0)],['Material','Conversion'])
df.join(df1,['Material'],'left').\
withColumn("conversion",when(col("conversion").isNull(),lit(" ")).\
otherwise(col("conversion"))).\
show(10,False)
#+--------+---+----------+
#|Material|KG |conversion|
#+--------+---+----------+
#|1 |10 |1.5 |
#|2 |50 | |
#|3 |100|5.0 |
#|4 |80 | |
#|5 |60 |12.0 |
#+--------+---+----------+
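Because the when/otherwise mixes the string literal ' ' with the double Conversion column, Spark coerces the result column to string (note the values print as "1.5", "5.0" above). An equivalent, slightly shorter form (an assumption, not part of the original answer) is coalesce(col("Conversion").cast("string"), lit(" ")). The null-to-blank step itself reduces to:

```python
# Plain-Python sketch of the null-to-blank replacement (not Spark):
# keep the value as text if present, otherwise substitute a blank string.
rows = [(1, 10, 1.5), (2, 50, None), (3, 100, 5.0), (4, 80, None), (5, 60, 12.0)]
cleaned = [(m, kg, " " if conv is None else str(conv)) for m, kg, conv in rows]
# Materials 2 and 4 now carry a blank string instead of None
```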
Comments