pyspark fill values with join instead of isin
# Question
I want to fill a [tag:pyspark] dataframe on rows where several column values are found in another dataframe's columns, but I cannot use `.collect().distinct()` and `.isin()`, since that takes a long time compared to a join.

How can I use a join or a broadcast when filling values conditionally?

In [tag:pandas] I would do:
```python
df.loc[(df.A.isin(df2.A)) | (df.B.isin(df2.B)), 'new_column'] = 'new_value'
```

UPD: so far I have tried the following approach in pyspark, but judging by `.count()` before and after, it does not work correctly (the row count is artificially changed):

```python
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

count_first = df.count()
dfA_1 = df.join(df2, 'A', 'leftanti') \
    .withColumn('new_column', F.lit(None).cast(StringType()))
dfA_2 = df.join(df2, 'A', 'inner') \
    .withColumn('new_column', F.lit('new_value'))
df = dfA_1.unionByName(dfA_2)
count_second = df.count()
count_first - count_second
```
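For what it's worth, a likely reason the counts drift in the attempt above: if `df2` contains duplicate values of `A`, the inner branch multiplies matching rows, and any extra columns in `df2` change the schema of that branch before the union. A minimal sketch that joins against the de-duplicated key column only (variable names are illustrative, not from the original post):

```python
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

# One row per distinct key value, with no extra df2 columns to leak into the result
keys = df2.select('A').distinct()

df_no_match = df.join(keys, 'A', 'leftanti') \
    .withColumn('new_column', F.lit(None).cast(StringType()))
df_match = df.join(keys, 'A', 'inner') \
    .withColumn('new_column', F.lit('new_value'))

df_marked = df_no_match.unionByName(df_match)
assert df_marked.count() == df.count()  # the union now preserves the row count
```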
How can I achieve the same in [tag:pyspark], but with a join?
# Answer 1

**Score**: 2

Please check the example below; I think you can do something similar:
```python
import pyspark.sql.functions as F
df = [
{"A": "1", "B": "1"},
{"A": "2", "B": "2"},
]
df2 = [
{"A": "1"},
{"A": "22"},
]
df = spark.createDataFrame(df)
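# add a dummy marker column "C"; after the left join it is non-null only on matched rows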
df2 = spark.createDataFrame(df2).withColumn("C", F.lit("dummy"))
count_first = df.count()
df.join(df2, "A", "left").withColumn(
"newValue", F.when(F.col("C").isNull(), F.lit(None)).otherwise(F.lit("new_value"))
).drop("C").show()
```

The output is:

```
+---+---+---------+
|  A|  B| newValue|
+---+---+---------+
|  1|  1|new_value|
|  2|  2|     null|
+---+---+---------+
```
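The dummy column `C` works as a match marker: after the left join it is non-null exactly on the rows whose `A` appears in `df2`. The original question also checks column `B`; one way to extend the same pattern to the OR condition, assuming `df2` really has a `B` column as in the question (an untested sketch, with `in_A`/`in_B` as illustrative marker names):

```python
import pyspark.sql.functions as F

# One small, de-duplicated marker table per key column; broadcast keeps the
# joins cheap when df2 is small
marks_a = df2.select('A').distinct().withColumn('in_A', F.lit(True))
marks_b = df2.select('B').distinct().withColumn('in_B', F.lit(True))

result = (
    df.join(F.broadcast(marks_a), 'A', 'left')
      .join(F.broadcast(marks_b), 'B', 'left')
      # new_value where A or B matched; when() without otherwise() yields null
      .withColumn('new_column',
                  F.when(F.col('in_A') | F.col('in_B'), F.lit('new_value')))
      .drop('in_A', 'in_B')
)
```

Rows whose `A` or `B` value appears in `df2` get `'new_value'`, and everything else stays null, matching the pandas `.loc` one-liner from the question.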