pyspark fill values with join instead of isin

# Question

I want to fill a [tag:pyspark] dataframe on rows where several column values are found in another dataframe's columns, but I cannot use `.collect().distinct()` and `.isin()`, since that takes a long time compared to a join.
How can I use a join or a broadcast when filling values conditionally?
In [tag:pandas] I would do:
```python
df.loc[(df.A.isin(df2.A)) | (df.B.isin(df2.B)), 'new_column'] = 'new_value'
```
UPD: so far I have tried the following approach in pyspark, but judging by `.count()` before and after, it is not working correctly (the row count is artificially decreased).

```python
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

count_first = df.count()
# rows whose A is NOT found in df2: new_column stays null
dfA_1 = df.join(df2, 'A', 'leftanti') \
    .withColumn('new_column', F.lit(None).cast(StringType()))
# rows whose A IS found in df2: new_column gets the new value
dfA_2 = df.join(df2, 'A', 'inner') \
    .withColumn('new_column', F.lit('new_value'))
df = dfA_1.unionByName(dfA_2)
count_second = df.count()
count_first - count_second
```
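For reference, a hedged variant of the same leftanti/inner split that joins on just the distinct key values of `df2`, so the inner join cannot multiply rows of `df` and the two halves partition `df` exactly. The names `keys`, `df_no_match`, `df_match`, and `df_out` are introduced here for illustration:

```python
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

# one row per key value, and no extra columns that could break unionByName
keys = df2.select('A').distinct()

count_first = df.count()
df_no_match = df.join(keys, 'A', 'leftanti') \
    .withColumn('new_column', F.lit(None).cast(StringType()))
df_match = df.join(keys, 'A', 'inner') \
    .withColumn('new_column', F.lit('new_value'))
df_out = df_no_match.unionByName(df_match)
# leftanti + inner on distinct keys partition df exactly, so counts must agree
assert count_first == df_out.count()
```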

How can I achieve the same in [tag:pyspark], but with a join?




# Answer 1
**Score**: 2

Please check the example below; I think you can do something similar:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = [
    {"A": "1", "B": "1"},
    {"A": "2", "B": "2"},
]

df2 = [
    {"A": "1"},
    {"A": "22"},
]

df = spark.createDataFrame(df)
# add a marker column so that, after a left join, matched rows can be told apart
df2 = spark.createDataFrame(df2).withColumn("C", F.lit("dummy"))

count_first = df.count()
# a left join keeps every row of df; C is null exactly where A had no match in df2
df.join(df2, "A", "left").withColumn(
    "newValue", F.when(F.col("C").isNull(), F.lit(None)).otherwise(F.lit("new_value"))
).drop("C").show()
```

The output is:

```
+---+---+---------+
|  A|  B| newValue|
+---+---+---------+
|  1|  1|new_value|
|  2|  2|     null|
+---+---+---------+
```
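The question also asks about matching on either `A` or `B`, and about broadcasting. A hedged sketch extending the same marker-column idea to both columns, assuming `df2` also has a `B` column (as in the pandas snippet from the question) and wrapping it in `F.broadcast` since the lookup dataframe is small:

```python
import pyspark.sql.functions as F

# one marker column per key we want to test membership on
# (assumes df2 has both an A and a B column)
a_keys = df2.select("A").distinct().withColumn("a_hit", F.lit(1))
b_keys = df2.select("B").distinct().withColumn("b_hit", F.lit(1))

result = (
    df.join(F.broadcast(a_keys), "A", "left")
      .join(F.broadcast(b_keys), "B", "left")
      .withColumn(
          "new_column",
          # when() without otherwise() leaves null where neither key matched,
          # mirroring the unfilled rows of the pandas version
          F.when(F.col("a_hit").isNotNull() | F.col("b_hit").isNotNull(),
                 F.lit("new_value")),
      )
      .drop("a_hit", "b_hit")
)
```

Joining on deduplicated single-column keys keeps the row count of `df` unchanged, which avoids the count discrepancy described in the UPD.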
