pyspark fill values with join instead of isin
# Question
I want to fill a [tag:pyspark] dataframe on rows where several column values are found in another dataframe's columns, but I cannot use `.collect().distinct()` and `.isin()`, since that takes a long time compared to a join.

How can I use a join or a broadcast when filling values conditionally?

In [tag:pandas] I would do:
```python
df.loc[(df.A.isin(df2.A)) | (df.B.isin(df2.B)), 'new_column'] = 'new_value'
```

UPD: so far I have tried the following approach in pyspark, but judging by `.count()` before and after, it does not work correctly (the row count is artificially changed):

```python
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

count_first = df.count()
dfA_1 = df.join(df2, 'A', 'leftanti') \
    .withColumn('new_column', F.lit(None).cast(StringType()))
dfA_2 = df.join(df2, 'A', 'inner') \
    .withColumn('new_column', F.lit('new_value'))
df = dfA_1.unionByName(dfA_2)
count_second = df.count()
count_first - count_second
```
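For what it's worth, a likely reason the counts drift in the attempt above: if `df2` contains duplicate values of `A`, the inner branch multiplies matching rows, and any extra columns in `df2` change the schema of that branch before the union. A minimal sketch that joins against the de-duplicated key column only (variable names are illustrative, not from the original post):

```python
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

# One row per distinct key value, with no extra df2 columns to leak into the result
keys = df2.select('A').distinct()

df_no_match = df.join(keys, 'A', 'leftanti') \
    .withColumn('new_column', F.lit(None).cast(StringType()))
df_match = df.join(keys, 'A', 'inner') \
    .withColumn('new_column', F.lit('new_value'))

df_marked = df_no_match.unionByName(df_match)
assert df_marked.count() == df.count()  # the union now preserves the row count
```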
How can I achieve the same in [tag:pyspark], but with a join?
# Answer 1

**Score**: 2

Please check the example below; I think you can do something similar:
```python
import pyspark.sql.functions as F
df = [
{"A": "1", "B": "1"},
{"A": "2", "B": "2"},
]
df2 = [
{"A": "1"},
{"A": "22"},
]
df = spark.createDataFrame(df)
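# add a dummy marker column "C"; after the left join it is non-null only on matched rows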
df2 = spark.createDataFrame(df2).withColumn("C", F.lit("dummy"))
count_first = df.count()
df.join(df2, "A", "left").withColumn(
"newValue", F.when(F.col("C").isNull(), F.lit(None)).otherwise(F.lit("new_value"))
).drop("C").show()
```

The output is:

```
+---+---+---------+
|  A|  B| newValue|
+---+---+---------+
|  1|  1|new_value|
|  2|  2|     null|
+---+---+---------+
```
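The dummy column `C` works as a match marker: after the left join it is non-null exactly on the rows whose `A` appears in `df2`. The original question also checks column `B`; one way to extend the same pattern to the OR condition, assuming `df2` really has a `B` column as in the question (an untested sketch, with `in_A`/`in_B` as illustrative marker names):

```python
import pyspark.sql.functions as F

# One small, de-duplicated marker table per key column; broadcast keeps the
# joins cheap when df2 is small
marks_a = df2.select('A').distinct().withColumn('in_A', F.lit(True))
marks_b = df2.select('B').distinct().withColumn('in_B', F.lit(True))

result = (
    df.join(F.broadcast(marks_a), 'A', 'left')
      .join(F.broadcast(marks_b), 'B', 'left')
      # new_value where A or B matched; when() without otherwise() yields null
      .withColumn('new_column',
                  F.when(F.col('in_A') | F.col('in_B'), F.lit('new_value')))
      .drop('in_A', 'in_B')
)
```

Rows whose `A` or `B` value appears in `df2` get `'new_value'`, and everything else stays null, matching the pandas `.loc` one-liner from the question.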