July 11, 2023, 06:17:53

PySpark Iterate Rows and Drop Rows with Specified Value

# Question

I have a DataFrame like this:

| Column A | Column B |
| -------- | -------- |
| Hello    | [{id: 1000, abbreviatedId: 1, name: "John", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "China", Capital: "Beijing"}, otherId: 400, language: "Cantonese", species: 23409, creature: "Human"}] |
| Bye      | [{id: 2000, abbreviatedId: 2, name: "James", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "Russia", Capital: "Moscow"}, otherId: 500, language: "Russian", species: 12308, creature: "Human"}] |

How do I iterate through the rows of the DataFrame and drop every row containing `country: "China"` before writing to an external location?

I have tried

```python
if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
    df.write.format("delta").mode("overwrite").save("file://path/")
```

and

```python
for row in df.rdd.collect():
    if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
        df.drop(row)
df.write.format("delta").mode("overwrite").save("file://path/")
```


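# Answer 1
**Score**: 1

You can loop through the rows and, in each row, look up `continent` and then `country` inside it. Note that `iterrows` is a pandas method, so the code assumes `df` is a pandas DataFrame rather than a Spark one.

Here's the example code:

```python
import pandas as pd

# Assuming your pandas DataFrame is named df

# Collect the labels of matching rows first; modifying a DataFrame while iterating over it is unsafe
rows_to_drop = []
for index, row in df.iterrows():
    # Access the value in Column B, which holds a list containing a dictionary
    dict_value = row['Column B']

    # Check whether the 'country' key of the first entry is "China"
    if dict_value[0]['continent']['country'] == 'China':
        rows_to_drop.append(index)

# Drop all matching rows in a single call
df = df.drop(rows_to_drop)

# Write the remaining rows to an external location, for example a CSV file
df.to_csv('output.csv', index=False)
```

Hope it helps.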
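
The lookup `dict_value[0]['continent']['country']` raises an error when a cell is empty or an entry has no `continent` key. Below is a minimal sketch of a more defensive lookup, using plain Python dictionaries in place of DataFrame rows. The helper name `first_country`, the trimmed-down sample rows, and the extra "Empty" row are assumptions made for illustration and are not part of the original answer.

```python
def first_country(cell):
    """Return the country of the first entry in a cell, or None if it is absent."""
    if not cell:
        return None
    return cell[0].get("continent", {}).get("country")


# Hypothetical rows mirroring the question's two-column table
rows = [
    {"Column A": "Hello", "Column B": [{"id": 1000, "continent": {"id": 33, "country": "China"}}]},
    {"Column A": "Bye", "Column B": [{"id": 2000, "continent": {"id": 33, "country": "Russia"}}]},
    {"Column A": "Empty", "Column B": []},
]

# Build a new list instead of deleting from the one being iterated
kept = [row for row in rows if first_country(row["Column B"]) != "China"]

print([row["Column A"] for row in kept])  # ['Bye', 'Empty']
```

With pandas, the same helper could probably be passed to `df['Column B'].apply(first_country)` to build a mask of rows to keep.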


# Answer 2
**Score**: 0

One way is to use the `exists` array function, which tests whether any element of an array satisfies a predicate.

```python
from pyspark.sql.functions import expr
from pyspark.sql import Row

# Assumes an active SparkSession named spark
# Sample data: a single row whose array column "b" holds two structs
df = spark.createDataFrame([
    [
        [
            Row(**{"id": 1000, "abbreviatedId": 1, "name": "John", "planet": "Earth", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 33, "country": "China", "Capital": "Beijing"}), "otherId": 400, "language": "Cantonese", "species": 23409, "creature": "Human"}),
            Row(**{"id": 1001, "abbreviatedId": 2, "name": "Alex", "planet": "Mars", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 34, "country": "Japan", "Capital": "Tokyo"}), "otherId": 400, "language": "Japanese", "species": 23409, "creature": "Human"}),
        ]
    ]
], ["b"])

# Keep only the rows whose array contains no element with country "China"
filtered = df.filter(expr("not exists(b, x -> x.continent.country == 'China')"))
```

The syntax `Row(**dict)` creates a `Row` instance by unpacking the dictionary into keyword arguments.
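
To make the semantics concrete, the predicate `not exists(b, x -> x.continent.country == 'China')` can be modeled for a single row in plain Python with `any`. The sketch below only illustrates the logic and is not Spark code; the function name `keep_row` and the abbreviated dictionaries are assumptions.

```python
def keep_row(b, country="China"):
    """Model of `not exists(b, x -> x.continent.country == country)` for one row."""
    return not any(x["continent"]["country"] == country for x in b)


# Abbreviated versions of the arrays from the question and from the sample data
china_only = [{"id": 1000, "continent": {"country": "China"}}]
russia_only = [{"id": 2000, "continent": {"country": "Russia"}}]
china_and_japan = [
    {"id": 1000, "continent": {"country": "China"}},
    {"id": 1001, "continent": {"country": "Japan"}},
]

print(keep_row(china_only))       # False: the row is dropped
print(keep_row(russia_only))      # True: the row is kept
print(keep_row(china_and_japan))  # False: one matching element is enough to drop the row
```

Because a single matching element is enough, the one-row sample DataFrame in this answer would come out of the filter empty.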

