PySpark Iterate Rows and Drop Rows with Specified Value

# Question

I have a dataframe like this

| Column A | Column B |
| -------- | -------- |
| Hello    | [{id: 1000, abbreviatedId: 1, name: "John", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "China", Capital: "Beijing"}, otherId: 400, language: "Cantonese", species: 23409, creature: "Human"}] |
| Bye      | [{id: 2000, abbreviatedId: 2, name: "James", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "Russia", Capital: "Moscow"}, otherId: 500, language: "Russian", species: 12308, creature: "Human"}] |


How do I iterate through the rows of the dataframe to drop all rows with `country: "China"` before writing to an external location?

I have tried

```python
if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
    df.write.format("delta").mode("overwrite").save("file://path/")
```

and

```python
for row in df.rdd.collect():
    if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
        df.drop(row)

df.write.format("delta").mode("overwrite").save("file://path/")
```



# Answer 1
**Score**: 1

You can loop through the rows and then, in each row, find the continent and then the country within it.

Here's the example code:

```python
import pandas as pd

# Assuming your DataFrame is named df

# Iterate through the rows of the DataFrame
for index, row in df.iterrows():
    # Access the value in Column B, which contains the dictionary
    dict_value = row['Column B']

    # Check whether the 'country' key in the dictionary is 'China'
    if dict_value[0]['continent']['country'] == 'China':
        # Drop the row if the condition is met
        df.drop(index, inplace=True)

# After iterating through all the rows, write the DataFrame to an external location
# Example: writing to a CSV file
df.to_csv('output.csv', index=False)
```
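
Note that the question's `df` is a PySpark DataFrame, while this answer iterates with pandas' `iterrows()`. A minimal bridging sketch, assuming the whole dataset fits in driver memory (the `spark_df` name is illustrative):

```python
# Hypothetical bridge: pull the PySpark DataFrame into pandas first.
pdf = spark_df.toPandas()

# ...run the iterrows()/drop() loop above on pdf...

# If the Delta write from the question is still needed, convert back:
spark.createDataFrame(pdf).write.format("delta").mode("overwrite").save("file://path/")
```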

Hope it helps.


# Answer 2

**Score**: 0

One way is using the `exists` array function.

```python
from pyspark.sql.functions import expr
from pyspark.sql import Row

df = spark.createDataFrame([
    [
        [
            Row(**{"id": 1000, "abbreviatedId": 1, "name": "John", "planet": "Earth", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 33, "country": "China", "Capital": "Beijing"}), "otherId": 400, "language": "Cantonese", "species": 23409, "creature": "Human"}),
            Row(**{"id": 1001, "abbreviatedId": 2, "name": "Alex", "planet": "Mars", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 34, "country": "Japan", "Capital": "Tokyo"}), "otherId": 400, "language": "Japanese", "species": 23409, "creature": "Human"})
        ]
    ]
], ["b"])

df.filter(expr("not exists(b, x -> x.continent.country == 'China')"))
```
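
To match the question's goal of writing the filtered result out, a short sketch reusing the question's Delta path (`filtered` is an illustrative name):

```python
# Keep only rows whose array column contains no struct with country == 'China',
# then write the result to the question's Delta location.
filtered = df.filter(expr("not exists(b, x -> x.continent.country == 'China')"))
filtered.write.format("delta").mode("overwrite").save("file://path/")
```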

The syntax `Row(**dict)` creates an instance of `Row` through argument unpacking.
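
For illustration, a tiny self-contained example of that unpacking (variable names are arbitrary):

```python
from pyspark.sql import Row

d = {"id": 34, "country": "Japan", "Capital": "Tokyo"}
r = Row(**d)      # equivalent to Row(id=34, country="Japan", Capital="Tokyo")
print(r.country)  # Japan
```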

