PySpark Iterate Rows and Drop Rows with Specified Value

# Question

I have a dataframe like this:

| Column A | Column B |
| -------- | -------- |
| Hello | [{id: 1000, abbreviatedId: 1, name: "John", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "China", Capital: "Beijing"}, otherId: 400, language: "Cantonese", species: 23409, creature: "Human"}] |
| Bye | [{id: 2000, abbreviatedId: 2, name: "James", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "Russia", Capital: "Moscow"}, otherId: 500, language: "Russian", species: 12308, creature: "Human"}] |

How do I iterate through the rows of the dataframe and drop all rows with `country: "China"` before writing to an external location?

I have tried

```python
if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
    df.write.format("delta").mode("overwrite").save("file://path/")
```

and

```python
for row in df.rdd.collect():
    if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
        df.drop(row)
df.write.format("delta").mode("overwrite").save("file://path/")
```
# Answer 1

**Score**: 1

You can loop through the rows and, in each row, look up `continent` and then `country` inside it. Note that `iterrows()` is a pandas method, not a PySpark one, so this assumes `df` is a pandas DataFrame whose `Column B` cells hold lists of dictionaries.

Here's the example code:

```python
import pandas as pd

# Assuming df is a pandas DataFrame

# Collect the index labels of matching rows instead of dropping while iterating
to_drop = []
for index, row in df.iterrows():
    # Column B holds a list of dictionaries
    dict_value = row['Column B']
    # Check whether the nested 'country' value of the first entry is "China"
    if dict_value[0]['continent']['country'] == "China":
        to_drop.append(index)

# Drop all matching rows in one call
df = df.drop(index=to_drop)

# Write the remaining rows to an external location, e.g. a CSV file
df.to_csv('output.csv', index=False)
```

Hope it helps.
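The loop above only inspects the first dictionary in each list. As a rough sketch of the same idea without an explicit row loop, a boolean mask can check every dictionary in the cell. The sample data and the helper name `has_country` are assumptions made for illustration, trimmed to the fields that matter here.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's two rows (relevant fields only)
df = pd.DataFrame({
    "Column A": ["Hello", "Bye"],
    "Column B": [
        [{"id": 1000, "name": "John", "continent": {"id": 33, "country": "China"}}],
        [{"id": 2000, "name": "James", "continent": {"id": 33, "country": "Russia"}}],
    ],
})


def has_country(records, country):
    """Return True if any dictionary in the list has the given continent.country."""
    return any(r["continent"]["country"] == country for r in records)


# Boolean mask: True for rows that mention China anywhere in Column B
mask = df["Column B"].apply(lambda records: has_country(records, "China"))

# Keep only the rows where the mask is False
filtered = df[~mask].reset_index(drop=True)
```

Here `filtered` holds only the rows whose `Column B` list contains no entry for China, and it can be written out with `to_csv` as in the loop version.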
# Answer 2

**Score**: 0

One way is to use the `exists` array function.

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import expr

# Reuse the active session (e.g. in a notebook) or create one
spark = SparkSession.builder.getOrCreate()

# One row whose column b is an array of two structs
df = spark.createDataFrame([
    [
        [
            Row(**{"id": 1000, "abbreviatedId": 1, "name": "John", "planet": "Earth", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 33, "country": "China", "Capital": "Beijing"}), "otherId": 400, "language": "Cantonese", "species": 23409, "creature": "Human"}),
            Row(**{"id": 1001, "abbreviatedId": 2, "name": "Alex", "planet": "Mars", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 34, "country": "Japan", "Capital": "Tokyo"}), "otherId": 400, "language": "Japanese", "species": 23409, "creature": "Human"})
        ]
    ]], ["b"])

# filter returns a new DataFrame that keeps only rows whose array b
# has no element with continent.country equal to 'China'
df.filter(expr("not exists(b, x -> x.continent.country == 'China')"))
```

The syntax `Row(**dict)` creates a `Row` instance through argument unpacking.
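To make the row-level effect of the `not exists(...)` predicate easier to see, here is a plain-Python model of it. This is only a sketch of the semantics, not Spark code: `keep_row` and the sample `rows` are hypothetical names, the data is trimmed to the relevant fields, and SQL null handling is ignored.

```python
def keep_row(b, country="China"):
    """Model of `not exists(b, x -> x.continent.country == country)` for one row."""
    return not any(x["continent"]["country"] == country for x in b)


# Hypothetical rows: each has an array column b of nested records
rows = [
    # Mixed array: one China entry is enough to drop the whole row
    {"b": [{"name": "John", "continent": {"country": "China"}},
           {"name": "Alex", "continent": {"country": "Japan"}}]},
    # No China entry: the row is kept
    {"b": [{"name": "James", "continent": {"country": "Russia"}}]},
    # Empty array: nothing matches, so the row is kept
    {"b": []},
]

kept = [row for row in rows if keep_row(row["b"])]
```

The first row matches the sample in the answer: its array holds both a China and a Japan entry, so the filter removes the entire row rather than just the matching element.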