PySpark Iterate Rows and Drop Rows with Specified Value

# Question

I have a dataframe like this:

| Column A | Column B |
| -------- | -------- |
| Hello | [{id: 1000, abbreviatedId: 1, name: "John", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "China", Capital: "Bejing"}, otherId: 400, language: "Cantonese", species: 23409, creature: "Human"}] |
| Bye | [{id: 2000, abbreviatedId: 2, name: "James", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "Russia", Capital: "Moscow"}, otherId: 500, language: "Russian", species: 12308, creature: "Human"}] |

How do I iterate through the rows of the dataframe to drop all rows with `country: "China"` before writing to an external location?

I have tried

```python
if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
    df.write.format("delta").mode("overwrite").save("file://path/")
```

and

```python
for row in df.rdd.collect():
    if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
        df.drop(row)

df.write.format("delta").mode("overwrite").save("file://path/")
```
# Answer 1

**Score**: 1

You can loop through the rows and, in each row, look up the continent and then the country inside it.

Here's the example code:

```python
import pandas as pd

# Assuming your DataFrame is named df
# Iterate through the rows of the DataFrame
for index, row in df.iterrows():
    # Access the value in Column B, which contains the dictionary
    dict_value = row['Column B']
    # Check if the 'country' key in the dictionary is "China"
    if dict_value[0]['continent']['country'] == 'China':
        # Drop the row if the condition is met
        df.drop(index, inplace=True)

# After iterating through all the rows, write the DataFrame to an external location
# Example: Writing to a CSV file
df.to_csv('output.csv', index=False)
```

Hope it helps.

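Note that `iterrows` is a pandas API, while the `df` in the question is a Spark DataFrame, so it would first have to be brought into pandas. A minimal sketch, assuming the data is small enough to collect to the driver:

```python
# toPandas() collects all rows to the driver, so this only suits small data.
pdf = df.toPandas()

# Keep rows whose first Column B entry is not from China.
pdf = pdf[pdf["Column B"].apply(lambda v: v[0]["continent"]["country"] != "China")]
```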

# Answer 2

**Score**: 0

One way is to use the `exists` array function.

```python
from pyspark.sql.functions import expr
from pyspark.sql import Row

df = spark.createDataFrame([
    [
        [
            Row(**{"id": 1000, "abbreviatedId": 1, "name": "John", "planet": "Earth", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 33, "country": "China", "Capital": "Bejing"}), "otherId": 400, "language": "Cantonese", "species": 23409, "creature": "Human"}),
            Row(**{"id": 1001, "abbreviatedId": 2, "name": "Alex", "planet": "Mars", "solarsystem": "Milky Way", "universe": "this one", "continent": Row(**{"id": 34, "country": "Japan", "Capital": "Tokyo"}), "otherId": 400, "language": "Japanese", "species": 23409, "creature": "Human"})
        ]
    ]], ["b"])

df.filter(expr("not exists(b, x -> x.continent.country == 'China')"))
```

The syntax `Row(**dict)` will create an instance of `Row` through argument unpacking.
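As a quick illustration of that unpacking (a standalone sketch, not part of the original answer):

```python
from pyspark.sql import Row

# **-unpacking turns the dict's key/value pairs into keyword arguments,
# so both lines construct the same Row:
r1 = Row(**{"country": "Japan", "Capital": "Tokyo"})
r2 = Row(country="Japan", Capital="Tokyo")
assert r1 == r2  # Rows compare like tuples of their values
```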

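To finish the task from the question, the filtered result can then be written out. A minimal sketch reusing the question's Delta write (the path `file://path/` is the asker's placeholder):

```python
# Drop every row whose array column b contains an element with
# continent.country == 'China', then persist the rest as Delta.
filtered = df.filter(expr("not exists(b, x -> x.continent.country == 'China')"))
filtered.write.format("delta").mode("overwrite").save("file://path/")
```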
