Convert a DataFrame to a dictionary in PySpark

Question


I have a table like the one below:

  |item_name|item_value|timestamp                   |idx|description|
  +---------+----------+----------------------------+---+-----------+
  |A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
  |B        |0.34      |2023-03-01T17:20:00.000+0000|0  |null       |
  |C        |0.34      |2023-03-01T17:20:00.000+0000|0  |C&         |
  |D        |0.45      |2023-03-01T17:20:00.000+0000|0  |unitD&     |
  |A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
  |B        |0.54      |2023-03-01T17:25:00.000+0000|1  |null       |
  |C        |0.3       |2023-03-01T17:25:00.000+0000|1  |C&         |
  |D        |0.21      |2023-03-01T17:25:00.000+0000|1  |unitD&     |
  |A        |0.54      |2023-03-01T17:30:00.000+0000|2  |unitA&     |

and I want to convert it to a dict with this structure:

  {'unitA&': {0: 0.25, 1: 0.3, 2: 0.54},
   'C&': {0: 0.34, 1: 0.3},
   'unitD&': {0: 0.45, 1: 0.21}}

How can I do this with PySpark?

Any help would be appreciated!

Answer 1

Score: 1


You can use the built-in Spark functions map_from_entries, collect_list, and struct to get the result you need.

Example:

  from pyspark.sql.functions import col, collect_list, lit, map_from_entries, struct

  df.show(20, False)
  # sample data
  # +---------+----------+----------------------------+---+-----------+
  # |item_name|item_value|timestamp                   |idx|description|
  # +---------+----------+----------------------------+---+-----------+
  # |A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
  # |B        |0.34      |2023-03-01T17:20:00.000+0000|0  |null       |
  # |C        |0.34      |2023-03-01T17:20:00.000+0000|0  |C&         |
  # |D        |0.45      |2023-03-01T17:20:00.000+0000|0  |unitD&     |
  # |A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
  # |B        |0.54      |2023-03-01T17:25:00.000+0000|1  |null       |
  # |C        |0.3       |2023-03-01T17:25:00.000+0000|1  |C&         |
  # |D        |0.21      |2023-03-01T17:25:00.000+0000|1  |unitD&     |
  # |A        |0.54      |2023-03-01T17:30:00.000+0000|2  |unitA&     |
  # +---------+----------+----------------------------+---+-----------+

  df = (
      df.filter(col("description").isNotNull())
      # per description, build a map of idx -> item_value
      .groupBy("description")
      .agg(map_from_entries(collect_list(struct(col("idx"), col("item_value")))).alias("temp_desc"))
      # collapse everything into a single row holding a map of description -> inner map
      .groupBy(lit(1))
      .agg(map_from_entries(collect_list(struct(col("description"), col("temp_desc")))).alias("a"))
      .select("a")
  )
  # collect()[0][0] already returns a Python dict, so no str() conversion is needed
  req_dict = df.collect()[0][0]
  print(req_dict)

Result:

  {'unitD&': {0: 0.45, 1: 0.21}, 'C&': {0: 0.34, 1: 0.3}, 'unitA&': {0: 0.25, 1: 0.3, 2: 0.54}}
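As an alternative sketch (not part of the original answer): since the final dictionary ends up on the driver anyway, you could also collect the filtered rows and assemble the nested dict in plain Python. This is only appropriate when the filtered result fits in driver memory. The `rows` list below is an assumption standing in for the output of a `df.filter(col("description").isNotNull()).select("description", "idx", "item_value").collect()` call.

```python
from collections import defaultdict

# Stand-in for the collected (description, idx, item_value) rows;
# in practice these would come from df.collect() on the filtered DataFrame.
rows = [
    ("unitA&", 0, 0.25), ("C&", 0, 0.34), ("unitD&", 0, 0.45),
    ("unitA&", 1, 0.3), ("C&", 1, 0.3), ("unitD&", 1, 0.21),
    ("unitA&", 2, 0.54),
]

# Build {description: {idx: item_value}} in a single pass.
result = defaultdict(dict)
for description, idx, item_value in rows:
    result[description][idx] = item_value

print(dict(result))
# {'unitA&': {0: 0.25, 1: 0.3, 2: 0.54}, 'C&': {0: 0.34, 1: 0.3}, 'unitD&': {0: 0.45, 1: 0.21}}
```

The trade-off versus the map_from_entries approach is that the grouping work happens on the driver instead of being distributed across the cluster.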

huangapple
  • Published on 2023-03-12 08:08:40
  • Please keep this link when reposting: https://go.coder-hub.com/75710324.html