convert dataframe to dictionary pyspark
Question
I have a table like the one below:
+---------+----------+----------------------------+---+-----------+
|item_name|item_value|timestamp                   |idx|description|
+---------+----------+----------------------------+---+-----------+
|A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
|B        |0.34      |2023-03-01T17:20:00.000+0000|0  |null       |
|C        |0.34      |2023-03-01T17:20:00.000+0000|0  |C&         |
|D        |0.45      |2023-03-01T17:20:00.000+0000|0  |unitD&     |
|A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
|B        |0.54      |2023-03-01T17:25:00.000+0000|1  |null       |
|C        |0.3       |2023-03-01T17:25:00.000+0000|1  |C&         |
|D        |0.21      |2023-03-01T17:25:00.000+0000|1  |unitD&     |
|A        |0.54      |2023-03-01T17:30:00.000+0000|2  |unitA&     |
+---------+----------+----------------------------+---+-----------+
and I want to convert it to a dict with this structure:
{'unitA&': {0: 0.25, 1: 0.3, 2: 0.54},
 'C&': {0: 0.34, 1: 0.3},
 'unitD&': {0: 0.45, 1: 0.21}}
How can I do this with PySpark?
Any help would be appreciated!
Answer 1
Score: 1
You can use the built-in Spark functions map_from_entries, collect_list, and struct to get the required result.
Example:
from pyspark.sql.functions import col, lit, map_from_entries, collect_list, struct

df.show(20, False)
# sample data
#+---------+----------+----------------------------+---+-----------+
#|item_name|item_value|timestamp                   |idx|description|
#+---------+----------+----------------------------+---+-----------+
#|A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
#|B        |0.34      |2023-03-01T17:20:00.000+0000|0  |null       |
#|C        |0.34      |2023-03-01T17:20:00.000+0000|0  |C&         |
#|D        |0.45      |2023-03-01T17:20:00.000+0000|0  |unitD&     |
#|A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
#|B        |0.54      |2023-03-01T17:25:00.000+0000|1  |null       |
#|C        |0.3       |2023-03-01T17:25:00.000+0000|1  |C&         |
#|D        |0.21      |2023-03-01T17:25:00.000+0000|1  |unitD&     |
#|A        |0.54      |2023-03-01T17:30:00.000+0000|2  |unitA&     |
#+---------+----------+----------------------------+---+-----------+

# Two aggregation steps:
# 1. per description, collect (idx, item_value) pairs into a map
# 2. collect all (description, map) pairs into a single outer map
df = df.filter(col("description").isNotNull()) \
    .groupBy("description") \
    .agg(map_from_entries(collect_list(struct(col("idx"), col("item_value")))).alias("temp_desc")) \
    .groupBy(lit(1)) \
    .agg(map_from_entries(collect_list(struct(col("description"), col("temp_desc")))).alias("a")) \
    .select("a")

# collect() materializes the MapType column as a plain Python dict on the driver
req_dict = df.collect()[0][0]
print(req_dict)
Result:
{'unitD&': {0: 0.45, 1: 0.21}, 'C&': {0: 0.34, 1: 0.3}, 'unitA&': {0: 0.25, 1: 0.3, 2: 0.54}}
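For intuition, here is a plain-Python sketch of what the two Spark aggregations compute, run on the collected rows (the `rows` list below is hypothetical sample data matching the table; this driver-side approach is only reasonable when the data fits in memory):

```python
# Rows as (item_name, item_value, idx, description) tuples.
rows = [
    ("A", 0.25, 0, "unitA&"), ("B", 0.34, 0, None), ("C", 0.34, 0, "C&"),
    ("D", 0.45, 0, "unitD&"), ("A", 0.3, 1, "unitA&"), ("B", 0.54, 1, None),
    ("C", 0.3, 1, "C&"), ("D", 0.21, 1, "unitD&"), ("A", 0.54, 2, "unitA&"),
]

result = {}
for item_name, item_value, idx, description in rows:
    if description is None:  # mirrors the isNotNull() filter
        continue
    # Group by description (outer map), then map idx -> item_value (inner map),
    # just like the nested map_from_entries(collect_list(struct(...))) calls.
    result.setdefault(description, {})[idx] = item_value

print(result)
# {'unitA&': {0: 0.25, 1: 0.3, 2: 0.54}, 'C&': {0: 0.34, 1: 0.3}, 'unitD&': {0: 0.45, 1: 0.21}}
```

The Spark version does the same grouping, but distributed: the inner `groupBy("description")` builds each per-description map on the executors, and the final `groupBy(lit(1))` merges them into one row that `collect()` brings back as a dict.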