convert dataframe to dictionary pyspark

Question
I have a table like this below
+---------+----------+----------------------------+---+-----------+
|item_name|item_value|timestamp                   |idx|description|
+---------+----------+----------------------------+---+-----------+
|A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
|B        |0.34      |2023-03-01T17:20:00.000+0000|0  |null       |
|C        |0.34      |2023-03-01T17:20:00.000+0000|0  |C&         |
|D        |0.45      |2023-03-01T17:20:00.000+0000|0  |unitD&     |
|A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
|B        |0.54      |2023-03-01T17:25:00.000+0000|1  |null       |
|C        |0.3       |2023-03-01T17:25:00.000+0000|1  |C&         |
|D        |0.21      |2023-03-01T17:25:00.000+0000|1  |unitD&     |
|A        |0.54      |2023-03-01T17:30:00.000+0000|2  |unitA&     |
+---------+----------+----------------------------+---+-----------+
and I want to convert it to a dict with this structure:
{'unitA&': {0: 0.25, 1: 0.3, 2: 0.54},
 'C&': {0: 0.34, 1: 0.3},
 'unitD&': {0: 0.45, 1: 0.21}}
How can I do this with PySpark?
Any help would be appreciated!
Answer 1
Score: 1
You can use the built-in Spark functions map_from_entries, collect_list, and struct to get the required result.

Example:
from pyspark.sql.functions import col, collect_list, lit, map_from_entries, struct

df.show(20, False)
# sample data
# +---------+----------+----------------------------+---+-----------+
# |item_name|item_value|timestamp                   |idx|description|
# +---------+----------+----------------------------+---+-----------+
# |A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
# |B        |0.34      |2023-03-01T17:20:00.000+0000|0  |null       |
# |C        |0.34      |2023-03-01T17:20:00.000+0000|0  |C&         |
# |D        |0.45      |2023-03-01T17:20:00.000+0000|0  |unitD&     |
# |A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
# |B        |0.54      |2023-03-01T17:25:00.000+0000|1  |null       |
# |C        |0.3       |2023-03-01T17:25:00.000+0000|1  |C&         |
# |D        |0.21      |2023-03-01T17:25:00.000+0000|1  |unitD&     |
# |A        |0.54      |2023-03-01T17:30:00.000+0000|2  |unitA&     |
# +---------+----------+----------------------------+---+-----------+

# drop rows with null description, build an {idx: item_value} map per
# description, then collapse everything into one {description: map} map
df = (
    df.filter(col("description").isNotNull())
    .groupBy("description")
    .agg(map_from_entries(collect_list(struct(col("idx"), col("item_value")))).alias("temp_desc"))
    .groupBy(lit(1))
    .agg(map_from_entries(collect_list(struct(col("description"), col("temp_desc")))).alias("a"))
    .select("a")
)

req_dict = str(df.collect()[0][0])
print(req_dict)
Result:
{'unitD&': {0: 0.45, 1: 0.21}, 'C&': {0: 0.34, 1: 0.3}, 'unitA&': {0: 0.25, 1: 0.3, 2: 0.54}}
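If the table is small enough to collect to the driver anyway, the same nested dict can also be built in plain Python after a simple collect. A minimal sketch of that logic, with the sample rows hard-coded as tuples standing in for df.select("description", "idx", "item_value").collect():

```python
from collections import defaultdict

# Hard-coded sample rows: (description, idx, item_value),
# mirroring the table in the question.
rows = [
    ("unitA&", 0, 0.25), (None, 0, 0.34), ("C&", 0, 0.34), ("unitD&", 0, 0.45),
    ("unitA&", 1, 0.3),  (None, 1, 0.54), ("C&", 1, 0.3),  ("unitD&", 1, 0.21),
    ("unitA&", 2, 0.54),
]

result = defaultdict(dict)
for description, idx, value in rows:
    if description is not None:  # same as the isNotNull() filter above
        result[description][idx] = value

print(dict(result))
# {'unitA&': {0: 0.25, 1: 0.3, 2: 0.54}, 'C&': {0: 0.34, 1: 0.3}, 'unitD&': {0: 0.45, 1: 0.21}}
```

This moves all rows to the driver, so it only makes sense for small results; the map_from_entries aggregation above does the grouping on the executors first.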