Convert dataframe to dictionary in PySpark


Question

I have a table like the one below

+---------+----------+----------------------------+---+-----------+
|item_name|item_value|timestamp                   |idx|description|
+---------+----------+----------------------------+---+-----------+
|A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
|B        |0.34      |2023-03-01T17:20:00.000+0000|0  |null       |
|C        |0.34      |2023-03-01T17:20:00.000+0000|0  |C&         |
|D        |0.45      |2023-03-01T17:20:00.000+0000|0  |unitD&     |
|A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
|B        |0.54      |2023-03-01T17:25:00.000+0000|1  |null       |
|C        |0.3       |2023-03-01T17:25:00.000+0000|1  |C&         |
|D        |0.21      |2023-03-01T17:25:00.000+0000|1  |unitD&     |
|A        |0.54      |2023-03-01T17:30:00.000+0000|2  |unitA&     |
+---------+----------+----------------------------+---+-----------+

and I want to convert it to a dict with this structure

{'unitA&': {0: 0.25, 1: 0.3, 2: 0.54},
 'C&': {0: 0.34, 1: 0.3},
 'unitD&': {0: 0.45, 1: 0.21}}

How can I do this with PySpark?

Any help would be appreciated!


Answer 1

Score: 1

You can use the built-in Spark functions map_from_entries, collect_list, and struct to get the result you need.
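As a rough mental model (plain Python, not Spark): collect_list of structs gathers (key, value) pairs into an array, and map_from_entries then turns that array into a map, much like calling dict() on a list of tuples:

```python
# Plain-Python analogy (not Spark code):
# map_from_entries(collect_list(struct(idx, item_value))) behaves roughly
# like building a dict from a list of (key, value) pairs.
rows = [(0, 0.25), (1, 0.3), (2, 0.54)]  # (idx, item_value) pairs for one description

# "collect_list(struct(...))" ~ the list of tuples above;
# "map_from_entries(...)"    ~ dict(...) over that list
inner_map = dict(rows)
print(inner_map)  # {0: 0.25, 1: 0.3, 2: 0.54}
```

The answer below applies this twice: once to build the inner {idx: item_value} maps, and once more to wrap them under their description keys.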

Example:

from pyspark.sql.functions import col, collect_list, lit, map_from_entries, struct

df.show(20, False)
#sampledata
#+---------+----------+----------------------------+---+-----------+
#|item_name|item_value|timestamp                   |idx|description|
#+---------+----------+----------------------------+---+-----------+
#|A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
#|B        |0.34      |2023-03-01T17:20:00.000+0000|0  |null       |
#|C        |0.34      |2023-03-01T17:20:00.000+0000|0  |C&         |
#|D        |0.45      |2023-03-01T17:20:00.000+0000|0  |unitD&     |
#|A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
#|B        |0.54      |2023-03-01T17:25:00.000+0000|1  |null       |
#|C        |0.3       |2023-03-01T17:25:00.000+0000|1  |C&         |
#|D        |0.21      |2023-03-01T17:25:00.000+0000|1  |unitD&     |
#|A        |0.54      |2023-03-01T17:30:00.000+0000|2  |unitA&     |
#+---------+----------+----------------------------+---+-----------+

# per description: collect (idx, item_value) structs into a map {idx: item_value},
# then collect (description, map) structs into a single outer map
df = df.filter(col("description").isNotNull()) \
    .groupBy("description") \
    .agg(map_from_entries(collect_list(struct(col("idx"), col("item_value")))).alias("temp_desc")) \
    .groupBy(lit(1)) \
    .agg(map_from_entries(collect_list(struct(col("description"), col("temp_desc")))).alias("a")) \
    .select("a")

# a MapType column collects back to the driver as a plain Python dict
req_dict = df.collect()[0][0]
print(req_dict)

Result:

{'unitD&': {0: 0.45, 1: 0.21}, 'C&': {0: 0.34, 1: 0.3}, 'unitA&': {0: 0.25, 1: 0.3, 2: 0.54}}
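If you'd rather avoid the double groupBy, an alternative (assuming the data fits on the driver) is to collect the rows and nest the dicts in plain Python. In the sketch below, the rows list stands in for the output of df.select("description", "idx", "item_value").collect() on the sample data; with a real DataFrame you would iterate over the collected Row objects instead:

```python
# Driver-side alternative: build the nested dict in plain Python.
# The list below stands in for df.select("description", "idx", "item_value").collect()
# on the sample data above.
rows = [
    ("unitA&", 0, 0.25), (None, 0, 0.34), ("C&", 0, 0.34), ("unitD&", 0, 0.45),
    ("unitA&", 1, 0.3),  (None, 1, 0.54), ("C&", 1, 0.3),  ("unitD&", 1, 0.21),
    ("unitA&", 2, 0.54),
]

result = {}
for description, idx, item_value in rows:
    if description is None:  # mirror the isNotNull() filter on description
        continue
    result.setdefault(description, {})[idx] = item_value

print(result)
# {'unitA&': {0: 0.25, 1: 0.3, 2: 0.54}, 'C&': {0: 0.34, 1: 0.3}, 'unitD&': {0: 0.45, 1: 0.21}}
```

This trades Spark-side aggregation for a simple loop, which is fine for small results but pulls every row onto the driver.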

huangapple
  • Published 2023-03-12 08:08:40
  • Original link: https://go.coder-hub.com/75710324.html