convert dataframe to dictionary pyspark
Question
I have a table like the one below:
+---------+----------+----------------------------+---+-----------+
|item_name|item_value|timestamp                   |idx|description|
+---------+----------+----------------------------+---+-----------+
|A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
|B        |0.34      |2023-03-01T17:20:00.000+0000|0  |null       |
|C        |0.34      |2023-03-01T17:20:00.000+0000|0  |C&         |
|D        |0.45      |2023-03-01T17:20:00.000+0000|0  |unitD&     |
|A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
|B        |0.54      |2023-03-01T17:25:00.000+0000|1  |null       |
|C        |0.3       |2023-03-01T17:25:00.000+0000|1  |C&         |
|D        |0.21      |2023-03-01T17:25:00.000+0000|1  |unitD&     |
|A        |0.54      |2023-03-01T17:30:00.000+0000|2  |unitA&     |
+---------+----------+----------------------------+---+-----------+
and I want to convert it to a dict with this structure:
{'unitA&': {0: 0.25, 1: 0.3, 2: 0.54},
 'C&': {0: 0.34, 1: 0.3},
 'unitD&': {0: 0.45, 1: 0.21}}
How can I do this with PySpark?
Any help would be appreciated!
Answer 1
Score: 1
You can use the built-in Spark functions map_from_entries, collect_list, and struct to get the required result.
Example:
from pyspark.sql.functions import col, lit, map_from_entries, collect_list, struct

df.show(20, False)
# sample data
#+---------+----------+----------------------------+---+-----------+
#|item_name|item_value|timestamp                   |idx|description|
#+---------+----------+----------------------------+---+-----------+
#|A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
#|B        |0.34      |2023-03-01T17:20:00.000+0000|0  |null       |
#|C        |0.34      |2023-03-01T17:20:00.000+0000|0  |C&         |
#|D        |0.45      |2023-03-01T17:20:00.000+0000|0  |unitD&     |
#|A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
#|B        |0.54      |2023-03-01T17:25:00.000+0000|1  |null       |
#|C        |0.3       |2023-03-01T17:25:00.000+0000|1  |C&         |
#|D        |0.21      |2023-03-01T17:25:00.000+0000|1  |unitD&     |
#|A        |0.54      |2023-03-01T17:30:00.000+0000|2  |unitA&     |
#+---------+----------+----------------------------+---+-----------+

# Two aggregation steps:
# 1. per description, collect (idx, item_value) pairs into a map
# 2. collect all (description, map) pairs into a single outer map
df = df.filter(col("description").isNotNull()) \
    .groupBy("description") \
    .agg(map_from_entries(collect_list(struct(col("idx"), col("item_value")))).alias("temp_desc")) \
    .groupBy(lit(1)) \
    .agg(map_from_entries(collect_list(struct(col("description"), col("temp_desc")))).alias("a")) \
    .select("a")

# collect() materializes the MapType column as a plain Python dict on the driver
req_dict = df.collect()[0][0]
print(req_dict)
Result:
{'unitD&': {0: 0.45, 1: 0.21}, 'C&': {0: 0.34, 1: 0.3}, 'unitA&': {0: 0.25, 1: 0.3, 2: 0.54}}
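For intuition, here is a plain-Python sketch of what the two Spark aggregations compute, run on the collected rows (the `rows` list below is hypothetical sample data matching the table; this driver-side approach is only reasonable when the data fits in memory):

```python
# Rows as (item_name, item_value, idx, description) tuples.
rows = [
    ("A", 0.25, 0, "unitA&"), ("B", 0.34, 0, None), ("C", 0.34, 0, "C&"),
    ("D", 0.45, 0, "unitD&"), ("A", 0.3, 1, "unitA&"), ("B", 0.54, 1, None),
    ("C", 0.3, 1, "C&"), ("D", 0.21, 1, "unitD&"), ("A", 0.54, 2, "unitA&"),
]

result = {}
for item_name, item_value, idx, description in rows:
    if description is None:  # mirrors the isNotNull() filter
        continue
    # Group by description (outer map), then map idx -> item_value (inner map),
    # just like the nested map_from_entries(collect_list(struct(...))) calls.
    result.setdefault(description, {})[idx] = item_value

print(result)
# {'unitA&': {0: 0.25, 1: 0.3, 2: 0.54}, 'C&': {0: 0.34, 1: 0.3}, 'unitD&': {0: 0.45, 1: 0.21}}
```

The Spark version does the same grouping, but distributed: the inner `groupBy("description")` builds each per-description map on the executors, and the final `groupBy(lit(1))` merges them into one row that `collect()` brings back as a dict.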