Convert PySpark data frame to dictionary after grouping the elements in the column as key

# Question
I have the PySpark data frame below:
| ID | Value   |
|----|---------|
| 1  | value-1 |
| 1  | value-2 |
| 1  | value-3 |
| 2  | value-1 |
| 2  | value-2 |
I want to convert it into a dictionary:
```python
dict1 = {'1': ['value-1', 'value-2', 'value-3'], '2': ['value-1', 'value-2']}
```
I was able to do it (I wrote an answer below), but I need a much simpler and more efficient way that does not convert the data frame to Pandas.
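
For reproducibility, here is a minimal sketch of building this sample data frame (assuming an active `SparkSession` named `spark`; the string-typed `ID` column is an assumption made to match the string keys in the desired dictionary):

```python
# Hypothetical setup for the snippets below; column names match the table above
df_spark = spark.createDataFrame(
    [("1", "value-1"), ("1", "value-2"), ("1", "value-3"),
     ("2", "value-1"), ("2", "value-2")],
    ["ID", "Value"],
)
df_spark.show()
```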
# Answer 1

**Score**: 1

This is a simple and efficient approach using `df.groupby()` and `.to_dict()`, which will produce the same desired output:

```python
# Convert to a Pandas data frame
df_pandas = df_spark.toPandas()

dict1 = df_pandas.groupby("ID")["Value"].apply(list).to_dict()
print(dict1)
```

You can do the following if you want to avoid `.toPandas()`:

```python
from pyspark.sql.functions import collect_list

dict1 = df_spark.groupBy("ID").agg(collect_list("Value").alias("Values")).rdd.collectAsMap()
print(dict1)
```

Output:

```python
{'1': ['value-1', 'value-2', 'value-3'], '2': ['value-1', 'value-2']}
```
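
One hedge on `collect_list`: Spark does not guarantee the order of elements within each collected list after a `groupBy`. If a deterministic order matters, a minimal sketch that sorts each group's values (assuming lexicographic order is acceptable for this data, and the same `df_spark` as above):

```python
from pyspark.sql.functions import collect_list, sort_array

# Sort each group's collected values so the result is deterministic across runs
dict1 = (
    df_spark.groupBy("ID")
    .agg(sort_array(collect_list("Value")).alias("Values"))
    .rdd.collectAsMap()
)
print(dict1)
```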
# Answer 2

**Score**: 1

Maybe you can try:

```python
import pyspark.sql.functions as F

records = df.groupBy('ID').agg(F.collect_list('Value').alias('List')).collect()
dict1 = {row['ID']: row['List'] for row in records}
print(dict1)
```

Output:

```python
{1: ['value-1', 'value-2', 'value-3'], 2: ['value-1', 'value-2']}
```
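
Note that the keys in this output are integers because the collected `Row` objects keep whatever type the `ID` column has. If string keys are needed, as in the question's desired dictionary, a small tweak (a sketch assuming the same `records` list as above):

```python
# Convert each ID to a string while building the dictionary
dict1 = {str(row['ID']): row['List'] for row in records}
print(dict1)  # {'1': [...], '2': [...]}
```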
# Answer 3

**Score**: 0

I first converted the PySpark data frame to a Pandas data frame and then iterated through all the cells. The iteration is O(M*N), but the costly part is converting the PySpark data frame to Pandas.

```python
import pandas as pd

# Convert to a Pandas data frame
df_pandas = df_spark.toPandas()

# Convert the Pandas data frame to a dictionary
dict1 = dict()
for i in range(0, len(df_pandas)):
    key = df_pandas.iloc[i, 0]
    if key not in dict1:
        dict1.update({key: []})
        dict1[key].append(df_pandas.iloc[i, 1])
    else:
        dict1[key].append(df_pandas.iloc[i, 1])
```
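
The same Pandas-side loop can also be written more compactly; a sketch using `collections.defaultdict` and `itertuples` (assuming `df_pandas` from above; the costly `toPandas()` step is unchanged):

```python
from collections import defaultdict

# Group values by ID while iterating over (ID, Value) rows
grouped = defaultdict(list)
for row_id, value in df_pandas.itertuples(index=False):
    grouped[row_id].append(value)
dict1 = dict(grouped)
```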
# Answer 4

**Score**: 0

Something like this should work:

```python
import pyspark.sql.functions as F

aggregation = df.groupby("ID").agg(F.collect_list("Value").alias("Value"))
dict(aggregation.rdd.map(lambda x: (x["ID"], x["Value"])).collect())
```
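
As with the earlier `collectAsMap()` and `collect()` answers, this pulls every (ID, value-list) pair back to the driver, so it is best suited to results that fit comfortably in driver memory; the key type again follows the `ID` column's schema.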