Convert PySpark data frame to dictionary after grouping the elements in the column as key

# Question
I have the PySpark data frame below:
| ID | Value   |
|----|---------|
| 1  | value-1 |
| 1  | value-2 |
| 1  | value-3 |
| 2  | value-1 |
| 2  | value-2 |
I want to convert it into a dictionary:
```python
dict1 = {'1': ['value-1', 'value-2', 'value-3'], '2': ['value-1', 'value-2']}
```
I was able to do it (I wrote an answer below), but I need a much simpler and more efficient way that does not convert the data frame to Pandas.
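
For reproducibility, here is a minimal sketch of building this sample data frame (assuming an active `SparkSession` named `spark`; the string-typed `ID` column is an assumption made to match the string keys in the desired dictionary):

```python
# Hypothetical setup for the snippets below; column names match the table above
df_spark = spark.createDataFrame(
    [("1", "value-1"), ("1", "value-2"), ("1", "value-3"),
     ("2", "value-1"), ("2", "value-2")],
    ["ID", "Value"],
)
df_spark.show()
```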
# Answer 1

**Score**: 1

This is a simple and efficient approach using `df.groupby()` and `.to_dict()`, which will produce the same desired output:

```python
# Convert to a Pandas data frame
df_pandas = df_spark.toPandas()

dict1 = df_pandas.groupby("ID")["Value"].apply(list).to_dict()
print(dict1)
```

You can do the following if you want to avoid `.toPandas()`:

```python
from pyspark.sql.functions import collect_list

dict1 = df_spark.groupBy("ID").agg(collect_list("Value").alias("Values")).rdd.collectAsMap()
print(dict1)
```

Output:

```python
{'1': ['value-1', 'value-2', 'value-3'], '2': ['value-1', 'value-2']}
```
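
One hedge on `collect_list`: Spark does not guarantee the order of elements within each collected list after a `groupBy`. If a deterministic order matters, a minimal sketch that sorts each group's values (assuming lexicographic order is acceptable for this data, and the same `df_spark` as above):

```python
from pyspark.sql.functions import collect_list, sort_array

# Sort each group's collected values so the result is deterministic across runs
dict1 = (
    df_spark.groupBy("ID")
    .agg(sort_array(collect_list("Value")).alias("Values"))
    .rdd.collectAsMap()
)
print(dict1)
```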
# Answer 2

**Score**: 1

Maybe you can try:

```python
import pyspark.sql.functions as F

records = df.groupBy('ID').agg(F.collect_list('Value').alias('List')).collect()
dict1 = {row['ID']: row['List'] for row in records}
print(dict1)
```

Output:

```python
{1: ['value-1', 'value-2', 'value-3'], 2: ['value-1', 'value-2']}
```
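
Note that the keys in this output are integers because the collected `Row` objects keep whatever type the `ID` column has. If string keys are needed, as in the question's desired dictionary, a small tweak (a sketch assuming the same `records` list as above):

```python
# Convert each ID to a string while building the dictionary
dict1 = {str(row['ID']): row['List'] for row in records}
print(dict1)  # {'1': [...], '2': [...]}
```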
# Answer 3

**Score**: 0

I first converted the PySpark data frame to a Pandas data frame and then iterated through all the cells. The iteration is O(M*N), but the costly part is converting the PySpark data frame to Pandas.

```python
import pandas as pd

# Convert to a Pandas data frame
df_pandas = df_spark.toPandas()

# Convert the Pandas data frame to a dictionary
dict1 = dict()
for i in range(0, len(df_pandas)):
    key = df_pandas.iloc[i, 0]
    if key not in dict1:
        dict1.update({key: []})
        dict1[key].append(df_pandas.iloc[i, 1])
    else:
        dict1[key].append(df_pandas.iloc[i, 1])
```
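
The same Pandas-side loop can also be written more compactly; a sketch using `collections.defaultdict` and `itertuples` (assuming `df_pandas` from above; the costly `toPandas()` step is unchanged):

```python
from collections import defaultdict

# Group values by ID while iterating over (ID, Value) rows
grouped = defaultdict(list)
for row_id, value in df_pandas.itertuples(index=False):
    grouped[row_id].append(value)
dict1 = dict(grouped)
```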
# Answer 4

**Score**: 0

Something like this should work:

```python
import pyspark.sql.functions as F

aggregation = df.groupby("ID").agg(F.collect_list("Value").alias("Value"))
dict(aggregation.rdd.map(lambda x: (x["ID"], x["Value"])).collect())
```
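
As with the earlier `collectAsMap()` and `collect()` answers, this pulls every (ID, value-list) pair back to the driver, so it is best suited to results that fit comfortably in driver memory; the key type again follows the `ID` column's schema.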