
Convert PySpark data frame to dictionary after grouping the elements in the column as key

Question


I have the following PySpark data frame:

ID Value
1 value-1
1 value-2
1 value-3
2 value-1
2 value-2

I want to convert it into a dictionary:

dict1 = {'1':['value-1','value-2','value-3'], '2':['value-1','value-2']}

I was able to do it (I wrote an answer below), but I need a much simpler and more efficient way that does not convert the data frame to Pandas.
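
For reference, a minimal sketch of how the example frame could be constructed; the variable names `spark`, `df_spark`, and `df` are assumptions chosen to match the answers below, not something given in the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the example data from the question; ID is kept as a string
# so that the desired dictionary keys '1' and '2' come out as strings.
df_spark = spark.createDataFrame(
    [("1", "value-1"), ("1", "value-2"), ("1", "value-3"),
     ("2", "value-1"), ("2", "value-2")],
    ["ID", "Value"],
)
df = df_spark  # some answers refer to the frame as `df`
df_spark.show()
```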

Answer 1

Score: 1


This is a simple and efficient approach using df.groupby() and .to_dict(), which produces the desired output.

# Convert to Pandas data frame
df_pandas = df_spark.toPandas()

dict1 = df_pandas.groupby("ID")["Value"].apply(list).to_dict()
print(dict1)

You can do the following if you want to avoid .toPandas():

from pyspark.sql.functions import collect_list

dict1 = df_spark.groupBy("ID").agg(collect_list("Value").alias("Values")).rdd.collectAsMap()
print(dict1)

Output:

{'1': ['value-1', 'value-2', 'value-3'], '2': ['value-1', 'value-2']}
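
One caveat worth adding (not part of the original answer): `collectAsMap()` keeps whatever type the `ID` column has, so if `ID` is numeric the keys come back as integers rather than the string keys shown in the question. A hedged variant that casts `ID` to a string first:

```python
from pyspark.sql.functions import col, collect_list

# Cast ID to string so the dictionary keys are '1', '2', ... as in the question.
dict1 = (
    df_spark.groupBy(col("ID").cast("string").alias("ID"))
            .agg(collect_list("Value").alias("Values"))
            .rdd.collectAsMap()
)
print(dict1)
```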

Answer 2

Score: 1

Maybe you can try:

```python
import pyspark.sql.functions as F

records = df.groupBy('ID').agg(F.collect_list('Value').alias('List')).collect()
dict1 = {row['ID']: row['List'] for row in records}
print(dict1)

# Output
# {1: ['value-1', 'value-2', 'value-3'], 2: ['value-1', 'value-2']}
```
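
A side note on this approach (my addition, not from the original answer): `collect_list` does not guarantee the order of the collected elements after a shuffle, so if the list order matters you can sort each list explicitly, for example with `sort_array`:

```python
import pyspark.sql.functions as F

# Same aggregation, but each collected list is sorted so the output is deterministic.
records = (
    df.groupBy('ID')
      .agg(F.sort_array(F.collect_list('Value')).alias('List'))
      .collect()
)
dict1 = {row['ID']: row['List'] for row in records}
print(dict1)
```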


Answer 3

Score: 0


I first converted the PySpark data frame to a Pandas data frame and then iterated through all cells. The iteration is O(M*N), but the costly part is converting the PySpark data frame to Pandas.

import pandas as pd

# Convert to Pandas data frame
df_pandas = df_spark.toPandas()

# Convert pandas data frame to dictionary
dict1 = dict()
for i in range(len(df_pandas)):
    key = df_pandas.iloc[i, 0]
    if key not in dict1:
        dict1[key] = []
    dict1[key].append(df_pandas.iloc[i, 1])
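
As a small aside (not from the original answer), the same loop can be written more compactly with `collections.defaultdict`, still assuming `df_pandas` holds the converted frame:

```python
from collections import defaultdict

# Append each Value to the list belonging to its ID, then convert back to a plain dict.
groups = defaultdict(list)
for key, value in zip(df_pandas["ID"], df_pandas["Value"]):
    groups[key].append(value)
dict1 = dict(groups)
```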

Answer 4

Score: 0


Something like this should work:

import pyspark.sql.functions as F

aggregation = df.groupby("ID").agg(F.collect_list("Value").alias("Value"))
dict(aggregation.rdd.map(lambda x: (x["ID"], x["Value"])).collect())
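
If the number of distinct IDs is very large, one possible variation (an assumption on my part, not something this answer claims) is to build the dictionary from `toLocalIterator()`, so the driver pulls the aggregated rows partition by partition instead of all at once:

```python
import pyspark.sql.functions as F

aggregation = df.groupby("ID").agg(F.collect_list("Value").alias("Value"))

# Stream the aggregated rows to the driver one partition at a time.
dict1 = {}
for row in aggregation.toLocalIterator():
    dict1[row["ID"]] = row["Value"]
print(dict1)
```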
