PySpark: accessing the values after collect_list()

Question

I have what feels like a silly issue while using collect_list() with PySpark. I searched Stack Overflow but couldn't find an answer to my problem.

After a normal aggregation I have the following dataset:

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.appName('example').getOrCreate()
data = [{'users': '1', 'songs': 23},
        {'users': '1', 'songs': 28},
        {'users': '2', 'songs': 43},
        {'users': '2', 'songs': 63},
        {'users': '3', 'songs': 78},
        {'users': '3', 'songs': 33}]
  
# creating a dataframe
dataframe = spark.createDataFrame(data)

songs_mean = dataframe.groupBy('users').agg({'songs':'mean'}).agg(collect_list('avg(songs)')).collect()
songs_mean

#Output
[Row(collect_list(avg(songs))=[55.5, 25.5, 53.0])]

How can I access the nested list with the values? All I want is:

[55.5, 25.5, 53.0]

Thanks in advance to everybody.

Answer 1

Score: 2

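Access the list with the index [0][0]. collect() returns a list of Row objects, so the first [0] selects the only Row and the second [0] selects that Row's first (and only) field, which is the list itself.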

Example:

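from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName('example').getOrCreate()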
data = [{'users': '1', 'songs': 23},
        {'users': '1', 'songs': 28},
        {'users': '2', 'songs': 43},
        {'users': '2', 'songs': 63},
        {'users': '3', 'songs': 78},
        {'users': '3', 'songs': 33}]
  
# creating a dataframe
dataframe = spark.createDataFrame(data)

songs_mean = dataframe.groupBy('users').agg({'songs':'mean'}).agg(collect_list('avg(songs)')).collect()
print(songs_mean[0][0])
# e.g. [25.5, 55.5, 53.0] (the element order is not guaranteed after groupBy)
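
To see why the double index works without starting a Spark session, here is a minimal pure-Python sketch. It is not PySpark code: the `Row` below is a hypothetical stand-in built with `namedtuple`, relying on the fact that a PySpark `Row` is a tuple subclass, and the field name `song_means` is made up for illustration. The grouping loop only mimics what `groupBy('users').agg({'songs': 'mean'})` computes.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row (assumption: like the real Row,
# it is a tuple subclass, so positional indexing behaves the same way).
Row = namedtuple("Row", ["song_means"])

data = [{'users': '1', 'songs': 23},
        {'users': '1', 'songs': 28},
        {'users': '2', 'songs': 43},
        {'users': '2', 'songs': 63},
        {'users': '3', 'songs': 78},
        {'users': '3', 'songs': 33}]

# Group the song values by user and average each group.
songs_by_user = {}
for record in data:
    songs_by_user.setdefault(record['users'], []).append(record['songs'])
means = [sum(songs) / len(songs) for songs in songs_by_user.values()]

# collect() hands back a list of rows; here it is one row with one field.
songs_mean = [Row(song_means=means)]

values = songs_mean[0][0]               # first row, then first field
same_values = songs_mean[0].song_means  # same list, reached by field name

print(sorted(values))  # [25.5, 53.0, 55.5]
```

In real PySpark the same name-based access should be available on the collected Row, for example `songs_mean[0]['collect_list(avg(songs))']`, or a shorter name if the column is given one with `.alias(...)`. Sorting the result is one way to get a stable order, since the order produced by collect_list() after a groupBy is not deterministic.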

huangapple
  • Posted on March 23, 2023, 10:04:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75818714.html