Pyspark access the values after collect_list()
Question
I have what I feel is a silly issue while using collect_list() with pyspark. I searched StackOverflow but I couldn't find the answer to my problem.
After a normal aggregation I have the following dataset:
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.appName('example').getOrCreate()
data = [{'users': '1', 'songs': 23},
        {'users': '1', 'songs': 28},
        {'users': '2', 'songs': 43},
        {'users': '2', 'songs': 63},
        {'users': '3', 'songs': 78},
        {'users': '3', 'songs': 33}]
# creating a dataframe
dataframe = spark.createDataFrame(data)
songs_mean = dataframe.groupBy('users').agg({'songs': 'mean'}).agg(collect_list('avg(songs)')).collect()
songs_mean
# Output
[Row(collect_list(avg(songs))=[55.5, 25.5, 53.0])]
How can I access the nested list of values? All I want is:
[55.5, 25.5, 53.0]
Thanks in advance to everybody.
Answer 1
Score: 2
Access the list by indexing with [0][0]: collect() returns a list of Row objects, so songs_mean[0] is the single Row and the second [0] is its only column, the collected list.
Example:
from pyspark.sql import *
from pyspark.sql.functions import *

spark = SparkSession.builder.appName('example').getOrCreate()

data = [{'users': '1', 'songs': 23},
        {'users': '1', 'songs': 28},
        {'users': '2', 'songs': 43},
        {'users': '2', 'songs': 63},
        {'users': '3', 'songs': 78},
        {'users': '3', 'songs': 33}]

# creating a dataframe
dataframe = spark.createDataFrame(data)

# collect() returns [Row(collect_list(avg(songs))=[...])]:
# the first [0] selects the single Row, the second [0] selects its only column
songs_mean = dataframe.groupBy('users').agg({'songs': 'mean'}).agg(collect_list('avg(songs)')).collect()
print(songs_mean[0][0])
# [25.5, 55.5, 53.0]
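If the extra collect_list() aggregation is only there to flatten the result, a common alternative is to collect the per-user means directly and extract the values on the driver. The following is a minimal sketch, assuming the same dataframe as above; 'avg(songs)' is the default column name Spark assigns to agg({'songs': 'mean'}), and note that the row order after groupBy is not guaranteed:

# Collect one Row per user, then read the mean by column name
rows = dataframe.groupBy('users').agg({'songs': 'mean'}).collect()
means = [row['avg(songs)'] for row in rows]
print(means)
# e.g. [25.5, 55.5, 53.0] (order may differ between runs)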
Comments