How to iterate over 'Row' values in PySpark?
Question
To iterate over the data of Row objects in PySpark, you can collect the DataFrame into a local list and loop over it:
# Import the required library
from pyspark.sql import SparkSession
# Create the Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Assume your data is in a DataFrame
timeStamp_df = ...
# Define print_row to print the timeStamp value of a row
def print_row(row):
    print(row.timeStamp)
# Use collect() to gather the DataFrame's rows into a local list
rows_list = timeStamp_df.collect()
# Iterate over every row and print its timeStamp value
for row in rows_list:
    print_row(row)
This loops over each row of the DataFrame and prints its timeStamp value.
I get this output when I call timeStamp_df.head() in PySpark:
Row(timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-05-03T11:30:16.900+0000)', timeStamp='ISODate(2020-04-03T11:30:16.900+0000)')
My expected output is:
+----------------------------+
|timeStamp                   |
+----------------------------+
|2020-06-03T11:30:16.900+0000|
|2020-06-03T11:30:16.900+0000|
|2020-06-03T11:30:16.900+0000|
|2020-05-03T11:30:16.900+0000|
|2020-04-03T11:30:16.900+0000|
+----------------------------+
I first tried the .collect() method, intending to iterate over the result:
rows_list = timeStamp_df.collect()
print(rows_list)
Its output is:
[Row(timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-05-03T11:30:16.900+0000)', timeStamp='ISODate(2020-04-03T11:30:16.900+0000)')]
Just to see the values I am using the print statement:
def print_row(row):
    print(row.timeStamp)
for row in rows_list:
    print_row(row)
But I get only a single line of output, because the loop iterates over the list just once:
ISODate(2020-06-03T11:30:16.900+0000)
How can I iterate over the data of a Row in PySpark?
Answer 1
Score: 2
- You cannot repeat keyword arguments when creating a Row.
- A valid Row is iterable:
from pyspark.sql import Row

row = Row(a=10, b=20, c=30)
print([column for column in row])
# [10, 20, 30]
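Building on the point that a Row is iterable, here is a minimal sketch of how the values of one wide row could be turned into one record per value, as in the question's expected output. Since Row is a tuple subclass, a plain tuple stands in for the collected row so the snippet runs without a Spark session. The helper names strip_isodate and row_to_records are made up for illustration, and the sample strings are copied from the question.

```python
import re

# Hypothetical helper: pull the timestamp out of a string such as
# 'ISODate(2020-06-03T11:30:16.900+0000)'. Strings without the
# wrapper are returned unchanged.
_ISODATE = re.compile(r"^ISODate\((.*)\)$")


def strip_isodate(value):
    match = _ISODATE.match(value)
    return match.group(1) if match else value


def row_to_records(row):
    """Turn one wide row into a list of single-column records."""
    # Iterating yields the values by position, so field names are not used.
    return [(strip_isodate(value),) for value in row]


# Assumption: a plain tuple stands in for the single collected Row.
wide_row = (
    "ISODate(2020-06-03T11:30:16.900+0000)",
    "ISODate(2020-05-03T11:30:16.900+0000)",
    "ISODate(2020-04-03T11:30:16.900+0000)",
)

records = row_to_records(wide_row)
for (timestamp,) in records:
    print(timestamp)
```

With a live session, passing the result to spark.createDataFrame(records, ["timeStamp"]) and calling .show(truncate=False) should give a one-column table. Reading values by position rather than by name sidesteps the repeated timeStamp field, which is why row.timeStamp in the question returned only one value.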