How to iterate over 'Row' values in pyspark?

Question

I get this output when I run timeStamp_df.head() in PySpark:

Row(timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-05-03T11:30:16.900+0000)', timeStamp='ISODate(2020-04-03T11:30:16.900+0000)')

My expected output is:

+----------------------------+
|timeStamp                   |
+----------------------------+
|2020-06-03T11:30:16.900+0000|
|2020-06-03T11:30:16.900+0000|
|2020-06-03T11:30:16.900+0000|
|2020-05-03T11:30:16.900+0000|
|2020-04-03T11:30:16.900+0000|
+----------------------------+

I first tried the .collect() method, intending to iterate over the result:

rows_list = timeStamp_df.collect()
print(rows_list)

Its output is:

[Row(timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-06-03T11:30:16.900+0000)', timeStamp='ISODate(2020-05-03T11:30:16.900+0000)', timeStamp='ISODate(2020-04-03T11:30:16.900+0000)')]

To see the values, I print them with this function:

def print_row(row):
    print(row.timeStamp)


for row in rows_list:
    print_row(row)

But I get only a single line of output, because the list holds just one Row and the loop iterates once:

ISODate(2020-06-03T11:30:16.900+0000)

How can I iterate over the data of Row in pyspark?

Answer 1

Score: 2

  1. You cannot repeat keyword arguments when creating a Row.
  2. A valid Row is iterable:
from pyspark.sql import Row

row = Row(a=10, b=20, c=30)
print([column for column in row])

[10, 20, 30]
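
Applied to the data in the question, the same idea might look like the sketch below. It relies on the fact that a pyspark Row is a tuple subclass, so iterating it yields every value by position, whereas row.timeStamp looks a field up by name and appears to return only the first match. The helper names strip_isodate and row_values are hypothetical, and a plain tuple stands in for rows_list[0] so the sketch runs without Spark; removing the ISODate(...) wrapper with a regular expression is an assumption about the output the asker wants.

```python
import re

# Matches the 'ISODate(...)' wrapper seen in the question's values.
ISODATE_WRAPPER = re.compile(r"^ISODate\((.*)\)$")


def strip_isodate(value):
    """Return the text inside 'ISODate(...)', or the value unchanged."""
    match = ISODATE_WRAPPER.match(value)
    return match.group(1) if match else value


def row_values(row):
    """Collect every value of a row by position.

    A pyspark Row is a tuple subclass, so iterating it yields each value
    in field order, regardless of how the fields are named.
    """
    return [strip_isodate(value) for value in row]


# Hypothetical stand-in for rows_list[0] from the question; a plain tuple
# is used here so the sketch runs without a Spark installation.
sample_row = (
    "ISODate(2020-06-03T11:30:16.900+0000)",
    "ISODate(2020-06-03T11:30:16.900+0000)",
    "ISODate(2020-06-03T11:30:16.900+0000)",
    "ISODate(2020-05-03T11:30:16.900+0000)",
    "ISODate(2020-04-03T11:30:16.900+0000)",
)

# Print one value per line, as in the expected output.
for value in row_values(sample_row):
    print(value)
```

With the collected data from the question, the call would presumably be row_values(rows_list[0]), since the list holds a single Row with five values.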

huangapple
  • Published on July 20, 2023, 20:05:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76729670.html