Pyspark Table Name with Timestamp

Question

I'm working in Databricks, coding in PySpark. I'm having trouble renaming an existing table and adding a timestamp to its name. The table is located at mydatabase.tableOne, and I want to save it as another table called mydatabase.tableOne_20230625. This will allow the new process to run and create a new version of mydatabase.tableOne.

df_my_loc = 'mydatabase.tableOne'
timestamp_suffix = date_format(current_timestamp(), 'yyyyMMdd')
df_my_loc_new = df_my_loc + '_' + timestamp_suffix
df_arch = spark.table(df_my_loc)
df_arch.write.format("delta").mode("ignore").saveAsTable(df_my_loc_new)

I get an error that says the column is not iterable. It seems like adding the timestamp as a suffix is giving me the error. Before running a new process, the goal is to archive the previous table with a timestamp and run a notebook for the new output where the table names do not change.

Answer 1

Score: 0

date_format and current_timestamp are Spark SQL functions. When called from Python, as in your example, they return a Column, not a string. This line:

df_my_loc_new = df_my_loc + '_' + timestamp_suffix

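is a Python expression that is meant to concatenate plain strings. Adding a Python str to a Column, even a Column of string type, does not produce a Python string, so the result cannot be passed to saveAsTable as a table name. You can either:

  • concatenate Python strings using the + operator
  • or concatenate SQL string columns using the concat function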

Since building a table name is pure Python, you just need to do:

from datetime import datetime
timestamp_suffix = datetime.now().strftime('%Y%m%d')

This yields timestamp_suffix as a plain Python string in YYYYMMDD format (for example 20230625, matching the table name in the question), and your code will work.
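Putting the pieces together, here is a minimal sketch of the archive step. The helper names build_archive_name and archive_table are hypothetical (they are not part of PySpark), and the sketch assumes spark is the active SparkSession, as it is predefined in Databricks notebooks. The write call is the one from the question, unchanged.

```python
from datetime import date, datetime
from typing import Optional


def build_archive_name(table_name: str, when: Optional[date] = None) -> str:
    """Return table_name with a _YYYYMMDD suffix built from `when` (default: now)."""
    when = when or datetime.now()
    # strftime returns a plain Python str, so ordinary string formatting works
    return f"{table_name}_{when.strftime('%Y%m%d')}"


def archive_table(spark, table_name: str, when: Optional[date] = None) -> str:
    """Copy table_name to its dated archive name and return that name.

    `spark` is assumed to be an active SparkSession (hypothetical helper).
    """
    target = build_archive_name(table_name, when)
    # Same write as in the question, but the target is now a real string
    spark.table(table_name).write.format("delta").mode("ignore").saveAsTable(target)
    return target


print(build_archive_name("mydatabase.tableOne", date(2023, 6, 25)))
# prints mydatabase.tableOne_20230625
```

In a notebook this would be called as archive_table(spark, 'mydatabase.tableOne') before the new process runs. Keeping the name construction in its own function makes it easy to check without a cluster.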

huangapple
  • Posted on June 26, 2023, 01:07:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76551572.html