Spark DataFrame casting string to date results in null values

Question

I get null values when I attempt to cast a string date column in a Spark DataFrame to the date type.

# Create a list of data
data = [(1, "20230517"), (2, "20230518"), (3, "20230519"), (4, "null")]

# Create a DataFrame from the list of data
df = spark.createDataFrame(data, ("id", "date"))

df.show()

df.printSchema()

root
 |-- id: long (nullable = true)
 |-- date: string (nullable = true)

# Convert the SaleDate column to datetime format
df1 = df.withColumn("date", df.date.cast('date'))
df1.select('date').show()

+--------+
|    date|
+--------+
|    null|
|    null|
|    null|
|    null|
+--------+

Answer 1

Score: 1

For this operation you should use F.to_date() and specify the format you want to parse (yyyyMMdd in your case):

F.to_date('date', format='yyyyMMdd')
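
The plain cast in the question returns null because cast('date') takes no format argument and only parses strings already written in the default yyyy-MM-dd layout, so "20230517" cannot be converted. A minimal sketch illustrating the difference, assuming the same spark session as above:

# "2023-05-17" already matches the default yyyy-MM-dd layout, so the cast succeeds;
# "20230517" does not, so the cast yields null instead of a date.
demo = spark.createDataFrame([("2023-05-17",), ("20230517",)], ("s",))
demo.withColumn("as_date", demo.s.cast("date")).show()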

Full code I used:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('spark_session').getOrCreate()

# Create a list of data
data = [(1, "20230517"), (2, "20230518"), (3, "20230519"), (4, "null")]

# Create a DataFrame from the list of data
df = spark.createDataFrame(data, ("id", "date"))

# Convert the SaleDate column to datetime format
df1 = df.withColumn("date", F.to_date('date', format='yyyyMMdd'))
df1.select('date').show()

+----------+
|      date|
+----------+
|2023-05-17|
|2023-05-18|
|2023-05-19|
|      null|
+----------+
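
The same conversion can also be written as a Spark SQL expression; a sketch assuming the df created above (the row holding the literal string "null" still cannot be parsed and stays null):

# Equivalent SQL-expression form of the to_date conversion.
df.selectExpr("id", "to_date(date, 'yyyyMMdd') AS date").show()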
