Decimal precision exceeds max precision despite decimal having the correct size and precision

Question

I have this code

from faker import Faker
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

spark = (
    SparkSession.builder
    .appName("pyspark-sandbox")
    .getOrCreate()
)
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")  # type: ignore

faker = Faker()
value = faker.pydecimal(left_digits=28, right_digits=10)
print(value)  # 8824750032877062776842530687.8719544506

# let Spark infer the column type, then cast it to DecimalType(38, 10)
df = spark.createDataFrame([[value]], schema=['DecimalItem'])
df = df.withColumn('DecimalItem', col('DecimalItem').cast(DecimalType(38, 10)))
df.show()

But on show() I get this error:

org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 46 exceeds max precision 38.

The value 8824750032877062776842530687.8719544506 seems to fit into DecimalType, yet it fails. What is the problem?
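
As a quick sanity check in plain Python (standard decimal module only, no Spark involved), the value itself really does have precision 38 and scale 10, so on its own it should fit DecimalType(38, 10):

from decimal import Decimal

value = Decimal("8824750032877062776842530687.8719544506")
sign, digits, exponent = value.as_tuple()
precision = len(digits)   # total significant digits: 38
scale = -exponent         # digits after the decimal point: 10
print(precision, scale)   # 38 10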

Answer 1

Score: 1

After some investigation, I found out that if you pass a StructType schema when creating the DataFrame, it works properly without issues.

from faker import Faker
from pyspark.sql import SparkSession
from pyspark.sql.types import DecimalType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("pyspark-sandbox")
    .getOrCreate()
)
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")  # type: ignore

faker = Faker()
value = faker.pydecimal(left_digits=28, right_digits=10)
print(value)  # 8824750032877062776842530687.8719544506

# pass the DecimalType(38, 10) schema explicitly instead of letting Spark infer it
df = spark.createDataFrame([[value]], schema=StructType([StructField('DecimalItem', DecimalType(38, 10), True)]))
df.show()

Not sure why there is an issue with the cast, though. I guess you can't cast to a different DecimalType once PySpark has come up with its own?

Because if I check the schema using df.printSchema in my original approach:

DecimalItem: decimal(38,18) (nullable = true)

it went with (38, 18) by default.
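
That default scale seems to be the culprit. My reading (not verified beyond the error message): the row is first materialised as the inferred decimal(38,18) and only afterwards cast to (38, 10), and rescaling a number with 28 digits before the point to 18 digits after it would need 28 + 18 = 46 significant digits, which matches the "precision 46" in the exception. Roughly, in plain Python:

from decimal import Decimal

value = Decimal("8824750032877062776842530687.8719544506")
sign, digits, exponent = value.as_tuple()

integer_digits = len(digits) + exponent   # 38 total digits - 10 fractional = 28
inferred_scale = 18                       # scale of the inferred decimal(38,18)
needed_precision = integer_digits + inferred_scale
print(needed_precision)                   # 46, above Spark's 38-digit maximum

Passing an explicit DecimalType(38, 10) skips that inference step entirely, which is presumably why the StructType version works.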
