pandas: loss of precision when specifying seconds as floating point


Question

I need to create a datetime index with 5000 elements, an unknown offset, and an unknown delta between the elements. The delta value and the offset are parameters, and the only certainty is that they will be expressed in seconds as an integer or a floating-point number.

I use pd.Timedelta(value, "s") to compute this delta (since np.timedelta64() does not accept floating-point values).

pd.to_datetime(1687957943.122, unit="s") + np.arange(0, 5000) * pd.Timedelta(0.002, "s")

Unfortunately, the floating-point arithmetic causes a loss of precision (the following values are not exactly 0.002 seconds apart):

> array(['2023-06-28T13:12:23.121999872', '2023-06-28T13:12:23.123999872',
'2023-06-28T13:12:23.125999872', ...,
'2023-06-28T13:12:33.115999872', '2023-06-28T13:12:33.117999872',
'2023-06-28T13:12:33.119999872'], dtype='datetime64[ns]')

Compare:

# offset manually upgraded to integer number and unit specified as ms
pd.to_datetime(1687957943122, unit="ms") + np.arange(0, 5000) * pd.Timedelta(0.002, "s")

This gets me the desired result:

> array(['2023-06-28T13:12:23.122000000', '2023-06-28T13:12:23.124000000',
'2023-06-28T13:12:23.126000000', ...,
'2023-06-28T13:12:33.116000000', '2023-06-28T13:12:33.118000000',
'2023-06-28T13:12:33.120000000'], dtype='datetime64[ns]')

However, since I don't know the time precision of the offset, I cannot simply do this.

I could probably write some code to determine the correct unit, but it feels like this should already be built-in functionality. Any clues? +1 if I don't need pandas at all.
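
One way to make the "determine the correct unit" idea concrete (a minimal sketch, not built-in functionality; the microsecond-precision assumption and the variable names are illustrative): round the float parameters to integers at a fixed resolution once, then stay in exact integer arithmetic, which also needs only numpy:

import numpy as np

offset_s, delta_s, n = 1687957943.122, 0.002, 5000   # example parameters

# assumption: the inputs carry at most microsecond precision, so rounding is safe
offset_us = round(offset_s * 1e6)   # offset as integer microseconds
delta_ns = round(delta_s * 1e9)     # delta as integer nanoseconds

index = (np.datetime64(offset_us, "us").astype("datetime64[ns]")
         + np.arange(n) * np.timedelta64(delta_ns, "ns"))
# index[:2] should give '2023-06-28T13:12:23.122000000', '2023-06-28T13:12:23.124000000'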

Answer 1

Score: 1

So the problem already starts at:

g = pd.to_datetime(1687957943.122, unit="s")
g.microsecond  # == 121999

You need to use the pd.Timestamp.fromtimestamp() function to avoid such behaviour:

g = pd.Timestamp.fromtimestamp(1687957943.122)
g.microsecond   # == 122000

As for the solution that goes through the standard-library datetime instead (pandas is only needed for the final conversion back):

from datetime import datetime

g = datetime.fromtimestamp(1687957943.122)
g = pd.to_datetime(g)
g.microsecond  # == 122000

I do wonder how it is done behind the scenes, but this answers the main question.
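
As for the behind-the-scenes part, my reading (not stated in the answer) is that 1687957943.122 has no exact binary float representation, and pd.to_datetime(..., unit="s") scales that float to nanoseconds, where the representation error becomes visible, whereas fromtimestamp() rounds to whole microseconds first. A quick check:

import pandas as pd

f"{1687957943.122:.9f}"  # '1687957943.121999979', the float itself already sits just below .122

pd.to_datetime(1687957943.122, unit="s").nanosecond    # 872, the error shows up at ns resolution
pd.Timestamp.fromtimestamp(1687957943.122).nanosecond  # 0, rounded to whole microseconds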
