Numpy vectorization messes up data type
Question
When using pandas dataframes, it's a common situation to create a column B with the information in column A.
Background
In some cases, it's possible to do this in one go (df['B'] = df['A'] + 4), but in others, the operation is more complex and a separate function is written. In that case, this function can be applied in one of two ways (that I know of):
import numpy as np
import pandas as pd

def calc_b(a):
    return a + 4

df = pd.DataFrame({'A': np.random.randint(0, 50, 5)})
df['B1'] = df['A'].apply(lambda x: calc_b(x))
df['B2'] = np.vectorize(calc_b)(df['A'])
The resulting dataframe:
A B1 B2
0 17 21 21
1 25 29 29
2 6 10 10
3 21 25 25
4 14 18 18
Perfect - both ways have the correct result. In my code, I've been using the np.vectorize way, as .apply is slow and considered bad practice.
Now comes my problem
This method seems to be breaking down when working with datetimes / timestamps. A minimal working example is this:
def is_past_midmonth(dt):
    return dt.day > 15

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=7)})
df['past_midmonth1'] = df['date'].apply(lambda x: is_past_midmonth(x))
df['past_midmonth2'] = np.vectorize(is_past_midmonth)(df['date'])
The .apply way works; the resulting dataframe is
date past_midmonth1
0 2020-01-01 False
1 2020-01-07 False
2 2020-01-13 False
3 2020-01-19 True
4 2020-01-25 True
5 2020-01-31 True
6 2020-02-06 False
But the np.vectorize way fails with an AttributeError: 'numpy.datetime64' object has no attribute 'day'.
Digging a bit with type(), the elements of df['date'] are instances of <class 'pandas._libs.tslibs.timestamps.Timestamp'>, which is also how the function receives them when called through .apply. In the vectorized function, however, they are received as instances of <class 'numpy.datetime64'>, which then causes the error.
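The type difference can be reproduced without np.vectorize at all. The following is a minimal sketch, assuming (as the error message suggests) that the values reach the function after the column has been converted to a plain NumPy array; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range('2020-01-01', freq='6D', periods=7))

# Element taken from the pandas Series: a pandas Timestamp, which has .day
first_from_series = dates.iloc[0]

# Element taken from the Series after conversion to a NumPy array:
# a numpy.datetime64 scalar, which has no .day attribute
first_from_array = np.asarray(dates).flat[0]

print(type(first_from_series))
print(type(first_from_array))
print(hasattr(first_from_array, 'day'))
```

Any function that relies on Timestamp attributes such as .day will therefore fail as soon as it is handed the NumPy scalar.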
I have two questions:
- Is there a way to 'fix' this behaviour of np.vectorize? How?
- How can I avoid these kinds of incompatibilities in general?
Of course I can make a mental note not to use np.vectorize on functions that take datetime arguments, but that is cumbersome. I'd like a solution that always works so I don't have to think about it whenever I encounter this situation.
As stated, this is a minimal working example that demonstrates the problem. I know I could use easier, all-column-at-once operations in this case, exactly as I could in the first example with the int column. But that's beside the point here; I'm interested in the general case of vectorizing any function that takes timestamp arguments. For those asking about a more concrete/complicated example, I've created one here.
Edit: I was wondering if using type hinting would make a difference - if numpy would actually take this information into account - but I doubt it, as using this signature def is_past_midmonth(dt: float) -> bool:, where float is obviously wrong, gives the same error. I'm pretty new to type hinting though, and I don't have an IDE that supports it, so it's a bit hard for me to debug.
Many thanks!
Answer 1
Score: 3
Have you considered passing the day as int instead of the datetime64[ns]?
import pandas as pd
import numpy as np

# I'd avoid the name dt, as it's commonly used as an alias for datetime
def is_past_midmonth1(d):
    return d.day > 15

def is_past_midmonth2(day):
    return day > 15

N = int(1e4)
df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=N)})
Apply (using datetime)
%%time
df['past_midmonth1'] = df['date'].apply(lambda x: is_past_midmonth1(x))
CPU times: user 55.4 ms, sys: 0 ns, total: 55.4 ms
Wall time: 53.8 ms
Apply (using int)
%%time
df['past_midmonth2'] = (df['date'].dt.day).apply(lambda x: is_past_midmonth2(x))
CPU times: user 4.71 ms, sys: 0 ns, total: 4.71 ms
Wall time: 4.16 ms
np.vectorize
%%time
df['past_midmonth2_vec'] = np.vectorize(is_past_midmonth2)(df['date'].dt.day)
CPU times: user 4.2 ms, sys: 75 µs, total: 4.27 ms
Wall time: 3.49 ms
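As a side note on why np.vectorize lands so close to .apply here: NumPy's documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, not a performance tool. The sketch below illustrates this by recording every call to the Python function; the call_log list is a hypothetical helper added only for this demonstration.

```python
import numpy as np

call_log = []  # hypothetical helper: records each value the function is called with

def is_past_midmonth2(day):
    call_log.append(day)
    return day > 15

days = np.array([1, 7, 13, 19, 25, 31, 6])
result = np.vectorize(is_past_midmonth2)(days)

print(result)
# The Python function runs at least once per element
print(len(call_log), "calls for", len(days), "elements")
```

The Python function is still invoked for each element, so most of the speed-up in the timings above likely comes from extracting the integer day with .dt.day first rather than from np.vectorize itself.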
Vectorizing your code
%%time
df['past_midmonth3'] = df['date'].dt.day > 15
CPU times: user 3.1 ms, sys: 11 µs, total: 3.11 ms
Wall time: 2.41 ms
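The same column-at-once style often extends to more involved conditions by combining several .dt components with boolean operators. This is only a sketch: the rule "past mid-month and on a weekday" and the column name late_weekday are invented for illustration and are not part of the question.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=7)})

# Each .dt accessor returns a whole column, so no per-row Python call is needed
past_midmonth = df['date'].dt.day > 15
is_weekday = df['date'].dt.dayofweek < 5  # Monday=0 ... Sunday=6

df['late_weekday'] = past_midmonth & is_weekday
print(df)
```

Of the seven dates, only 2020-01-31 (a Friday) satisfies both conditions; 2020-01-19 and 2020-01-25 fall on a weekend.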
Answer 2
Score: 0
I'll write this as an Answer, though I feel it's barely a Workaround; so please add your answer if you have one that's better.
By forcing the incoming dt argument into a pandas datetime object with dt = pd.to_datetime(dt), it works.
def is_past_midmonth(dt):
    dt = pd.to_datetime(dt)  # the only addition
    return dt.day > 15

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=7)})
df['past_midmonth1'] = df['date'].apply(lambda x: is_past_midmonth(x))
df['past_midmonth2'] = np.vectorize(is_past_midmonth)(df['date'])  # this now works
In[45]: df
Out[45]:
date past_midmonth1 past_midmonth2
0 2020-01-01 False False
1 2020-01-07 False False
2 2020-01-13 False False
3 2020-01-19 True True
4 2020-01-25 True True
5 2020-01-31 True True
6 2020-02-06 False False
For those interested - execution time is about halved (for a longer dataframe).
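If the conversion should not be repeated inside every function, one possible variation is to move it into a small wrapper. This is a sketch only: vectorize_on_timestamps is a hypothetical helper name, not a NumPy or pandas function, and it uses pd.Timestamp(value) in place of pd.to_datetime for the single-value conversion.

```python
import numpy as np
import pandas as pd

def vectorize_on_timestamps(func):
    """Hypothetical helper: vectorize func so it always receives pd.Timestamp."""
    def wrapper(value):
        # Convert whatever arrives (numpy.datetime64 or Timestamp) to a Timestamp
        return func(pd.Timestamp(value))
    return np.vectorize(wrapper)

def is_past_midmonth(dt):
    return dt.day > 15  # unchanged, no conversion needed here

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=7)})
df['past_midmonth2'] = vectorize_on_timestamps(is_past_midmonth)(df['date'])
print(df)
```

The function body stays free of conversion code, at the cost of one extra Python call per element.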