Numpy vectorization messes up data type

Question

When using pandas dataframes, it's a common situation to create a column B with the information in column A.

Background

In some cases, it's possible to do this in one go (df['B'] = df['A'] + 4), but in others, the operation is more complex and a separate function is written. In that case, this function can be applied in one of two ways (that I know of):

import numpy as np
import pandas as pd

def calc_b(a):
    return a + 4

df = pd.DataFrame({'A': np.random.randint(0, 50, 5)})
df['B1'] = df['A'].apply(lambda x: calc_b(x))
df['B2'] = np.vectorize(calc_b)(df['A'])

The resulting dataframe:

    A  B1  B2
0  17  21  21
1  25  29  29
2   6  10  10
3  21  25  25
4  14  18  18

Perfect - both ways have the correct result. In my code, I've been using the np.vectorize way, as .apply is slow and considered bad practise.

Now comes my problem

This method seems to be breaking down when working with datetimes / timestamps. A minimal working example is this:

def is_past_midmonth(dt):
    return (dt.day > 15)

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=7)})
df['past_midmonth1'] = df['date'].apply(lambda x: is_past_midmonth(x))
df['past_midmonth2'] = np.vectorize(is_past_midmonth)(df['date'])

The .apply way works; the resulting dataframe is

        date  past_midmonth1
0 2020-01-01           False
1 2020-01-07           False
2 2020-01-13           False
3 2020-01-19            True
4 2020-01-25            True
5 2020-01-31            True
6 2020-02-06           False

But the np.vectorize way fails with an AttributeError: 'numpy.datetime64' object has no attribute 'day'.

Digging a bit with type(), the elements of df['date'] are instances of <class 'pandas._libs.tslibs.timestamps.Timestamp'>, which is also how the function receives them when called through .apply. In the vectorized function, however, they are received as instances of <class 'numpy.datetime64'>, which then causes the error.
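The type difference described above can be reproduced without np.vectorize at all. The following is a minimal sketch, assuming only that the column is converted to a NumPy array before its elements are handed out; the variable names from_series and from_array are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=7)})

# Element as pandas hands it out when indexing the Series
from_series = df['date'].iloc[0]

# Element after the column has been converted to a plain NumPy array
from_array = np.asarray(df['date'])[0]

print(type(from_series))
print(type(from_array))

# Only the pandas object carries the .day attribute
print(hasattr(from_series, 'day'), hasattr(from_array, 'day'))
```

The pandas Timestamp exposes calendar attributes such as .day, while the NumPy scalar does not, which matches the AttributeError quoted above.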

I have two questions:

  • Is there a way to 'fix' this behaviour of np.vectorize? How?
  • How can I avoid these kinds of incompatibilities in general?

Of course I can make a mental note to not use np.vectorize functions that take datetime arguments, but that is cumbersome. I'd like a solution that always works so I don't have to think about it whenever I encounter this situation.

As stated, this is a minimal working example that demonstrates the problem. I know I could use easier, all-column-at-once operations in this case, exactly as I could in the first example with the int column. But that's beside the point here; I'm interested in the general case of vectorizing any function that takes timestamp arguments. For those asking about a more concrete/complicated example, I've created one here.

Edit: I was wondering if using type hinting would make a difference - if numpy would actually take this information into account - but I doubt it, as using this signature def is_past_midmonth(dt: float) -> bool:, where float is obviously wrong, gives the same error. I'm pretty new to type hinting though, and I don't have an IDE that supports it, so it's a bit hard for me to debug.

Many thanks!

Answer 1

Score: 3

Have you considered passing the day as an int instead of the datetime64[ns]?

import pandas as pd
import numpy as np

# I'd avoid using dt as a name, as it's commonly used as an alias for datetime
def is_past_midmonth1(d):
    return (d.day > 15)

def is_past_midmonth2(day):
    return (day > 15)

N = int(1e4)
df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D',
                                         periods=N)})

Apply (using datetime)

%%time
df['past_midmonth1'] = df['date'].apply(lambda x: is_past_midmonth1(x))

CPU times: user 55.4 ms, sys: 0 ns, total: 55.4 ms
Wall time: 53.8 ms

Apply (using int)

%%time
df['past_midmonth2'] = (df['date'].dt.day).apply(lambda x: is_past_midmonth2(x))

CPU times: user 4.71 ms, sys: 0 ns, total: 4.71 ms
Wall time: 4.16 ms

np.vectorize

%%time
df['past_midmonth2_vec'] = np.vectorize(is_past_midmonth2)(df['date'].dt.day)

CPU times: user 4.2 ms, sys: 75 µs, total: 4.27 ms
Wall time: 3.49 ms

Vectorizing your code

%%time
df['past_midmonth3'] = df['date'].dt.day > 15

CPU times: user 3.1 ms, sys: 11 µs, total: 3.11 ms
Wall time: 2.41 ms
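Before comparing timings, it can be worth confirming that the three integer-based variants agree. This is a small sketch of such a check and not part of the original answer; the names via_apply, via_vectorize and via_columns are invented here, and a shorter frame is used to keep it quick.

```python
import numpy as np
import pandas as pd

def is_past_midmonth2(day):
    return day > 15

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=100)})

# Extract the integer day once, using the .dt accessor
days = df['date'].dt.day

via_apply = days.apply(is_past_midmonth2)              # element-wise through pandas
via_vectorize = np.vectorize(is_past_midmonth2)(days)  # element-wise through NumPy
via_columns = days > 15                                # whole column at once

# Compare as plain Python lists so dtype and index details do not matter
print(via_apply.tolist() == via_columns.tolist())
print(via_vectorize.tolist() == via_columns.tolist())
```

All three receive plain integers rather than datetime values, so the Timestamp versus numpy.datetime64 mismatch never comes into play.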

Timing

[Figure: timing comparison; image not available]

Answer 2

Score: 0

I'll write this as an answer, though I feel it's barely a workaround, so please add your own answer if you have a better one.

By forcing the incoming dt argument into a pandas datetime object with dt = pd.to_datetime(dt), it works.
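To see what that one added line does to a single value, here is a minimal sketch. The date literal is arbitrary, and raw and ts are illustrative names.

```python
import numpy as np
import pandas as pd

# A NumPy scalar of the kind named in the AttributeError
raw = np.datetime64('2020-01-19')

# pd.to_datetime turns the scalar into a pandas Timestamp
ts = pd.to_datetime(raw)

print(type(ts).__name__, ts.day)
```

Once the value is a Timestamp, attribute access such as .day behaves the same way it does under .apply.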

def is_past_midmonth(dt):
    dt = pd.to_datetime(dt)  # the only addition
    return (dt.day > 15)

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=7)})
df['past_midmonth1'] = df['date'].apply(lambda x: is_past_midmonth(x))
df['past_midmonth2'] = np.vectorize(is_past_midmonth)(df['date'])  # this now works
In[45]: df
Out[45]: 
        date  past_midmonth1  past_midmonth2
0 2020-01-01           False           False
1 2020-01-07           False           False
2 2020-01-13           False           False
3 2020-01-19            True            True
4 2020-01-25            True            True
5 2020-01-31            True            True
6 2020-02-06           False           False

For those interested - execution time is about halved (for a longer dataframe).
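A possible variation, which is not from the answer above and should be read as a sketch: instead of converting inside the function, the column can be cast to object dtype before it reaches np.vectorize. This assumes that astype(object) keeps each element as a pandas Timestamp; the name as_objects is made up for illustration.

```python
import numpy as np
import pandas as pd

def is_past_midmonth(ts):
    # Unchanged from the question: relies on the Timestamp attribute .day
    return ts.day > 15

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=7)})

# Assumption: an object-dtype column holds pandas Timestamp objects,
# so NumPy passes them through without turning them into datetime64
as_objects = df['date'].astype(object)

df['past_midmonth2'] = np.vectorize(is_past_midmonth)(as_objects)
print(df)
```

This leaves the function body untouched, at the cost of an extra object-dtype copy of the column. Whether it is faster or slower than the pd.to_datetime approach has not been measured here.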

huangapple
  • Posted on January 3, 2020 at 19:03:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59577442.html