Numpy vectorization messes up data type (2)

Question

I'm having unwanted behaviour come out of np.vectorize, namely, it changes the datatype of the argument going into the original function. My original question is about the general case, and I'll use this new question to ask a more specific case.

(Why this second question? I've created this question about a more specific case in order to illustrate the problem - it's always easier to go from the specific to the more general. And I've created this question separately, because I think it's useful to keep the general case, as well as a general answer to it (should one be found), by themselves and not 'contaminated' with thinking about solving any particular problem.)

So, a concrete example. Where I live, Wednesday is Lottery Day. So, let's start with a pandas dataframe with a date column with all Wednesdays this year:

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})

I want to see which of these possible days I'll actually play on. I don't feel particularly lucky at the beginning and end of each month, and there are some months I feel especially unlucky about. Therefore I use this function to see if a date qualifies:

def qualifies(dt, excluded_months=[]):
    # Date qualifies if...
    # - it's on or after the 5th of the month; and
    # - at least 5 days remain till the end of the month (incl. date itself); and
    # - it's not in one of the months in excluded_months.
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

I hope you realise that this example is still somewhat contrived, but it's closer to what I'm trying to do. I try to apply this function in two ways:

df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))
df['qualifies2'] = np.vectorize(qualifies, excluded=[1])(df['date'], [3, 8])

As far as I know, both should work, and I'd prefer the latter, as the former is slow and frowned upon. Edit: I've learned that also the first is frowned upon lol.

However, only the first one succeeds; the second one fails with AttributeError: 'numpy.datetime64' object has no attribute 'day'. So my question is whether there is a way to use np.vectorize on this function qualifies, which takes a datetime/timestamp as an argument.

Many thanks!

PS: for the interested, this is df:

In [15]: df
Out[15]: 
         date  qualifies1
0  2020-01-01       False
1  2020-01-08        True
2  2020-01-15        True
3  2020-01-22        True
4  2020-01-29       False
5  2020-02-05        True
6  2020-02-12        True
7  2020-02-19        True
8  2020-02-26       False
9  2020-03-04       False
10 2020-03-11       False
11 2020-03-18       False
12 2020-03-25       False
13 2020-04-01       False
14 2020-04-08        True
15 2020-04-15        True
16 2020-04-22        True
17 2020-04-29       False
18 2020-05-06        True
19 2020-05-13        True
20 2020-05-20        True
21 2020-05-27        True
22 2020-06-03       False
23 2020-06-10        True
24 2020-06-17        True
25 2020-06-24        True
26 2020-07-01       False
27 2020-07-08        True
28 2020-07-15        True
29 2020-07-22        True
30 2020-07-29       False
31 2020-08-05       False
32 2020-08-12       False
33 2020-08-19       False
34 2020-08-26       False
35 2020-09-02       False
36 2020-09-09        True
37 2020-09-16        True
38 2020-09-23        True
39 2020-09-30       False
40 2020-10-07        True
41 2020-10-14        True
42 2020-10-21        True
43 2020-10-28       False
44 2020-11-04       False
45 2020-11-11        True
46 2020-11-18        True
47 2020-11-25        True
48 2020-12-02       False
49 2020-12-09        True
50 2020-12-16        True
51 2020-12-23        True
52 2020-12-30       False

Answer 1

Score: 2

I think @rpanai's answer on the original post is still the best. Here I share my tests:

def qualifies(dt, excluded_months=[]):
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

def new_qualifies(dt, excluded_months=[]):
    dt = pd.Timestamp(dt)
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=12000)})

apply method:

%%timeit
df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))

385 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


conversion method:

%%timeit
df['qualifies1'] = df['date'].apply(lambda x: new_qualifies(x, [3, 8]))

389 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


vectorized code:

%%timeit
df['qualifies2'] = np.logical_not((df['date'].dt.day < 5).values | \
    ((df['date'] + pd.tseries.offsets.MonthBegin(1) - df['date']).dt.days < 5).values | \
    (df['date'].dt.month.isin([3, 8])).values)

4.83 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
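The column-wise expression above can be wrapped in a function and checked against the row-wise version. This is only a sketch: the name qualifies_columnwise is made up for illustration, and it assumes the date column has a datetime64 dtype so that the .dt accessor is available.

```python
import pandas as pd


def qualifies(dt, excluded_months=()):
    # Row-wise version from the question: expects a single pandas Timestamp.
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True


def qualifies_columnwise(dates, excluded_months=()):
    # Hypothetical helper: the same three rules applied to a whole Series at once.
    too_early = dates.dt.day < 5
    too_late = (dates + pd.tseries.offsets.MonthBegin(1) - dates).dt.days < 5
    excluded = dates.dt.month.isin(list(excluded_months))
    return ~(too_early | too_late | excluded)


df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})
row_wise = df['date'].apply(lambda x: qualifies(x, [3, 8]))
column_wise = qualifies_columnwise(df['date'], [3, 8])

# Both approaches should flag exactly the same dates.
print((row_wise == column_wise).all())
```

Keeping the row-wise function around as a reference makes it easy to confirm that a faster rewrite did not change the rules.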

Answer 2

Score: 1

Summary

If using np.vectorize, it's best to specify otypes. In this case, the error is caused by the trial calculation that vectorize performs when otypes is not specified. An alternative is to pass the Series as an object dtype array.

np.vectorize has a performance disclaimer. np.frompyfunc may be faster, or even a list comprehension.

testing vectorize

Let's define a simpler function - one that displays the type of the argument:

In [31]: def foo(dt, excluded_months=[]): 
    ...:     print(dt,type(dt)) 
    ...:     return True 

And a smaller dataframe:

In [32]: df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=5)})
In [33]: df
Out[33]: 
        date
0 2020-01-01
1 2020-01-08
2 2020-01-15
3 2020-01-22
4 2020-01-29

Testing vectorize. (The vectorize docs say that using the excluded parameter degrades performance, so I'm using a lambda, as with apply.)

In [34]: np.vectorize(lambda x:foo(x,[3,8]))(df['date'])
2020-01-01T00:00:00.000000000 <class 'numpy.datetime64'>
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[34]: array([ True,  True,  True,  True,  True])

That first line is the datetime64 that gives problems. The other lines are the original pandas objects. If I specify the otypes, that problem goes away:

In [35]: np.vectorize(lambda x:foo(x,[3,8]), otypes=['bool'])(df['date'])
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[35]: array([ True,  True,  True,  True,  True])

With apply:

In [36]: df['date'].apply(lambda x: foo(x, [3, 8]))
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[36]: 
0    True
1    True
2    True
3    True
4    True
Name: date, dtype: bool

A datetime64 dtype is produced by wrapping the Series in np.array.

In [37]: np.array(df['date'])
Out[37]: 
array(['2020-01-01T00:00:00.000000000', '2020-01-08T00:00:00.000000000',
       '2020-01-15T00:00:00.000000000', '2020-01-22T00:00:00.000000000',
       '2020-01-29T00:00:00.000000000'], dtype='datetime64[ns]')

Apparently np.vectorize is doing this sort of wrapping when performing the initial trial calculation, but not when doing the main iterations. Specifying the otypes skips that trial calculation. That trial calculation has caused problems in other Stack Overflow questions, though this is a more obscure case.
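To make the difference visible without reading printed output, the types can be collected in a list. The sketch below is only an illustration; the helper name seen_types is made up, and it relies on the default cache=False, under which the trial call is an extra call to the function.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range('2020-01-01', freq='7D', periods=3))


def seen_types(**vectorize_kwargs):
    # Hypothetical helper: record the type name of every value the function receives.
    seen = []

    def record(x):
        seen.append(type(x).__name__)
        return True

    np.vectorize(record, **vectorize_kwargs)(dates)
    return seen


without_otypes = seen_types()             # includes the extra trial call
with_otypes = seen_types(otypes=[bool])   # no trial call is needed

print(without_otypes)
print(with_otypes)
```

If the explanation above is right, only the first list should contain a datetime64 entry.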

In the past, when I've tested np.vectorize, it has been slower than a more explicit iteration. It does have a clear performance disclaimer. It's most valuable when the function takes several inputs and needs the benefit of broadcasting. It's hard to justify when using only one argument.
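As a small illustration of the broadcasting case, here is a sketch with two inputs of different shapes. The function at_least and the sample values are invented for this example and are not part of the question.

```python
import numpy as np


def at_least(day, threshold):
    # Scalar function: knows nothing about arrays.
    return day >= threshold


days = np.array([1, 5, 29])          # shape (3,)
thresholds = np.array([[5], [25]])   # shape (2, 1)

# np.vectorize broadcasts the two inputs against each other, giving shape (2, 3).
table = np.vectorize(at_least, otypes=[bool])(days, thresholds)
print(table.shape)
print(table)
```

With a single input there is nothing to broadcast, which is why a list comprehension is usually the simpler choice there.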

np.frompyfunc underlies vectorize, but returns an object dtype. Often it is 2x faster than explicit iteration on an array, though similar in speed to iteration on a list. It seems to be most useful when creating and working with a numpy array of objects. I haven't gotten it working in this case.

vectorize code

The np.vectorize code is in np.lib.function_base.py.

If otypes is not specified, the code does:

        args = [asarray(arg) for arg in args]
        inputs = [arg.flat[0] for arg in args]
        outputs = func(*inputs)

It makes each argument (here only one) into an array, and takes the first element. And then passes that to the func. As Out[37] shows, that will be a datetime64 object.
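The two conversions can be reproduced outside of np.vectorize. This is a sketch of the idea rather than the library's exact code path: it takes the first element of the plain array (as the trial call does) and of an object dtype array (as the main loop does).

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range('2020-01-01', freq='7D', periods=3))

# What the trial call sees: first element of the plain (datetime64) array.
trial_input = np.asarray(dates).flat[0]

# What the main loop sees: first element of an object dtype array.
loop_input = np.asarray(dates, dtype=object).flat[0]

print(type(trial_input).__name__, hasattr(trial_input, 'day'))
print(type(loop_input).__name__, hasattr(loop_input, 'day'))
```

The missing day attribute on the first value is what surfaces as the AttributeError in the question.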

frompyfunc

To use frompyfunc, I need to convert the dtype of df['date']:

In [68]: np.frompyfunc(lambda x:foo(x,[3,8]), 1,1)(df['date'])
1577836800000000000 <class 'int'>
1578441600000000000 <class 'int'>
...

Without the conversion, it passes an int to the function; with it, it passes the pandas Timestamp objects:

In [69]: np.frompyfunc(lambda x:foo(x,[3,8]), 1,1)(df['date'].astype(object))
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
...

So this use of qualifies works:

In [71]: np.frompyfunc(lambda x:qualifies(x,[3,8]),1,1)(df['date'].astype(object))
Out[71]: 
0    False
1     True
2     True
3     True
4    False
Name: date, dtype: object

object dtype

For the main iteration, np.vectorize does

        ufunc = frompyfunc(_func, len(args), nout)
        # Convert args to object arrays first
        inputs = [array(a, copy=False, subok=True, dtype=object)
                  for a in args]
        outputs = ufunc(*inputs)

That explains why vectorize with otypes works - it is using frompyfunc with an object dtype input. Contrast this with Out[37]:

In [74]: np.array(df['date'], dtype=object)
Out[74]: 
array([Timestamp('2020-01-01 00:00:00'), Timestamp('2020-01-08 00:00:00'),
       Timestamp('2020-01-15 00:00:00'), Timestamp('2020-01-22 00:00:00'),
       Timestamp('2020-01-29 00:00:00')], dtype=object)

And an alternative to specifying otypes is to make sure you are passing object dtype to vectorize:

In [75]: np.vectorize(qualifies, excluded=[1])(df['date'].astype(object), [3, 8])
Out[75]: array([False,  True,  True,  True, False])

This appears to be the fastest version:

np.frompyfunc(lambda x: qualifies(x,[3,8]),1,1)(np.array(df['date'],object))

or better yet, a plain Python iteration:

[qualifies(x,[3,8]) for x in df['date']]
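For completeness, the three working variants can be run side by side on the 53 Wednesdays from the question and compared. This is a sketch; the result variable names are made up, and the frompyfunc output is cast to bool because it comes back with object dtype.

```python
import numpy as np
import pandas as pd


def qualifies(dt, excluded_months=()):
    # Same rules as in the question; expects a pandas Timestamp.
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True


df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})

# 1. np.vectorize with otypes, so no trial call on a datetime64 value.
via_vectorize = np.vectorize(lambda x: qualifies(x, [3, 8]), otypes=[bool])(df['date'])

# 2. np.frompyfunc on an object dtype array of Timestamps.
via_frompyfunc = np.frompyfunc(lambda x: qualifies(x, [3, 8]), 1, 1)(
    np.array(df['date'], dtype=object)
).astype(bool)

# 3. Plain Python iteration over the Series.
via_listcomp = np.array([qualifies(x, [3, 8]) for x in df['date']])

print(np.array_equal(via_vectorize, via_frompyfunc))
print(np.array_equal(via_vectorize, via_listcomp))
```

Any of the three results can be assigned straight to a new column, for example df['qualifies2'] = via_listcomp.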

Answer 3

Score: 0

Just as in the original question, I can "solve" the problem by forcing the incoming argument to be a pandas datetime object, by adding dt = pd.to_datetime(dt) before the first if-statement of the function.

To be honest, this feels like patching up something that's broken and should not be used. I'll just use .apply instead and take the performance hit. Anyone who feels there's a better solution is very much invited to share.
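For reference, the conversion workaround described above might look like the sketch below. The name qualifies_any is made up for illustration; the only change from the question's function is the pd.to_datetime call on the first line.

```python
import numpy as np
import pandas as pd


def qualifies_any(dt, excluded_months=()):
    # Hypothetical variant: accept numpy.datetime64 as well as pandas Timestamp.
    dt = pd.to_datetime(dt)
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True


dates = pd.Series(pd.date_range('2020-01-01', freq='7D', periods=5))

# Works even without otypes, because the trial call's datetime64 is converted first.
result = np.vectorize(lambda x: qualifies_any(x, [3, 8]))(dates)
print(result)
```

The conversion runs once per element, so it adds some overhead compared with passing otypes to np.vectorize.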

huangapple
  • Posted on January 3, 2020, 22:42:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59580504.html