January 3, 2020, 22:42:33

Numpy vectorization messes up data type (2)

Question

I'm having unwanted behaviour come out of np.vectorize, namely, it changes the datatype of the argument going into the original function. My original question is about the general case, and I'll use this new question to ask a more specific case.

(Why this second question? I've created this question about a more specific case in order to illustrate the problem; it's usually easier to go from the specific to the more general. And I've created it separately, because I think it's useful to keep the general case, as well as a general answer to it (should one be found), by themselves and not 'contaminated' with thinking about solving any particular problem.)

So, a concrete example. Where I live, Wednesday is Lottery Day. So, let's start with a pandas dataframe with a date column with all Wednesdays this year:

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})

I want to see which of these possible days I'll actually play on. I don't feel particularly lucky at the beginning and end of each month, and there are some months I feel especially unlucky about. Therefore I use this function to see if a date qualifies:

def qualifies(dt, excluded_months=[]):
    # Date qualifies, if...
    # . it's on or after the 5th of the month; and
    # . at least 5 days remain till the end of the month (incl. date itself); and
    # . it's not in one of the months in excluded_months.
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

I hope you realise that this example is still somewhat contrived, but it's closer to what I'm trying to do. I try to apply this function in two ways:

df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))
df['qualifies2'] = np.vectorize(qualifies, excluded=[1])(df['date'], [3, 8])

As far as I know, both should work, and I'd prefer the latter, as the former is slow and frowned upon. Edit: I've since learned that the first is frowned upon too, lol.

However, only the first one succeeds; the second one fails with AttributeError: 'numpy.datetime64' object has no attribute 'day'. So my question is whether there is a way to use np.vectorize on this function qualifies, which takes a datetime/timestamp as an argument.

Many thanks!

PS: for the interested, this is df:

In [15]: df
Out[15]: 
         date  qualifies1
0  2020-01-01       False
1  2020-01-08        True
2  2020-01-15        True
3  2020-01-22        True
4  2020-01-29       False
5  2020-02-05        True
6  2020-02-12        True
7  2020-02-19        True
8  2020-02-26       False
9  2020-03-04       False
10 2020-03-11       False
11 2020-03-18       False
12 2020-03-25       False
13 2020-04-01       False
14 2020-04-08        True
15 2020-04-15        True
16 2020-04-22        True
17 2020-04-29       False
18 2020-05-06        True
19 2020-05-13        True
20 2020-05-20        True
21 2020-05-27        True
22 2020-06-03       False
23 2020-06-10        True
24 2020-06-17        True
25 2020-06-24        True
26 2020-07-01       False
27 2020-07-08        True
28 2020-07-15        True
29 2020-07-22        True
30 2020-07-29       False
31 2020-08-05       False
32 2020-08-12       False
33 2020-08-19       False
34 2020-08-26       False
35 2020-09-02       False
36 2020-09-09        True
37 2020-09-16        True
38 2020-09-23        True
39 2020-09-30       False
40 2020-10-07        True
41 2020-10-14        True
42 2020-10-21        True
43 2020-10-28       False
44 2020-11-04       False
45 2020-11-11        True
46 2020-11-18        True
47 2020-11-25        True
48 2020-12-02       False
49 2020-12-09        True
50 2020-12-16        True
51 2020-12-23        True
52 2020-12-30       False

Answer 1

Score: 2

I think @rpanai's answer on the original post is still the best. Here I share my tests:

def qualifies(dt, excluded_months=[]):
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

def new_qualifies(dt, excluded_months=[]):
    dt = pd.Timestamp(dt)
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=12000)})

apply method:

%%timeit
df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))

385 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

conversion method:

%%timeit
df['qualifies1'] = df['date'].apply(lambda x: new_qualifies(x, [3, 8]))

389 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vectorized code:

%%timeit
df['qualifies2'] = np.logical_not((df['date'].dt.day < 5).values | \
    ((df['date'] + pd.tseries.offsets.MonthBegin(1) - df['date']).dt.days < 5).values | \
    (df['date'].dt.month.isin([3, 8])).values)

4.83 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
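
The vectorized expression above can also be packaged as a reusable helper. The following is a minimal sketch, not part of the original answer: the name qualifies_column is hypothetical, and it assumes that "days until the first of the next month" can be computed as days_in_month - day + 1, which should match the MonthBegin(1) arithmetic in qualifies. It checks itself against the row-wise apply version on the 53 Wednesdays from the question.

```python
import pandas as pd

def qualifies(dt, excluded_months=()):
    # Scalar version from the question (tuple default instead of a list).
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

def qualifies_column(dates, excluded_months=()):
    # Hypothetical helper: same three rules, expressed on a whole
    # datetime64 Series through the .dt accessor.
    day = dates.dt.day
    # Days from the date (inclusive) to the end of its month.
    days_left = dates.dt.days_in_month - day + 1
    return (day >= 5) & (days_left >= 5) & ~dates.dt.month.isin(list(excluded_months))

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})
df['qualifies_vec'] = qualifies_column(df['date'], [3, 8])
df['qualifies_apply'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))

# Both columns should agree on every row.
print((df['qualifies_vec'] == df['qualifies_apply']).all())
```

Because this version never calls a Python function per row, it sidesteps the np.vectorize dtype question entirely.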

Answer 2

Score: 1

Summary

If using np.vectorize, it's best to specify otypes. In this case, the error is caused by the trial calculation that vectorize uses when otypes is not specified. An alternative is to pass the Series as an object dtype array.

np.vectorize has a performance disclaimer. np.frompyfunc may be faster, or even a list comprehension.

testing vectorize

Let's define a simpler function - one that displays the type of the argument:

In [31]: def foo(dt, excluded_months=[]): 
    ...:     print(dt,type(dt)) 
    ...:     return True

And a smaller dataframe:

In [32]: df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=5)})
In [33]: df
Out[33]: 
        date
0 2020-01-01
1 2020-01-08
2 2020-01-15
3 2020-01-22
4 2020-01-29

Testing vectorize. (The vectorize docs say that using the excluded parameter degrades performance, so I'm using a lambda, as with apply):

In [34]: np.vectorize(lambda x:foo(x,[3,8]))(df['date'])
2020-01-01T00:00:00.000000000 <class 'numpy.datetime64'>
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[34]: array([ True,  True,  True,  True,  True])

That first line is the datetime64 that gives problems. The other lines are the original pandas objects. If I specify the otypes, that problem goes away:

In [35]: np.vectorize(lambda x:foo(x,[3,8]), otypes=['bool'])(df['date'])
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[35]: array([ True,  True,  True,  True,  True])
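
The same otypes fix can be applied to the real qualifies function rather than the foo stand-in. This is a minimal sketch under the assumption that, with otypes given, np.vectorize skips the trial call and hands each element to the function as a pandas Timestamp (which is what the output above suggests); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

def qualifies(dt, excluded_months=()):
    # Scalar version from the question (tuple default instead of a list).
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=5)})

# otypes=[bool] declares the output dtype up front, so no trial
# calculation on a numpy.datetime64 element is needed.
vqualifies = np.vectorize(lambda x: qualifies(x, [3, 8]), otypes=[bool])
result = vqualifies(df['date'])
print(result)
```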

And with apply:

In [36]: df['date'].apply(lambda x: foo(x, [3, 8]))
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[36]: 
0    True
1    True
2    True
3    True
4    True
Name: date, dtype: bool

A datetime64 dtype is produced by wrapping the Series in np.array:

In [37]: np.array(df['date'])
Out[37]: 
array(['2020-01-01T00:00:00.000000000', '2020-01-08T00:00:00.000000000',
       '2020-01-15T00:00:00.000000000', '2020-01-22T00:00:00.000000000',
       '2020-01-29T00:00:00.000000000'], dtype='datetime64[ns]')

Apparently np.vectorize is doing this sort of wrapping when performing the initial trial calculation, but not when doing the main iterations. Specifying the otypes skips that trial calculation. That trial calculation has caused problems in other SO questions, though this is a more obscure case.

In the past, when I've tested np.vectorize, it has been slower than a more explicit iteration. It does have a clear performance disclaimer. It's most valuable when the function takes several inputs and needs the benefit of broadcasting. It's hard to justify when using only one argument.

np.frompyfunc underlies vectorize, but returns an object dtype. Often it is 2x faster than explicit iteration on an array, though similar in speed to iteration on a list. It seems to be most useful when creating and working with a numpy array of objects. I haven't gotten it working in this case.

vectorize code

The np.vectorize code is in numpy/lib/function_base.py.

If otypes is not specified, the code does:

        args = [asarray(arg) for arg in args]
        inputs = [arg.flat[0] for arg in args]
        outputs = func(*inputs)

It makes each argument (here only one) into an array, and takes the first element. And then passes that to the func. As Out[37] shows, that will be a datetime64 object.
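
The difference between the two conversions can be checked directly without np.vectorize. A minimal sketch, assuming only the two conversions quoted from the NumPy source in this answer (asarray for the trial call, an object dtype array for the main loop); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range('2020-01-01', freq='7D', periods=5))

# What the trial calculation sees: the first element of a plain
# datetime64 array, which is a numpy.datetime64 scalar.
trial_input = np.asarray(dates).flat[0]

# What the main loop sees: the first element of an object array,
# which keeps the pandas Timestamp.
loop_input = np.array(dates, dtype=object).flat[0]

print(type(trial_input), hasattr(trial_input, 'day'))
print(type(loop_input), loop_input.day)
```

The missing day attribute on the first value is what surfaces as the AttributeError in the question.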

frompyfunc

To use frompyfunc, I need to convert the dtype of df['date']:

In [68]: np.frompyfunc(lambda x:foo(x,[3,8]), 1,1)(df['date'])
1577836800000000000 <class 'int'>
1578441600000000000 <class 'int'>
...

Without the conversion, it passes int to the function; with it, it passes the pandas time objects:

In [69]: np.frompyfunc(lambda x:foo(x,[3,8]), 1,1)(df['date'].astype(object))
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
...

So this use of qualifies works:

In [71]: np.frompyfunc(lambda x:qualifies(x,[3,8]),1,1)(df['date'].astype(object))
Out[71]: 
0    False
1     True
2     True
3     True
4    False
Name: date, dtype: object
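
Since frompyfunc returns an object dtype, the result usually needs one more cast before it can serve as a boolean mask. A minimal sketch, assuming the input is first converted to an object array of Timestamps as described above; qualifies_ufunc, raw and mask are illustrative names.

```python
import numpy as np
import pandas as pd

def qualifies(dt, excluded_months=()):
    # Scalar version from the question (tuple default instead of a list).
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=5)})

# One input, one output; the elements passed in are pandas Timestamps
# because the array has object dtype.
qualifies_ufunc = np.frompyfunc(lambda x: qualifies(x, [3, 8]), 1, 1)
raw = qualifies_ufunc(np.array(df['date'], dtype=object))

# raw holds Python bools in an object array; cast it to a real bool array.
mask = raw.astype(bool)
print(raw.dtype, mask.dtype, mask)
```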

object dtype

For the main iteration, np.vectorize does:

        ufunc = frompyfunc(_func, len(args), nout)
        # Convert args to object arrays first
        inputs = [array(a, copy=False, subok=True, dtype=object)
                  for a in args]
        outputs = ufunc(*inputs)

That explains why vectorize with otypes works - it is using frompyfunc with an object dtype input. Contrast this with Out[37]:

In [74]: np.array(df['date'], dtype=object)
Out[74]: 
array([Timestamp('2020-01-01 00:00:00'), Timestamp('2020-01-08 00:00:00'),
       Timestamp('2020-01-15 00:00:00'), Timestamp('2020-01-22 00:00:00'),
       Timestamp('2020-01-29 00:00:00')], dtype=object)

And an alternative to specifying otypes is to make sure you are passing object dtype to vectorize:

In [75]: np.vectorize(qualifies, excluded=[1])(df['date'].astype(object), [3, 8])
Out[75]: array([False,  True,  True,  True, False])

This appears to be the fastest version:

np.frompyfunc(lambda x: qualifies(x,[3,8]),1,1)(np.array(df['date'],object))

or better yet, a plain Python iteration:

[qualifies(x,[3,8]) for x in df['date']]

Answer 3

Score: 0

Just as in the original question, I can "solve" the problem by forcing the incoming argument to be a pandas datetime object, by adding dt = pd.to_datetime(dt) before the first if-statement of the function.

To be honest, this feels like patching up something that's broken and should not be used. I'll just use .apply instead and take the performance hit. Anyone who feels there's a better solution is very much invited to share.
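
For reference, the workaround described in this answer might look like the following. It is a minimal sketch: the name qualifies_coerced is hypothetical, and the only change from the question's function is the pd.to_datetime call, which turns the numpy.datetime64 seen during np.vectorize's trial call into a pandas Timestamp and leaves Timestamps as they are.

```python
import numpy as np
import pandas as pd

def qualifies_coerced(dt, excluded_months=()):
    # Coerce whatever arrives (numpy.datetime64 or Timestamp) to a Timestamp.
    dt = pd.to_datetime(dt)
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=5)})

# Same call as in the question; position 1 (excluded_months) is not vectorized.
result = np.vectorize(qualifies_coerced, excluded=[1])(df['date'], [3, 8])
print(result)
```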
