June 15, 2023 02:37:00

What is the most efficient way in Python to compute mean values within a grid cell?

Question

I have a 2d array of data points (X) with corresponding observations (z) and would like to compute a grid of the mean values of z for each cell.

Using nested for loops with Numpy is inefficient. Is there a faster way using a built-in function or list comprehension? I would like to avoid Numba/jit if possible.

import numpy

x = numpy.random.rand(1000000)
y = numpy.random.rand(1000000)
z = numpy.random.rand(1000000)

nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where = ((x>xl[i]) & (x<=xl[i+1]) & (y>yl[j]) & (y<=yl[j+1]))) #4.5 ms/loop = 75 minutes

Or, 2D variant:

import numpy

X = numpy.array([numpy.random.rand(1000000),numpy.random.rand(1000000)]).T
z = numpy.random.rand(1000000)

nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    print(i)
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where = ((X[:,0]>xl[i]) & (X[:,0]<=xl[i+1]) & (X[:,1]>yl[j]) & (X[:,1]<=yl[j+1]))) #4.5 ms/loop = 75 minutes

Answer 1

Score: 0

In pandas you could use:

import pandas as pd

df = pd.DataFrame({'x': pd.cut(x, xl, labels=range(nx)),
                   'y': pd.cut(y, yl, labels=range(ny)),
                   'z': z})

out = (df.groupby(['x', 'y'])['z'].mean().unstack()
         .reindex(index=range(nx), columns=range(ny))
         .to_numpy()
      )

Running time:

3.95 s ± 1.83 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
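
A side note on why pd.cut lines up with the loop in the question: with its default right=True, pd.cut assigns each value to a left-open, right-closed interval (a, b], which is the same convention as the conditions x>xl[i] and x<=xl[i+1]. The sketch below illustrates this on a few hand-picked values; the sample array and the two-bin edges are made up for illustration and are not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, including points that sit exactly on bin edges.
values = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
edges = [0, 0.5, 1]  # two bins: (0, 0.5] and (0.5, 1]

# Default right=True: intervals are left-open and right-closed.
binned = pd.cut(values, edges, labels=range(2))

# Categorical codes: -1 marks a value that fell into no bin (NaN).
codes = list(binned.codes)
print(codes)
```

Here 0.0 lands in no bin, just as it fails x>xl[0] in the loop, while 0.5 and 1.0 belong to the bin whose right edge they sit on.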

Answer 2

Score: 0

What you are building is actually a sort of histogram.
So you could use the provided numpy functions as follows:

import warnings
import numpy as np

x, y, z = np.random.rand(3, 1000000)
nx = ny = 1000

zsum, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=((0,1), (0,1)), weights=z)
zcount, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=((0,1), (0,1)))
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    out = zsum/zcount

print(out)

Here, I suppress the runtime warning triggered by dividing by zero, which is intended in our case because it yields NaN for the empty cells. The histogram2d function returns the x and y edges as its second and third return values; I discard them by assigning them to the throwaway name _.
I create one histogram weighted by z, which sums the values in each cell, and a second, unweighted one, which counts the points in each cell. Their ratio is the mean.
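
To make the sum/count idea concrete, here is a minimal sketch on a 2x2 grid. The four points are made up for illustration, and np.errstate is used instead of the warnings module only as an alternative way to silence the warning locally. One caveat worth checking for your data: histogram2d bins are half-open [a, b) except for the last bin, whereas the loop in the question uses (a, b], so points lying exactly on a bin edge can end up in a different cell.

```python
import numpy as np

# Four hand-picked points: two in the lower-left cell, two in the upper-right.
x = np.array([0.1, 0.2, 0.7, 0.8])
y = np.array([0.1, 0.3, 0.6, 0.9])
z = np.array([1.0, 3.0, 10.0, 20.0])

bins = (2, 2)
extent = ((0, 1), (0, 1))

# Sum of z per cell (weighted histogram) and number of points per cell.
zsum, _, _ = np.histogram2d(x, y, bins=bins, range=extent, weights=z)
zcount, _, _ = np.histogram2d(x, y, bins=bins, range=extent)

# Silence the 0/0 warning only inside this block; empty cells become NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    zmean = zsum / zcount

print(zmean)
```

The first axis of the result corresponds to x and the second to y, so zmean[0, 0] is the mean of the first two z values and zmean[1, 1] the mean of the last two; the two remaining cells are empty and come out as NaN.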

Efficiency:

Note that "efficient" is ambiguous: an algorithm can be optimized for execution time, but also for memory consumption, energy consumption, etc. The comment "4.5 ms/loop = 75 minutes" in your code makes it clear that you are talking about execution time.

On my laptop, the answer using pandas by @mozway takes: 1.1 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).

This answer using numpy.histogram2d takes 508 ms ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).
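
If you want to reproduce such a measurement outside IPython, a rough sketch with the standard timeit module could look like this. The function name grid_mean, the seed, and the reduced problem size (100,000 points on a 100x100 grid, chosen so the sketch runs quickly) are assumptions for illustration; the absolute numbers will differ from the ones quoted above and depend on your machine.

```python
import timeit
import numpy as np

# Smaller than the 1,000,000 points in the question, to keep the sketch fast.
rng = np.random.default_rng(0)
x, y, z = rng.random((3, 100_000))
nx = ny = 100

def grid_mean():
    """Per-cell mean of z via two histograms (sum and count)."""
    extent = ((0, 1), (0, 1))
    zsum, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=extent, weights=z)
    zcount, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=extent)
    with np.errstate(invalid="ignore", divide="ignore"):
        return zsum / zcount

# 7 runs of 1 call each, mirroring "7 runs, 1 loop each".
times = timeit.repeat(grid_mean, repeat=7, number=1)
print(f"best of 7: {min(times) * 1e3:.1f} ms")
```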
