What is the most efficient way in Python to compute mean values within a grid cell?

Question

I have a 2D array of data points (X) with corresponding observations (z) and would like to compute, for each cell of a regular grid, the mean of the z values that fall in that cell.

Using nested for loops with NumPy is inefficient. Is there a faster way using a built-in function or a list comprehension? I would like to avoid Numba/JIT if possible.

import numpy

x = numpy.random.rand(1000000)
y = numpy.random.rand(1000000)
z = numpy.random.rand(1000000)

nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where = ((x>xl[i]) & (x<=xl[i+1]) & (y>yl[j]) & (y<=yl[j+1]))) #4.5 ms/loop = 75 minutes    

Or, 2D variant:

import numpy

X = numpy.array([numpy.random.rand(1000000),numpy.random.rand(1000000)]).T
z = numpy.random.rand(1000000)

nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    print(i)
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where = ((X[:,0]>xl[i]) & (X[:,0]<=xl[i+1]) & (X[:,1]>yl[j]) & (X[:,1]<=yl[j+1]))) #4.5 ms/loop = 75 minutes

Answer 1

Score: 0

With pandas, you could bin the coordinates with pd.cut and then group by the bin labels:

import pandas as pd

df = pd.DataFrame({'x': pd.cut(x, xl, labels=range(nx)),
                   'y': pd.cut(y, yl, labels=range(ny)),
                   'z': z})

out = (df.groupby(['x', 'y'])['z'].mean().unstack()
         .reindex(index=range(nx), columns=range(ny))
         .to_numpy()
      )

Running time:

3.95 s ± 1.83 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
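
To see what the pd.cut plus groupby approach computes, here is a minimal sketch on a hand-checkable 2 x 2 grid, compared against the nested loops from the question. The sample values are made up for illustration, and passing labels=False (so that pd.cut returns plain integer bin codes rather than categoricals) is my variation, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Tiny illustrative data set on a 2 x 2 grid over (0, 1] x (0, 1].
x = np.array([0.1, 0.2, 0.7, 0.9, 0.5])
y = np.array([0.1, 0.3, 0.8, 0.6, 0.9])
z = np.array([1.0, 3.0, 10.0, 20.0, 7.0])
nx = ny = 2
xl = np.linspace(0, 1, nx + 1)
yl = np.linspace(0, 1, ny + 1)

# pd.cut uses right-closed intervals (a, b] by default, like the question's loops.
# labels=False (an assumption of this sketch) yields integer bin codes.
df = pd.DataFrame({"x": pd.cut(x, xl, labels=False),
                   "y": pd.cut(y, yl, labels=False),
                   "z": z})
out = (df.groupby(["x", "y"])["z"].mean().unstack()
         .reindex(index=range(nx), columns=range(ny))
         .to_numpy())

# Reference result: the nested loops from the question, on the small grid.
ref = np.full((nx, ny), np.nan)
for i in range(nx):
    for j in range(ny):
        mask = (x > xl[i]) & (x <= xl[i + 1]) & (y > yl[j]) & (y <= yl[j + 1])
        if mask.any():
            ref[i, j] = z[mask].mean()

print(out)
print(np.allclose(out, ref, equal_nan=True))
```

Cell (0, 0) holds the z values 1 and 3, so its mean is 2. Cell (1, 0) receives no points and stays NaN after the reindex step.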

Answer 2

Score: 0


What you are building is essentially a weighted 2D histogram, so you can use NumPy's histogram2d function as follows:

import warnings
import numpy as np

x, y, z = np.random.rand(3, 1000000)
nx = ny = 1000

zsum, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=((0,1), (0,1)), weights=z)
zcount, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=((0,1), (0,1)))
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    out = zsum/zcount

print(out)

Here I suppress the RuntimeWarning about division by zero, which is intentional in this case: empty cells give 0/0 and therefore NaN. I also discard the x and y bin edges that histogram2d returns as its second and third return values by assigning them to the throwaway name _.
One histogram is weighted by z to sum the values in each cell, the other counts the points per cell, and their ratio gives the mean.
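
The sum-over-count idea can be checked by hand on a small grid. The sketch below uses made-up sample values and swaps in np.errstate as an alternative way to silence the 0/0 warning; that substitution is mine, not part of the answer.

```python
import numpy as np

# Four illustrative points on a 2 x 2 grid over [0, 1] x [0, 1].
x = np.array([0.1, 0.2, 0.7, 0.9])
y = np.array([0.1, 0.3, 0.8, 0.6])
z = np.array([1.0, 3.0, 10.0, 20.0])
nx = ny = 2
grid = dict(bins=(nx, ny), range=((0, 1), (0, 1)))

# Weighted histogram: sum of z per cell. Unweighted histogram: points per cell.
zsum, _, _ = np.histogram2d(x, y, weights=z, **grid)
zcount, _, _ = np.histogram2d(x, y, **grid)

# np.errstate silences the 0/0 warning locally; empty cells become NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    zmean = zsum / zcount

print(zsum)    # per-cell sums
print(zcount)  # per-cell counts
print(zmean)   # per-cell means, NaN where a cell is empty
```

One detail to keep in mind: histogram2d bins are half-open [a, b), with only the last bin closed on the right, while the loops in the question use (a, b]. The two conventions differ only for points that lie exactly on a bin edge.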

Efficiency:

Note that "efficient" is ambiguous: an algorithm can be optimized for execution time, but also for memory consumption, energy consumption, and so on. The comment "4.5 ms/loop = 75 minutes" makes clear that you mean execution time.

On my laptop, the pandas answer by @mozway takes 1.1 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).

This answer using numpy.histogram2d takes 508 ms ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).
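
If the exact (a, b] interval convention of the question's loops matters, one possible alternative is to compute the cell indices with np.searchsorted and accumulate with np.bincount. This is a sketch of a different technique, not the method of either answer above; the helper name grid_mean is hypothetical, and it has not been benchmarked here.

```python
import numpy as np

def grid_mean(x, y, z, xl, yl):
    """Mean of z per cell, with cells (xl[i], xl[i+1]] x (yl[j], yl[j+1]]."""
    nx, ny = len(xl) - 1, len(yl) - 1
    # side="left" returns k with edges[k-1] < v <= edges[k], so k - 1 is the cell.
    ix = np.searchsorted(xl, x, side="left") - 1
    iy = np.searchsorted(yl, y, side="left") - 1
    # Drop points outside the grid (including points exactly on the lower edge).
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    flat = ix[inside] * ny + iy[inside]
    zsum = np.bincount(flat, weights=z[inside], minlength=nx * ny)
    count = np.bincount(flat, minlength=nx * ny)
    # Empty cells give 0/0, which becomes NaN; silence the warning locally.
    with np.errstate(invalid="ignore", divide="ignore"):
        return (zsum / count).reshape(nx, ny)

# Usage on random data with a small grid (sizes chosen for illustration).
rng = np.random.default_rng(0)
x, y, z = rng.random((3, 500))
xl = np.linspace(0, 1, 5)
yl = np.linspace(0, 1, 4)
zm = grid_mean(x, y, z, xl, yl)
print(zm.shape)
```

The flat index ix * ny + iy maps each (i, j) cell to one slot, so a single bincount pass produces all sums and another produces all counts.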

Posted by huangapple on June 15, 2023, 02:37
Source: https://go.coder-hub.com/76476636.html