What is the most efficient way in Python to compute mean values within a grid cell?

Question:
I have a 2d array of data points (X) with corresponding observations (z) and would like to compute a grid of the mean values of z for each cell.
Using nested for loops with Numpy is inefficient. Is there a faster way using a built-in function or list comprehension? I would like to avoid Numba/jit if possible.
import numpy
x = numpy.random.rand(1000000)
y = numpy.random.rand(1000000)
z = numpy.random.rand(1000000)
nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where=((x>xl[i]) & (x<=xl[i+1]) & (y>yl[j]) & (y<=yl[j+1])))  # 4.5 ms/loop = 75 minutes
Or, 2D variant:
import numpy
X = numpy.array([numpy.random.rand(1000000),numpy.random.rand(1000000)]).T
z = numpy.random.rand(1000000)
nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    print(i)
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where=((X[:,0]>xl[i]) & (X[:,0]<=xl[i+1]) & (X[:,1]>yl[j]) & (X[:,1]<=yl[j+1])))  # 4.5 ms/loop = 75 minutes
Answer 1 (score: 0)
In pandas you could use:
import pandas as pd

df = pd.DataFrame({'x': pd.cut(x, xl, labels=range(nx)),
                   'y': pd.cut(y, yl, labels=range(ny)),
                   'z': z})

out = (df.groupby(['x', 'y'])['z'].mean().unstack()
         .reindex(index=range(nx), columns=range(ny))
         .to_numpy()
       )
Running time:
3.95 s ± 1.83 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
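A pure-NumPy alternative to the DataFrame round-trip is to flatten each point's cell index and use numpy.bincount for the per-cell sums and counts. This is a minimal sketch under the same assumptions (uniform bins on [0,1]; small sizes chosen here for illustration); note its bin edges are left-closed, unlike the strict lower-bound inequalities in the question:

```python
import numpy as np

rng = np.random.default_rng(0)
n, nx, ny = 10_000, 10, 10          # small sizes for illustration
x, y, z = rng.random((3, n))

# Map each point to a flat cell index ix*ny + iy (uniform bins on [0, 1)).
ix = np.minimum((x * nx).astype(int), nx - 1)
iy = np.minimum((y * ny).astype(int), ny - 1)
flat = ix * ny + iy

# Sum z per cell and count points per cell, one pass each.
zsum = np.bincount(flat, weights=z, minlength=nx * ny)
zcount = np.bincount(flat, minlength=nx * ny)

with np.errstate(invalid="ignore", divide="ignore"):
    zm = (zsum / zcount).reshape(nx, ny)   # NaN where a cell is empty
```

Like the histogram approach below, this avoids any Python-level loop over cells; it trades the readability of groupby for a few explicit index computations.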
Answer 2 (score: 0)
What you are building is actually a sort of histogram.
So you could use the provided numpy functions as follows:
import warnings
import numpy as np

x, y, z = np.random.rand(3, 1000000)
nx = ny = 1000

zsum, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=((0,1), (0,1)), weights=z)
zcount, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=((0,1), (0,1)))

with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    out = zsum/zcount

print(out)
Here, I suppress the runtime warning about division by zero, which in our case is intentional: empty cells are meant to come out as NaN. Furthermore, I ignore the x and y bin edges that histogram2d returns as its second and third return values, assigning them to the throw-away name _.
I create one histogram with the z values as weights to sum them up per cell, and another one with the plain counts, so the ratio of the two gives the per-cell average.
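As a sanity check (my addition, not part of the original answer): on a small grid, the ratio of the two histograms should match a direct masked mean for any given cell. Cell (0,0) covers x in [xe[0], xe[1]) and y in [ye[0], ye[1]), since histogram2d bins are half-open except the last one:

```python
import numpy as np

rng = np.random.default_rng(42)
x, y, z = rng.random((3, 5000))
nx = ny = 4

# Weighted histogram sums z per cell; plain histogram counts points per cell.
zsum, xe, ye = np.histogram2d(x, y, bins=(nx, ny), range=((0, 1), (0, 1)), weights=z)
zcount, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=((0, 1), (0, 1)))
with np.errstate(invalid="ignore", divide="ignore"):
    out = zsum / zcount

# Direct mean over the points that fall in cell (0, 0).
mask = (x >= xe[0]) & (x < xe[1]) & (y >= ye[0]) & (y < ye[1])
direct = z[mask].mean()
```

Here np.errstate is used instead of warnings.catch_warnings; it is the NumPy-specific way to silence floating-point warnings for just the division.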
Efficiency:
Note that you should specify a bit what you mean by "efficient", because you can optimize an algorithm with respect to execution time, but also with respect to memory consumption, energy consumption, etc. From the comment "4.5 ms/loop = 75 minutes" it is clear that you are talking about execution time.
On my laptop, the answer using pandas by @mozway takes: 1.1 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).
This answer using numpy.histogram2d takes 508 ms ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).