

What is the most efficient way in Python to compute mean values within a grid cell?



I have a 2D array of data points (X) with corresponding observations (z) and would like to compute a grid of the mean values of z for each cell.

Using nested for loops with NumPy is inefficient. Is there a faster way using a built-in function or list comprehension? I would like to avoid Numba/jit if possible.

import numpy

x = numpy.random.rand(1000000)
y = numpy.random.rand(1000000)
z = numpy.random.rand(1000000)

nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where = ((x>xl[i]) & (x<=xl[i+1]) & (y>yl[j]) & (y<=yl[j+1]))) #4.5 ms/loop = 75 minutes    

Or, 2D variant:

import numpy

X = numpy.array([numpy.random.rand(1000000),numpy.random.rand(1000000)]).T
z = numpy.random.rand(1000000)

nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where = ((X[:,0]>xl[i]) & (X[:,0]<=xl[i+1]) & (X[:,1]>yl[j]) & (X[:,1]<=yl[j+1]))) #4.5 ms/loop = 75 minutes


Score: 0

In pandas you could use:

import pandas as pd

df = pd.DataFrame({'x': pd.cut(x, xl, labels=range(nx)),
                   'y': pd.cut(y, yl, labels=range(ny)),
                   'z': z})

out = (df.groupby(['x', 'y'])['z'].mean().unstack()
         .reindex(index=range(nx), columns=range(ny)))

Running time:

3.95 s ± 1.83 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Score: 0



What you are building is essentially a histogram, so you can use NumPy's histogram functions as follows:

import warnings
import numpy as np

x, y, z = np.random.rand(3, 1000000)
nx = ny = 1000

zsum, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=((0,1), (0,1)), weights=z)
zcount, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=((0,1), (0,1)))
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    out = zsum/zcount


Here I suppress the RuntimeWarning raised by the division: for empty cells the 0/0 division is intentional, since it produces the NaN values we want. I also discard the x and y edges that histogram2d returns as its second and third return values by assigning them to the throwaway name _.
One histogram uses z as weights to sum the values in each cell, the other counts the points in each cell, and their ratio is the mean.
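To check that the sum/count ratio really reproduces the loop from the question, here is a minimal sketch on a small grid. The seed, the 50 points and the 8 x 8 grid are arbitrary choices for illustration, and the names slow and fast are mine. One caveat: histogram2d uses half-open bins [a, b) (with the last bin closed), while the question uses (a, b], so a point lying exactly on an edge could land in a different cell; for random floats this should practically never happen.

```python
import numpy as np

# Small, reproducible test case (sizes chosen only for illustration).
rng = np.random.default_rng(0)
x, y, z = rng.random((3, 50))
nx = ny = 8

# Reference result: the nested loop from the question, NaN for empty cells.
xl = np.linspace(0, 1, nx + 1)
yl = np.linspace(0, 1, ny + 1)
slow = np.full((nx, ny), np.nan)
for i in range(nx):
    for j in range(ny):
        mask = (x > xl[i]) & (x <= xl[i + 1]) & (y > yl[j]) & (y <= yl[j + 1])
        if mask.any():
            slow[i, j] = z[mask].mean()

# Histogram approach: per-cell sum of z divided by per-cell point count.
zsum, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=((0, 1), (0, 1)), weights=z)
zcount, _, _ = np.histogram2d(x, y, bins=(nx, ny), range=((0, 1), (0, 1)))
with np.errstate(invalid="ignore"):  # 0/0 in empty cells gives NaN on purpose
    fast = zsum / zcount

# Compare both grids, treating NaN in the same position as equal.
print(np.allclose(fast, slow, equal_nan=True))
```

Here np.errstate is used instead of the warnings module; it silences the same floating-point warning for the division only.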


Note that you should say what you mean by "efficient": an algorithm can be optimized for execution time, but also for memory consumption, energy consumption, etc. From the comment 4.5 ms/loop = 75 minutes it is clear that you are talking about execution time.

On my laptop, the answer using pandas by @mozway takes: 1.1 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).

This answer using numpy.histogram2d takes 508 ms ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).
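The same sum/count idea can also be written with np.bincount on flattened cell indices. This is not part of either answer and is not timed here, so treat it as a sketch under assumptions: the helper name grid_mean_bincount is hypothetical, all points are assumed to lie in [0, 1], and the cell index is computed by multiplication, which may differ from an edge comparison for points extremely close to a cell boundary.

```python
import numpy as np

def grid_mean_bincount(x, y, z, nx, ny):
    """Mean of z per cell on the unit square, NaN for empty cells.

    Assumes 0 <= x, y <= 1. Cells are half-open [a, b); points at
    exactly 1.0 are clipped into the last cell.
    """
    # Integer cell index along each axis.
    ix = np.minimum((x * nx).astype(np.intp), nx - 1)
    iy = np.minimum((y * ny).astype(np.intp), ny - 1)
    # One flat index per point, row-major like a (nx, ny) array.
    flat = ix * ny + iy
    zsum = np.bincount(flat, weights=z, minlength=nx * ny)
    zcount = np.bincount(flat, minlength=nx * ny)
    with np.errstate(invalid="ignore"):  # 0/0 in empty cells gives NaN
        return (zsum / zcount).reshape(nx, ny)

# Example call with the sizes from the question.
x, y, z = np.random.rand(3, 1000000)
out = grid_mean_bincount(x, y, z, 1000, 1000)
```

Whether this is faster than histogram2d depends on the machine and NumPy version, so it is worth timing both on your own data.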

  • Posted on June 15, 2023, 02:37:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76476636.html



