Chunked xarray: efficiently loading cells into memory

Question

Context:

I have a datacube with 3 variables (3D arrays with dims time, y, x). The datacube is too big to fit in memory, so I chunk it with xarray/dask. I want to apply a function to every (x, y) cell of every variable in the datacube.

Problem:

My method takes a long time to load a single cell (1 minute), and I have to do that 112200 times. I use a for loop with dataset.variable.isel(x=i, y=j).values to load a single 1D array from my variables. Is there a better way to do that? Also, knowing that my dataset is chunked, is there a way to do it in parallel for all the chunks at once?

Code example:

# Setup
import xarray as xr
import numpy as np

# Create the dimensions
x = np.linspace(0, 99, 100)
y = np.linspace(0, 349, 350)
time = np.linspace(0, 299, 300)

# Create the dataset
xrds= xr.Dataset()

# Add the dimensions to the dataset
xrds['time'] = time
xrds['y'] = y
xrds['x'] = x

# Create the random data variables with chunking
chunksize = (10, 100, 100)  # Chunk size for the variables
data_var1 = np.random.rand(len(time), len(y), len(x))
data_var2 = np.random.rand(len(time), len(y), len(x))
data_var3 = np.random.rand(len(time), len(y), len(x))

xrds['data_var1'] = (('time', 'y', 'x'), data_var1, {'chunks': chunksize})
xrds['data_var2'] = (('time', 'y', 'x'), data_var2, {'chunks': chunksize})
xrds['data_var3'] = (('time', 'y', 'x'), data_var3, {'chunks': chunksize})

#### ---- My Attempt ---- ####

# Iterate through all the variables in my dataset
for var_name, var_data in xrds.data_vars.items():

    # if variable is 3D
    if var_data.shape == (xrds.dims['time'], xrds.dims['y'], xrds.dims['x']):

        # Iterate through every cell of the variable along the x and y axis only
        for i in range(xrds.dims['y']):
            for j in range(xrds.dims['x']):

                # Load a single 1D cell into memory (len(cell) = len(time))
                print(var_data.isel(y=i, x=j).values)

Answer 1

Score: 1

I find that explicitly iterating over the DataArray is faster than calling isel() for every cell, by roughly 15% in my benchmark.

Example:

    for var_name, var_data in xrds.data_vars.items():

        # if variable is 3D
        if var_data.shape == (xrds.dims['time'], xrds.dims['y'], xrds.dims['x']):

            # Iterate through every cell of the variable along the x and y axis only
            for i_array in var_data.transpose('x', 'y', 'time'):
                x_coordinate = i_array.x.item()
                for cell in i_array.transpose('y', 'time'):
                    y_coordinate = cell.y.item()
                    # Do something with cell

This takes 17.38s, versus 20.47s for the original.
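Another option, which I have not benchmarked, is to avoid per-cell indexing altogether: load one (y, x) block covering the full time axis, then loop over its cells in plain NumPy. Below is a minimal sketch, assuming a block of shape (len(time), block, block) fits in memory. iter_cells, block and demo are hypothetical names made up for illustration, not part of xarray.

```python
import numpy as np
import xarray as xr


def iter_cells(da, block=100):
    """Yield (y_index, x_index, 1D time series) for each cell of a (time, y, x) DataArray.

    Hypothetical helper: it loads one (y, x) block at a time, so each piece
    of the array is read once instead of once per cell.
    """
    da = da.transpose("time", "y", "x")
    ny, nx = da.sizes["y"], da.sizes["x"]
    for y0 in range(0, ny, block):
        for x0 in range(0, nx, block):
            # One load per block; .values returns a NumPy array
            # (and triggers the computation if the data is dask-backed).
            values = da.isel(y=slice(y0, y0 + block), x=slice(x0, x0 + block)).values
            for dy in range(values.shape[1]):
                for dx in range(values.shape[2]):
                    yield y0 + dy, x0 + dx, values[:, dy, dx]


# Small stand-in for the real datacube
demo = xr.DataArray(np.random.rand(30, 35, 10), dims=("time", "y", "x"))
n_cells = sum(1 for _ in iter_cells(demo, block=16))
```

If the data is chunked, picking block to match the chunk size along y and x should mean each chunk is touched by only one block.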

PS: The line chunksize = (10, 100, 100) looks suspicious to me. If you want to load the array for the entire time axis of a cell at once, the chunks should be set so that this doesn't require reading multiple chunks. chunksize = (len(time), 100, 100) seems like it would be more efficient. However, I benchmarked it both ways and it makes no difference for this data size. It may make a difference on your larger problem.
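As far as I can tell, the third element of the (dims, data, attrs) tuple in the question's setup is stored as the variable's attrs, so it does not create dask chunks. That may be why the chunk size made no difference in the benchmark. Below is a minimal sketch of chunking with .chunk() so that the time axis sits in a single chunk, then applying a function to every cell chunk by chunk with xr.apply_ufunc. It assumes dask is installed. cell_func is a hypothetical stand-in for the real per-cell function, and the array sizes are reduced for illustration.

```python
import numpy as np
import xarray as xr  # chunking also requires dask to be installed

data = np.random.rand(30, 35, 10)
ds = xr.Dataset({"data_var1": (("time", "y", "x"), data)})

# Chunk after construction; -1 means "one chunk spanning the whole dimension".
ds = ds.chunk({"time": -1, "y": 16, "x": 10})


def cell_func(series):
    # Hypothetical per-cell function: takes a 1D time series, returns a scalar.
    return series.max() - series.min()


result = xr.apply_ufunc(
    cell_func,
    ds["data_var1"],
    input_core_dims=[["time"]],  # the function consumes the time axis
    vectorize=True,              # call cell_func once per (y, x) cell
    dask="parallelized",         # apply chunk by chunk
    output_dtypes=[float],
)

# Nothing is computed until this point; the result has dims (y, x).
computed = result.compute()
```

With dask="parallelized", the core dimension (time here) has to sit in a single chunk, which is what the -1 arranges. If cell_func can be written with vectorized NumPy operations over the last axis, dropping vectorize=True should be considerably faster.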
