Chunked xarray: load only 1 cell in memory efficiently

Question


Context:

I have a datacube with 3 variables (3D arrays with dims time, y, x). The datacube is too big to fit in memory, so I chunk it with xarray/dask. I want to apply a function to every (x, y) cell of every variable in the datacube.

Problem:

My method takes a long time to load a single cell (1 minute), and I have to do that 112200 times. I use a for loop with dataset.variable.isel(x=i, y=j).values to load a single 1D array from my variables. Is there a better way to do that? Also, given that my dataset is chunked, is there a way to do this in parallel for all the chunks at once?

Code example:

# Setup
import xarray as xr
import numpy as np

# Create the dimensions
x = np.linspace(0, 99, 100)
y = np.linspace(0, 349, 350)
time = np.linspace(0, 299, 300)

# Create the dataset
xrds = xr.Dataset()

# Add the dimensions to the dataset
xrds['time'] = time
xrds['y'] = y
xrds['x'] = x



# Create the random data variables with chunking
chunksize = (10, 100, 100)  # Chunk size for the variables
data_var1 = np.random.rand(len(time), len(y), len(x))
data_var2 = np.random.rand(len(time), len(y), len(x))
data_var3 = np.random.rand(len(time), len(y), len(x))

xrds['data_var1'] = (('time', 'y', 'x'), data_var1, {'chunks': chunksize})
xrds['data_var2'] = (('time', 'y', 'x'), data_var2, {'chunks': chunksize})
xrds['data_var3'] = (('time', 'y', 'x'), data_var3, {'chunks': chunksize})

#### ---- My Attempt ---- ####

# Iterate through all the variables in my dataset
for var_name, var_data in xrds.data_vars.items():

    # if variable is 3D
    if var_data.shape == (xrds.dims['time'], xrds.dims['y'], xrds.dims['x']):

        # Iterate through every cell of the variable along the x and y axis only
        for i in range(xrds.dims['y']):
            for j in range(xrds.dims['x']):

                # Load a single 1D cell into memory (len(cell) = len(time))
                print(xrds[var_name].isel(y=i, x=j).values)

Answer 1

Score: 1


I find that explicitly iterating over the DataArray is faster than calling isel() for each cell, by roughly 10 to 15% in my test.

Example:

    for var_name, var_data in xrds.data_vars.items():

        # if variable is 3D
        if var_data.shape == (xrds.dims['time'], xrds.dims['y'], xrds.dims['x']):

            # Iterate through every cell of the variable along the x and y axis only
            for i_array in var_data.transpose('x', 'y', 'time'):
                x_coordinate = i_array.x.item()
                for cell in i_array.transpose('y', 'time'):
                    y_coordinate = cell.y.item()
                    # Do something with cell

This takes 17.38 s, versus 20.47 s for the original isel() loop.
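To check that the two access patterns really return the same cells, here is a minimal sketch on a much smaller cube. The array sizes, the helper names cells_with_isel and cells_by_iteration, and the use of time.perf_counter are all assumptions made for illustration; they are not part of the original benchmark, and the timings will differ from machine to machine.

```python
import time

import numpy as np
import xarray as xr

# Small, hypothetical cube so the comparison finishes quickly.
rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.random((30, 20, 10)),
    dims=("time", "y", "x"),
    coords={"time": np.arange(30), "y": np.arange(20), "x": np.arange(10)},
    name="data_var1",
)


def cells_with_isel(arr):
    """Collect every (x, y) time series with one isel() call per cell."""
    out = {}
    for j in range(arr.sizes["x"]):
        for i in range(arr.sizes["y"]):
            out[(j, i)] = arr.isel(y=i, x=j).values
    return out


def cells_by_iteration(arr):
    """Collect the same time series by iterating over the array itself."""
    out = {}
    for j, x_slice in enumerate(arr.transpose("x", "y", "time")):
        # x_slice has dims (y, time); iterating yields one cell per y index.
        for i, cell in enumerate(x_slice):
            out[(j, i)] = cell.values
    return out


start = time.perf_counter()
via_isel = cells_with_isel(da)
t_isel = time.perf_counter() - start

start = time.perf_counter()
via_iter = cells_by_iteration(da)
t_iter = time.perf_counter() - start

print(f"isel loop: {t_isel:.3f} s, explicit iteration: {t_iter:.3f} s")
```

Both helpers key the result by (x index, y index), so the dictionaries can be compared directly before trusting any timing difference.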

PS: The line chunksize = (10, 100, 100) looks suspicious to me. If you want to load the array for the entire time axis of a cell at once, the chunks should be laid out so that this does not require reading multiple chunks; chunksize = (len(time), 100, 100) seems like it would be more efficient. However, I benchmarked both settings and it made no difference at this data size. It may make a difference on your larger problem.
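One possible reason the benchmark showed no difference: in the question's setup, the {'chunks': chunksize} dict sits in the third position of the tuple, which xarray treats as attrs, so the variables appear to remain plain NumPy arrays rather than dask arrays. The sketch below illustrates this and shows one way to get a layout with the whole time axis in a single chunk, using DataArray.chunk. It assumes dask is installed; the variable names are hypothetical.

```python
import numpy as np
import xarray as xr

time = np.arange(300)
y = np.arange(350)
x = np.arange(100)
data = np.random.rand(len(time), len(y), len(x))

# As in the question: the dict lands in .attrs and does not chunk anything.
ds = xr.Dataset()
ds["data_var1"] = (("time", "y", "x"), data, {"chunks": (10, 100, 100)})
print(ds["data_var1"].chunks)  # no dask chunks on this variable
print(ds["data_var1"].attrs)   # the tuple is stored here as metadata

# Explicit chunking (assumes dask is available).
# -1 means "one chunk spanning the whole dimension", so a single
# (x, y) cell's full time series lives in exactly one chunk.
chunked = ds["data_var1"].chunk({"time": -1, "y": 100, "x": 100})
print(chunked.chunks)

# Loading one cell now touches only one chunk.
cell = chunked.isel(y=0, x=0).values
print(cell.shape)
```

With data read lazily from disk (for example through xr.open_dataset with a chunks argument), the same idea applies: keeping the time axis in one chunk avoids opening many chunks per cell.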

huangapple
  • Posted on June 2, 2023, 02:20:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76384679.html