Chunked xarray: load only 1 cell in memory efficiently
Context:
I have a datacube with 3 variables (3D arrays, dims: time, y, x). The datacube is too big to fit in memory, so I chunk it with xarray/dask. I want to apply a function to every cell in x, y of every variable in my datacube.
Problem:
My method takes a long time to load a single cell (1 minute), and I have to do that 112200 times. I use a for loop with dataset.variable.isel(x=i, y=j).values to load a single 1D array from my variables. Is there a better way to do this? Also, knowing that my dataset is chunked, is there a way to do this in parallel for all the chunks at once?
Code example:
# Setup
import xarray as xr
import numpy as np

# Create the dimensions
x = np.linspace(0, 99, 100)
y = np.linspace(0, 349, 350)
time = np.linspace(0, 299, 300)

# Create the dataset
xrds = xr.Dataset()

# Add the dimensions to the dataset
xrds['time'] = time
xrds['y'] = y
xrds['x'] = x

# Create the random data variables with chunking
chunksize = (10, 100, 100)  # Chunk size for the variables
data_var1 = np.random.rand(len(time), len(y), len(x))
data_var2 = np.random.rand(len(time), len(y), len(x))
data_var3 = np.random.rand(len(time), len(y), len(x))
xrds['data_var1'] = (('time', 'y', 'x'), data_var1, {'chunks': chunksize})
xrds['data_var2'] = (('time', 'y', 'x'), data_var2, {'chunks': chunksize})
xrds['data_var3'] = (('time', 'y', 'x'), data_var3, {'chunks': chunksize})

#### ---- My Attempt ---- ####
# Iterate through all the variables in my dataset
for var_name, var_data in xrds.data_vars.items():
    # If the variable is 3D
    if var_data.shape == (xrds.sizes['time'], xrds.sizes['y'], xrds.sizes['x']):
        # Iterate through every cell of the variable along the x and y axes only
        for i in range(xrds.sizes['y']):
            for j in range(xrds.sizes['x']):
                # Load a single 1D cell into memory (len(cell) == len(time))
                print(xrds[var_name].isel(y=i, x=j).values)
Answer 1
Score: 1
I find that explicitly iterating over the xarray is faster than isel(), by about 10%.
Example:
for var_name, var_data in xrds.data_vars.items():
    # If the variable is 3D
    if var_data.shape == (xrds.sizes['time'], xrds.sizes['y'], xrds.sizes['x']):
        # Iterate through every cell of the variable along the x and y axes only
        for i_array in var_data.transpose('x', 'y', 'time'):
            x_coordinate = i_array.x.item()
            for cell in i_array.transpose('y', 'time'):
                y_coordinate = cell.y.item()
                # Do something with cell (a 1D array along time)
This takes 17.38s, versus 20.47s for the original.
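
As a sanity check on the approach above, here is a minimal self-contained sketch that compares the isel() loop with direct iteration on a tiny cube and confirms that both yield the same 1D cells in the same order. The dataset sizes and the helper names cells_by_isel and cells_by_iteration are made up for illustration and are not part of the original answer.

```python
import numpy as np
import xarray as xr

# Tiny hypothetical cube so the sketch runs instantly (sizes are assumptions)
time = np.arange(6)
y = np.arange(4)
x = np.arange(3)
ds = xr.Dataset(
    {"data_var1": (("time", "y", "x"), np.random.rand(len(time), len(y), len(x)))},
    coords={"time": time, "y": y, "x": x},
)


def cells_by_isel(da):
    """Yield every (x, y) cell as a 1D NumPy array using positional indexing."""
    for j in range(da.sizes["x"]):
        for i in range(da.sizes["y"]):
            yield da.isel(y=i, x=j).values


def cells_by_iteration(da):
    """Yield the same cells by iterating over the DataArray directly.

    Iterating a DataArray walks its first dimension, so after the transpose
    the outer loop runs over x and the inner loop over y.
    """
    for column in da.transpose("x", "y", "time"):
        for cell in column:
            yield cell.values


a = list(cells_by_isel(ds["data_var1"]))
b = list(cells_by_iteration(ds["data_var1"]))
same = all(np.array_equal(p, q) for p, q in zip(a, b))
print(same, len(a), a[0].shape)
```

Any speed difference between the two will depend on the data size and on the xarray version, so it is worth timing both on your own data.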
PS: The line chunksize = (10, 100, 100) looks very suspicious to me. If you want to load an array spanning the entire time axis at once, the chunks should be set so that doing so does not require reading multiple chunks. Something like chunksize = (len(time), 100, 100) seems like it would be more efficient. However, I benchmarked it both ways and it makes no difference for this data size. It may make a difference on your larger problem.
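
To make the chunking point concrete, here is a hedged sketch under a few assumptions: dask is installed, the cube is shrunk to a toy size, and cell_statistic is a hypothetical stand-in for whatever per-cell function you need. Note that in the question's setup the third element of the (dims, data, attrs) tuple is stored as plain attributes, so it does not create dask chunks; the sketch instead calls .chunk() so that the whole time axis sits in one chunk. It then uses xr.apply_ufunc, which is not part of the original answer, as one possible way to run a per-cell function across chunks in parallel.

```python
import numpy as np
import xarray as xr

# Toy cube (sizes are assumptions chosen so the sketch runs quickly)
time = np.arange(30)
y = np.arange(35)
x = np.arange(10)

xrds = xr.Dataset(coords={"time": time, "y": y, "x": x})
# Same tuple form as the question: the dict ends up in .attrs only
xrds["data_var1"] = (
    ("time", "y", "x"),
    np.random.rand(len(time), len(y), len(x)),
    {"chunks": (10, 100, 100)},
)
print(xrds["data_var1"].chunks)  # None: the variable is still a plain NumPy array

# Rechunk so each chunk holds the full time axis (-1) for a 10 x 10 tile
chunked = xrds["data_var1"].chunk({"time": -1, "y": 10, "x": 10})
print(chunked.chunks)


def cell_statistic(cell):
    """Hypothetical per-cell function: range of a 1D time series."""
    return cell.max() - cell.min()


# Apply the function to every (y, x) cell; dask handles the chunks in parallel
result = xr.apply_ufunc(
    cell_statistic,
    chunked,
    input_core_dims=[["time"]],  # the function consumes the time axis
    vectorize=True,              # loop over the remaining dims per cell
    dask="parallelized",
    output_dtypes=[chunked.dtype],
)
computed = result.compute()

# Cross-check against the equivalent built-in reduction
expected = xrds["data_var1"].max("time") - xrds["data_var1"].min("time")
print(np.allclose(computed.values, expected.values))
```

With vectorize=True the function is still called once per cell in Python, so for simple statistics a built-in reduction such as the one used in the cross-check is likely to be faster; the apply_ufunc route is mainly useful when the per-cell function has no vectorized equivalent.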