Can Xarray's open_mfdataset() function work with a variable number of files in a nested structure?
Question
I am attempting to use Xarray's open_mfdataset() function to open a large number of spatiotemporal files. However, some levels of the hierarchy contain different numbers of files, even though they cover the same dimensions once combined.
Imagine the file structure that I want to process looks like this:
nested_paths = [
    ["r1_2000_2050.nc", "r1_2050_2100.nc"],
    ["r2_2000_2025.nc", "r2_2025_2050.nc", "r2_2050_2100.nc"],
]
All of the dimensions match: the spatial dimensions are the same and, although the second sublist has more files, the time dimension still runs from 2000 to 2100 in both. I have confirmed that I can combine these files through a manual series of xarray merges and concats, but I want to take advantage of open_mfdataset's parallel loading and chunking structure so that I don't load everything into memory.
When I try to load this structure with:
xr.open_mfdataset(nested_paths, combine='nested', concat_dim=['realization', 'time'])
I get this error:
ValueError: The supplied objects do not form a hypercube because sub-lists do not have consistent lengths along dimension0
If this is possible to do with xarray, that would be extremely beneficial.
Answer 1
Score: 1
Unfortunately, Xarray doesn't currently support this. (I don't think you could even use Kerchunk to get around this, because it would imply "ragged"-length chunks.)
The reason xarray doesn't support this is that it would break the symmetry between dimensions. In your example, concatenating along 'time' and then along 'realization' would be well-defined, but concatenating along 'realization' and then 'time' would have a dimension mismatch. For the cases that combine supports right now, we can guarantee that either order would work.
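To make the "hypercube" requirement concrete, here is a minimal pure-Python sketch of the kind of consistency check involved. The function name nested_shape is hypothetical and this is not xarray's actual implementation; it only illustrates why sub-lists of lengths 2 and 3 cannot be treated as a regular grid of files.

```python
def nested_shape(nested):
    """Return the grid shape of a nested list, or raise if it is ragged.

    Hypothetical helper for illustration only, not xarray's own code.
    """
    if not isinstance(nested, list):
        return ()  # a leaf, e.g. a single file path
    child_shapes = {nested_shape(item) for item in nested}
    if len(child_shapes) > 1:
        # Sub-lists disagree on shape, so no single grid describes them
        raise ValueError(f"sub-lists have inconsistent shapes: {sorted(child_shapes)}")
    return (len(nested),) + (child_shapes.pop() if child_shapes else ())


# A regular 2 x 2 layout forms a hypercube
regular = [["a.nc", "b.nc"], ["c.nc", "d.nc"]]
regular_shape = nested_shape(regular)

# The layout from the question is ragged: one sub-list has 2 files, the other 3
ragged = [
    ["r1_2000_2050.nc", "r1_2050_2100.nc"],
    ["r2_2000_2025.nc", "r2_2025_2050.nc", "r2_2050_2100.nc"],
]
try:
    nested_shape(ragged)
    ragged_error = None
except ValueError as err:
    ragged_error = str(err)
```

With a regular layout the files can be joined along either axis first; with the ragged layout only the time-first order lines up.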
We could perhaps imagine relaxing this constraint in xarray, so that combine='nested' would succeed in this case if concat_dim=['time', 'realization'], but fail if concat_dim=['realization', 'time'].
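In the meantime, the time-first order can be applied by hand. The sketch below is a possible workaround rather than a built-in feature: it calls open_mfdataset once per realization along 'time' and then joins the results with xr.concat along a new 'realization' dimension. The helper names group_by_realization and open_ragged are hypothetical, and the file-name pattern '<label>_<start>_<end>.nc' is an assumption taken from the question. With dask installed the data variables should remain lazy, although xr.concat may still read small coordinate variables in order to compare them.

```python
from collections import defaultdict


def group_by_realization(paths):
    """Group file names such as 'r2_2025_2050.nc' by their leading label.

    Assumes names follow '<label>_<start>_<end>.nc' and sort into time order.
    """
    groups = defaultdict(list)
    for path in paths:
        label = path.split("/")[-1].split("_")[0]
        groups[label].append(path)
    return {label: sorted(files) for label, files in sorted(groups.items())}


def open_ragged(groups, **open_kwargs):
    """Open each realization along 'time', then join along 'realization'."""
    import xarray as xr  # imported here so the grouping helper runs without xarray

    per_realization = [
        xr.open_mfdataset(files, combine="nested", concat_dim="time", **open_kwargs)
        for files in groups.values()
    ]
    combined = xr.concat(per_realization, dim="realization")
    # Label the new dimension with the realization names
    return combined.assign_coords(realization=list(groups))


groups = group_by_realization(
    [
        "r1_2000_2050.nc",
        "r1_2050_2100.nc",
        "r2_2000_2025.nc",
        "r2_2025_2050.nc",
        "r2_2050_2100.nc",
    ]
)
# Usage (requires the files, xarray, and dask to be available):
# ds = open_ragged(groups, parallel=True)
```

Each realization may have any number of files, because the ragged axis is resolved inside its own open_mfdataset call before the realizations are stacked.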
If this is something you would like to see in xarray, then you are welcome to help contribute it as a new feature, but it's not something we are likely to prioritize implementing soon. (If you want to try implementing it, I would start by disabling the exceptions here and see how much further through the code it gets.)