Can Xarray's open_mfdataset() function work with variable number of files in the nested structure?

Question

I am attempting to use Xarray's open_mfdataset() function to open a large number of spatiotemporal files. However, some levels of the hierarchy have different numbers of files even though they result in the same dimensions.

Imagine the file structure that I want to process looks like this:

[
    [r1_2000_2050.nc, r1_2050_2100.nc],
    [r2_2000_2025.nc, r2_2025_2050.nc, r2_2050_2100.nc]
]

All of the dimensions match: the spatial dimensions are the same, and although the second sublist has more files, the time dimension still runs from 2000 to 2100 in both. I have confirmed that I can combine these files through a manual series of xarray merges and concats, but I want to take advantage of open_mfdataset's parallel loading and chunking so that I don't load everything into memory.

When I try to load this structure with:

xr.open_mfdataset(nested_paths, combine='nested', concat_dim=['realization', 'time'])

I get this error:

ValueError: The supplied objects do not form a hypercube because sub-lists do not have consistent lengths along dimension0

If this is possible to do with xarray, that would be extremely beneficial.

Answer 1

Score: 1

Unfortunately, Xarray doesn't currently support this. (I don't think you could even use Kerchunk to get around this, because it would imply "ragged"-length chunks.)

Xarray doesn't support this because it would break the symmetry between dimensions. In your example, concatenating along 'time' and then along 'realization' would be well-defined, but concatenating along 'realization' and then along 'time' would hit a dimension mismatch. For the cases that combine supports right now, we can guarantee that either order works.
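To make the asymmetry concrete, here is a toy sketch in plain Python. It is not xarray's actual implementation: each file is reduced to the number of years it covers, and the two function names are made up for illustration. It only shows why one combine order is well-defined for the layout in the question and the other is not.

```python
# Toy model: each file is a (name, years covered) pair taken from the question.
nested = [
    [("r1_2000_2050.nc", 50), ("r1_2050_2100.nc", 50)],
    [("r2_2000_2025.nc", 25), ("r2_2025_2050.nc", 25), ("r2_2050_2100.nc", 50)],
]


def time_then_realization(nested):
    """Join each sub-list along time first, then stack the joined results."""
    lengths = [sum(years for _, years in row) for row in nested]
    if len(set(lengths)) != 1:
        raise ValueError(f"time lengths differ between realizations: {lengths}")
    return {"realization": len(nested), "time": lengths[0]}


def realization_then_time(nested):
    """Stack files position by position first, then join the stacks along time."""
    row_lengths = {len(row) for row in nested}
    if len(row_lengths) != 1:
        # This is the situation the hypercube error complains about.
        raise ValueError(f"sub-lists have inconsistent lengths: {sorted(row_lengths)}")
    columns = list(zip(*nested))
    for column in columns:
        spans = {years for _, years in column}
        if len(spans) != 1:
            raise ValueError(f"time spans differ within a column: {sorted(spans)}")
    return {"realization": len(nested), "time": sum(col[0][1] for col in columns)}
```

With the layout from the question, time_then_realization(nested) succeeds because both realizations cover 100 years, while realization_then_time(nested) raises a ValueError because the first sub-list has two files and the second has three.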

We could perhaps imagine relaxing this constraint in xarray, so that combine='nested' would succeed in this case if concat_dim=['time', 'realization'], but fail if concat_dim=['realization', 'time'].

If this is something you would like to see in xarray, you are welcome to contribute it as a new feature, but it's not something we are likely to prioritize implementing soon. (If you want to try implementing it, I would start by disabling the exceptions here and see how much further through the code it gets.)
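In the meantime, one possible workaround (a sketch under assumptions, not something xarray provides as a single call) is to run open_mfdataset once per realization, concatenating along 'time', and then join the results along a new 'realization' dimension with xr.concat. The helper name open_ragged_nested is hypothetical, and the sketch assumes every realization ends up with the same time coordinate and that dask is installed so the data stays lazily loaded.

```python
import xarray as xr


def open_ragged_nested(nested_paths, inner_dim="time", outer_dim="realization", **open_kwargs):
    """Open a ragged nested list of files as one lazily loaded dataset.

    Hypothetical helper (not part of xarray): each sub-list gets its own
    open_mfdataset call, so sub-lists may hold different numbers of files.
    """
    per_group = [
        # combine="nested" with a single concat_dim joins one sub-list in list order.
        xr.open_mfdataset(paths, combine="nested", concat_dim=inner_dim, **open_kwargs)
        for paths in nested_paths
    ]
    # xr.concat creates outer_dim as a new dimension. Assumption: the groups share
    # the same inner_dim coordinate; otherwise the default outer join pads with NaN.
    return xr.concat(per_group, dim=outer_dim)
```

It could be called as open_ragged_nested(nested_paths, parallel=True), where extra keyword arguments are passed straight through to open_mfdataset. If neighbouring files both contain the boundary time step (for example 2050), the duplicate would need to be dropped before or after combining.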

huangapple
  • Posted on July 4, 2023 at 22:15:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76613537.html