Extracting Datasets from HDF5 File in Order Created

Question

I have an HDF5 file that I am trying to open with Python or MATLAB. The file consists of several datasets, all in the root group, which were saved to the file in some order. I want to extract the datasets in the order they were written. I know that this order is encoded in the HDF5 file, because when I open it with HDFView there is an "Object Ref" number associated with each dataset. These Object Ref IDs are lower for datasets written earlier and higher for those written later.

When I open the file with Python (h5py package), the datasets are returned in alphabetical order. I can't find any way to read the Object Ref that I see in HDFView so that I can use it in Python. Is there any way to extract the datasets in the order they were written, in Python or MATLAB (or on any other platform)?

This is the code I used in Python to get the datasets in alphabetical order:

import h5py

with h5py.File(file, 'r') as f:  # file: path to the HDF5 file
    for k in f.keys():
        print(k)

I'm looking for a way to do something like this (pseudocode; object_refs() and sorted_order() are made-up names, not real h5py methods):

with h5py.File(file) as f:
    keys = list(f.keys())
    object_refs = f.object_refs()  # pseudocode
    indexes_in_sorted_order = object_refs.sorted_order()  # pseudocode
    for i in indexes_in_sorted_order: print(keys[i])

Answer 1

Score: 1

@Homer512 is correct: h5py doesn't have an API to get that value. That said, you might be able to use each dataset's "offset" value instead. I did some limited testing with datasets that were NOT created in alphabetical order, and the offset values appear to increase with the order of creation. To get the offset you have to use the low-level API, through the dataset's DatasetID object (its .id attribute).

Here is an example that creates 6 datasets in non-alphabetical order. After creating them, it loops over the datasets, builds a dictionary of name: offset pairs, then re-sorts the dictionary by value. Finally, it loops over the names in the sorted dictionary to get the datasets in offset order. (You could also create a sorted list of the dataset names if you're not interested in the offset values.)

Note: If you are going to do this frequently, I suggest adding creation time as a dataset attribute.
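As a rough sketch of what this note could look like in practice (it is separate from the offset approach in this answer, and it assumes you control the code that writes the file): each dataset gets a timestamp attribute at creation, and a reader sorts on it. The helper names create_stamped and names_in_creation_order and the attribute names created_ns and creation_index are invented for illustration. The extra index is only a tie-breaker in case the clock is too coarse to separate two datasets, and it assumes no datasets are deleted from the group.

```python
import os
import tempfile
import time

import h5py
import numpy as np


def create_stamped(group, name, data):
    """Create a dataset and record when it was created (hypothetical helper)."""
    dset = group.create_dataset(name, data=data)
    # Attribute names are invented for this sketch, not an HDF5/h5py convention.
    dset.attrs["created_ns"] = time.time_ns()       # wall-clock time, nanoseconds
    dset.attrs["creation_index"] = len(group) - 1   # tie-breaker for coarse clocks
    return dset


def names_in_creation_order(h5_path):
    """Return root-level names sorted by (created_ns, creation_index)."""
    with h5py.File(h5_path, "r") as h5f:
        stamps = {
            name: (int(obj.attrs["created_ns"]), int(obj.attrs["creation_index"]))
            for name, obj in h5f.items()
        }
    return sorted(stamps, key=stamps.get)


ds_names = ["alpha", "zebra", "bravo", "yankee", "charlie", "xray"]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "stamped_demo.h5")
    with h5py.File(demo_path, "w") as h5f:
        for i, name in enumerate(ds_names):
            create_stamped(h5f, name, np.arange(10) + 10 * i)
    stamped_order = names_in_creation_order(demo_path)

print(stamped_order)
```

Unlike the offset, these attributes are explicit metadata, so they do not depend on how HDF5 happens to lay out data in the file. They only help for files written with such a helper, though, not for existing files.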

See code below:

import h5py
import numpy as np

# Create 6 datasets in non-alphabetical order
ds_names = ['alpha', 'zebra', 'bravo', 'yankee', 'charlie', 'xray']
cnt = 1
with h5py.File('SO_75624797.h5', 'w') as h5f:
    for name in ds_names:
        h5f.create_dataset(name, data=np.arange(cnt, cnt + 10))
        cnt += 10

offset_dict = {}
with h5py.File('SO_75624797.h5') as h5f:
    # Iterate in h5py's default order and record each dataset's file offset
    for dset in h5f:
        print(f"for dset: {dset}, Offset: {h5f[dset].id.get_offset()}")
        offset_dict[dset] = h5f[dset].id.get_offset()

    # Re-order the dictionary by offset value (ascending)
    offset_dict = dict(sorted(offset_dict.items(), key=lambda item: item[1]))

    print('')
    for dset in offset_dict:
        print(f"for dset: {dset}, Offset: {offset_dict[dset]}")
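The parenthetical remark about building a sorted list of names, rather than a dictionary, could be sketched roughly as follows. The helper name names_by_offset is hypothetical. One caveat worth handling: get_offset() returns None for datasets that have no single file offset (for example chunked or compact storage), so this sketch lists those last, in name order. Whether offset order really matches creation order is the answer's observation from limited testing, not a guarantee.

```python
import os
import tempfile

import h5py
import numpy as np


def names_by_offset(h5_path):
    """Return root-level dataset names sorted by file offset (hypothetical helper).

    Datasets whose get_offset() is None are appended last, in name order.
    """
    with h5py.File(h5_path, "r") as h5f:
        offsets = {
            name: obj.id.get_offset()
            for name, obj in h5f.items()
            if isinstance(obj, h5py.Dataset)
        }
    with_offset = sorted(
        (name for name in offsets if offsets[name] is not None), key=offsets.get
    )
    without_offset = sorted(name for name in offsets if offsets[name] is None)
    return with_offset + without_offset


# Small demo file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "offset_demo.h5")
    with h5py.File(demo_path, "w") as h5f:
        for i, name in enumerate(["alpha", "zebra", "bravo"]):
            h5f.create_dataset(name, data=np.arange(10) + 10 * i)
    ordered_names = names_by_offset(demo_path)

print(ordered_names)
```

Filtering with isinstance(obj, h5py.Dataset) skips any groups in the root, which have no DatasetID and therefore no get_offset() method.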

Posted by huangapple on 2023-03-07 05:18:02
When reposting, please keep the link to this article: https://go.coder-hub.com/75655945.html