HDF5 files and plotting using chunks

Question
I'm new to HDF5 files and I don't understand how to access the chunks in a dataset.
I have a fairly large dataset (1536, 2048, 11, 18, 2) which is chunked as (768, 1024, 1, 1, 1), so each chunk holds half of one image.
I want to plot the dataset, showing the mean value of each (whole) image (using matplotlib).
Question: how do I access individual chunks and how do I work with them? (How does h5py use them?)
This is my code:
import h5py
import numpy as np

bla = np.random.randint(0, 100, (1536, 2048, 11, 18, 2))
with h5py.File('test.h5', 'w') as f:
    grp = f.create_group('Measurement 1')
    grp.create_dataset('data', data=bla, chunks=(768, 1024, 1, 1, 1))
# no explicit f.close() needed: the "with" block closes the file
I have the following code to access the dataset, but I don't know how to access the chunks:
with h5py.File('test.h5', 'r') as hf:
    for dset in hf['Measurement 1'].keys():
        print(dset)
    ds_hf = hf['Measurement 1']['data']      # returns an HDF5 dataset object
    print(ds_hf)
    print(ds_hf.shape, ds_hf.dtype)
    data_f = hf['Measurement 1']['data'][:]  # adding [:] returns a numpy array
# no explicit hf.close() needed: the "with" block closes the file
I need the program to open each chunk, compute its mean value, and close it again before opening the next one, so my RAM doesn't fill up.
Answer 1

Score: 1

Here is some sample code showing how chunks work in HDF5. I wrote it in a general way; you can modify it based on your requirements:
import h5py
import numpy as np

# Generate random data
bla = np.random.randint(0, 100, (1536, 2048, 11, 18, 2))

# Create the HDF5 file and dataset
with h5py.File('test.h5', 'w') as f:
    grp = f.create_group('Measurement 1')
    grp.create_dataset('data', data=bla, chunks=(768, 1024, 1, 1, 1))

# Open the HDF5 file
with h5py.File('test.h5', 'r') as hf:
    # Access the dataset
    ds_hf = hf['Measurement 1']['data']
    print(ds_hf)
    print(ds_hf.shape, ds_hf.dtype)

    # Iterate over the chunks: iter_chunks() yields one tuple of slices per
    # stored chunk, so only one chunk is read into memory at a time
    for chunk_slices in ds_hf.iter_chunks():
        chunk = ds_hf[chunk_slices]
        # Process the chunk
        chunk_mean = np.mean(chunk)
        print(f"Chunk {chunk_slices}: mean value = {chunk_mean}")
# no explicit hf.close() needed: the "with" block closes the file
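Since each chunk here holds exactly half of one image (and both halves are the same size), the per-image means can be accumulated chunk by chunk without ever holding more than one chunk in memory. A minimal sketch of that idea, using a smaller stand-in shape so it runs quickly (the file name `chunk_means.h5` and the shapes are illustrative, not from the question; `iter_chunks()` requires a recent h5py):

```python
import h5py
import numpy as np

# Stand-in layout, same idea as the question: axes 0-1 are the image pixels,
# axes 2-4 index the images, and each chunk holds half of one image.
rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=(4, 6, 2, 3, 2))

with h5py.File('chunk_means.h5', 'w') as f:
    f.create_group('Measurement 1').create_dataset(
        'data', data=data, chunks=(2, 6, 1, 1, 1))

with h5py.File('chunk_means.h5', 'r') as hf:
    ds = hf['Measurement 1']['data']
    # Accumulate per-image pixel sums chunk by chunk;
    # only one chunk is in RAM at a time.
    sums = np.zeros(ds.shape[2:])
    for sl in ds.iter_chunks():   # sl is a tuple of slices covering one chunk
        chunk = ds[sl]
        sums[sl[2:]] += chunk.sum()
    # Each image has shape[0] * shape[1] pixels
    means = sums / (ds.shape[0] * ds.shape[1])

# Sanity check against the direct in-memory computation
assert np.allclose(means, data.mean(axis=(0, 1)))
print(means.shape)  # (2, 3, 2): one mean per image
```

The same accumulation works unchanged for the full-size dataset, because two equal-sized chunks per image together contribute exactly that image's pixel sum.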
Answer 2

Score: 1
Chunks are used to optimize I/O performance. HDF5 (and h5py) write/read data in chunked blocks (1 chunk at a time). This is handled in the background, and you do not have to worry about the chunking mechanism. The chunk size/shape is defined when you create the dataset and cannot be changed afterwards. If you need to change it, there are HDF5 utilities to do this (for example, h5repack can rewrite a dataset with a new chunk layout).
When reading data you don't have to worry about the chunk size (in general; see the comment at the end for more details). Use NumPy slice notation to read the desired slice, and h5py/HDF5 will do the chunked reads for you. You do not have to write your code to read exactly 1 chunk at a time.
Assuming axis 0 is the image index, the code below reads each image array into the image object (as a numpy array). It's much easier and cleaner than working with chunk objects.
with h5py.File('test.h5', 'r') as hf:
    ds_hf = hf['Measurement 1']['data']  # returns an HDF5 dataset object
    print(ds_hf.shape)
    for i in range(ds_hf.shape[0]):
        image = ds_hf[i]  # this returns a numpy array for image i
Although you don't have to worry about chunk size to read and write data, it's important to set an appropriate size for your use case. That discussion goes beyond your question; your chunk size is good for your application.
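For the dataset in the question specifically, axes 0-1 are the image pixels and the last three axes index the images, so one full image is the slice `ds_hf[:, :, i, j, k]` (which h5py assembles from that image's two chunks). A sketch of the per-image means with a smaller stand-in shape so it runs quickly (the file name `images.h5` and shapes are illustrative; the matplotlib call is left as a comment):

```python
import h5py
import numpy as np

# Stand-in data: axes 0-1 are pixels, the last three axes index the images
data = np.arange(4 * 6 * 2 * 3 * 2).reshape(4, 6, 2, 3, 2)
with h5py.File('images.h5', 'w') as f:
    f.create_group('Measurement 1').create_dataset(
        'data', data=data, chunks=(2, 6, 1, 1, 1))

means = []
with h5py.File('images.h5', 'r') as hf:
    ds_hf = hf['Measurement 1']['data']
    n0, n1, ni, nj, nk = ds_hf.shape
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                image = ds_hf[:, :, i, j, k]  # one full image as a numpy array
                means.append(image.mean())

print(len(means))  # one mean per image: 2 * 3 * 2 = 12
# To plot: import matplotlib.pyplot as plt; plt.plot(means); plt.show()
```

Only one image (two chunks) is in memory per iteration, which satisfies the RAM constraint in the question without touching the chunk machinery directly.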
Comments