Unable to create/save/load very large array on disk
Question
For learning purposes I want to create, save, and then load (into tensorflow.keras) a very large int array with a length on the order of 10^10. I initially tried numpy, but it failed at the creation stage.
Example code:
import numpy as np
x = np.ones((274576, 200, 200, 1), dtype='int')
Then I used dask.array and succeeded in creating the array, but saving it in HDF5 format takes a lot of time and memory. Are there any alternatives with better speed and space efficiency?
Example code:
import dask.array as da
x = da.ones((274576, 200, 200, 1), dtype='float')  # create
da.to_hdf5('x.hdf5', {'x': x})  # save
import h5py
y = h5py.File('x.hdf5', 'r')['x'][:]  # read
print(y)
Also, fitting this dask array directly into model.fit gives a warning. Is there a conversion step that needs to be done to avoid this?
Warning message:
WARNING:tensorflow:Keras is training/fitting/evaluating on array-like data. Keras may not be optimized for this format, so if your input data format is supported by TensorFlow I/O (https://github.com/tensorflow/io) we recommend using that to load a Dataset instead.
Answer 1
Score: 1
You really have two questions: 1) about HDF5 storage and 2) about reading the file with tensorflow model.fit(). My answer addresses the first question.
When creating large datasets with HDF5, it's important to use chunked storage and set the chunks parameter appropriately. It's not clear from your post what values dask sets when you use da.to_hdf5().
I was able to reduce file creation time to 6 mins for your example using h5py and chunked storage. Code at the end.
By default, HDF5 dataset storage is contiguous. When you enable chunked storage, data is stored and accessed in fixed-size chunks. Chunking has performance implications. The recommended chunk size should be between 10 KiB and 1 MiB (larger for larger datasets). Also, when any element in a chunk is accessed, the entire chunk is read from disk. So, the "best" chunk shape depends on how you will access the data.
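If you stay with dask, one option (a sketch only, not benchmarked on your data) is to set the on-disk chunking explicitly when saving: as far as I know, da.to_hdf5() forwards extra keyword arguments to h5py's create_dataset(), so a chunks= value (and optionally compression) can be passed through. The chunk shape below is an assumption you should benchmark for your own access pattern.
import dask.array as da

# dask's in-memory block size (what dask computes with)
x = da.ones((274576, 200, 200, 1), dtype='int', chunks=(4, 200, 200, 1))

# on-disk HDF5 chunking; extra kwargs are passed to h5py create_dataset()
da.to_hdf5('x.hdf5', {'x': x}, chunks=(4, 200, 200, 1), compression='lzf')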
Note: this line y=h5py.File('x.hdf5','r')['x'][:] will read the entire dataset into a NumPy array (all 42 GiB of it!). I'm surprised it works. It's better to create an h5py dataset object like this: x_ds=h5py.File('x.hdf5','r')['x']. For many operations, the object behaves like an array. When you need an array object, you can use NumPy slice notation to read the desired data (for example: x_ds[0:5,:,:,:]).
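For example (a minimal sketch; the file name and dataset name are taken from the question's code):
import h5py

with h5py.File('x.hdf5', 'r') as h5f:
    x_ds = h5f['x']                 # dataset object; no data is read yet
    print(x_ds.shape, x_ds.dtype, x_ds.chunks)
    batch = x_ds[0:5, :, :, :]      # only this slice is read into a NumPy array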
Here is a method to create an HDF5 file with a dataset of shape=(274_576,200,200,1). Note that it uses chunked storage, and data is added in blocks that match the chunk size. The appropriate chunks parameter depends on how you plan to access the data. I set it assuming you will read "rows" of data (on axis 0). Modify my values to benchmark how the resulting file behaves.
import time
import h5py
import numpy as np

chunksize = (c0, c1, c2, c3) = (4, 200, 200, 1)
nrows = 274_576

start = time.time()
with h5py.File('SO_76299271.h5', 'w') as h5f:
    h5f.create_dataset('x', shape=(nrows, 200, 200, 1), dtype='int', chunks=chunksize)  # creates empty dataset of ints
    for i in range(nrows // c0):
        arr = np.random.randint(0, 255, size=chunksize)
        h5f['x'][i*c0:(i*c0 + c0)] = arr
print(f'File creation time: {time.time() - start:.1f} s')

with h5py.File('SO_76299271.h5', 'r') as h5f:
    x_ds = h5f['x']  # creates an h5py dataset object
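As a usage sketch of that design choice (the batch size is an assumption matching c0 above), reading the data back in axis-0 slices that line up with the chunk shape means each read touches whole chunks only:
import h5py

batch = 4  # same as c0 above; chunk-aligned reads on axis 0
with h5py.File('SO_76299271.h5', 'r') as h5f:
    x_ds = h5f['x']
    for i in range(0, x_ds.shape[0], batch):
        block = x_ds[i:i + batch]   # NumPy array of shape (batch, 200, 200, 1)
        # ... process block here ...
        break  # remove this break to iterate over the whole dataset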