Unable to create/save/load very large array on disk

Question

For learning purposes I want to create, save, and then load (into tensorflow.keras) a very large int array with a length on the order of 10^10.

I initially tried NumPy, but it failed at the creation stage.

Example code:

import numpy as np
x = np.ones((274576, 200, 200, 1), dtype='int')  # ~1.1e10 elements; fails here

Then I used dask.array and succeeded in creating the array, but saving it in HDF5 format takes a lot of time and memory.

Are there any alternatives with better speed and space efficiency?

Example code:

import dask.array as da
import h5py

x = da.ones((274576, 200, 200, 1), dtype='float')  # create
da.to_hdf5('x.hdf5', {'x': x})                     # save
y = h5py.File('x.hdf5', 'r')['x'][:]               # read
print(y)

Also, feeding the dask array directly into model.fit produces a warning.

Is there any conversion step that needs to be done to avoid this?

Warning message:

WARNING:tensorflow:Keras is training/fitting/evaluating on array-like data. Keras may not be optimized for this format, so if your input data format is supported by TensorFlow I/O (https://github.com/tensorflow/io) we recommend using that to load a Dataset instead.

Answer 1

Score: 1

You really have two questions: 1) about HDF5 storage, and 2) about reading the file with TensorFlow's model.fit(). My answer addresses the first question.

When creating large datasets with HDF5, it's important to use chunked storage and to set the chunks parameter appropriately. It's not clear from your post what values dask sets when you use da.to_hdf5(). Using h5py and chunked storage, I was able to reduce file creation time to 6 minutes for your example. Code at the end.
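
As a side note, and only as a sketch (assuming your dask version forwards keyword arguments of da.to_hdf5 to h5py's create_dataset, as recent releases do), you can control the on-disk chunk shape directly from the dask call:

import dask.array as da

x = da.ones((274576, 200, 200, 1), dtype='float')
# 'chunks' here sets the HDF5 chunk shape on disk; extra kwargs go to create_dataset
da.to_hdf5('x.hdf5', {'x': x}, chunks=(4, 200, 200, 1))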

By default, HDF5 dataset storage is contiguous. When you enable chunked storage, data is stored and accessed in fixed-size chunks. Chunking has performance implications: the recommended chunk size is between 10 KiB and 1 MiB (larger for larger datasets), and when any element in a chunk is accessed, the entire chunk is read from disk. So the "best" chunk shape depends on how you will access the data.
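
For example, a quick sanity check on a candidate chunk shape is to compute its size in bytes; here for the (4, 200, 200, 1) chunk shape used below, assuming a 4-byte integer dtype:

import numpy as np

chunk_shape = (4, 200, 200, 1)
itemsize = np.dtype('int32').itemsize               # 4 bytes per element (assumed dtype)
chunk_bytes = int(np.prod(chunk_shape)) * itemsize  # 160,000 elements * 4 B = 640,000 B
print(f'{chunk_bytes / 1024:.0f} KiB per chunk')    # ~625 KiB, inside the 10 KiB - 1 MiB guideline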

Note: the line y=h5py.File('x.hdf5','r')['x'][:] reads the entire dataset into a NumPy array (all ~42 GiB of it!). I'm surprised it works. It's better to create an h5py dataset object like this: x_ds=h5py.File('x.hdf5','r')['x']. For many operations, the object behaves like an array, and when you need an actual array you can use NumPy slice notation to read just the data you want (for example: x_ds[0:5,:,:,:]).
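
A minimal sketch of that access pattern (file and dataset names follow your example above):

import h5py

with h5py.File('x.hdf5', 'r') as h5f:
    x_ds = h5f['x']             # dataset object; nothing is read into memory yet
    batch = x_ds[0:5, :, :, :]  # reads only the first 5 rows as a NumPy array
    print(batch.shape, batch.dtype)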

Here is a method to create an HDF5 file with a dataset of shape=(274_576,200,200,1). Note that it uses chunked storage, and data is added in blocks that match the chunk size. The appropriate chunks parameter depends on how you plan to access the data; I set it assuming you will read "rows" of data (along axis 0). Modify my values to benchmark how the resulting file behaves.

import time
import h5py
import numpy as np

chunksize = (c0, c1, c2, c3) = (4, 200, 200, 1)
nrows = 274_576
start = time.time()

with h5py.File('SO_76299271.h5', 'w') as h5f:
    h5f.create_dataset('x', shape=(nrows, 200, 200, 1), dtype='int', chunks=chunksize)  # creates empty dataset of ints

    for i in range(nrows//c0):
        arr = np.random.randint(0, 255, size=chunksize)
        h5f['x'][i*c0:(i*c0 + c0)] = arr  # write one chunk-sized block of rows

print(f'File creation time: {time.time() - start:.1f} s')

with h5py.File('SO_76299271.h5', 'r') as h5f:
    x_ds = h5f['x']  # creates an h5py dataset object (no data loaded into memory)
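
As for the second question (the Keras warning), here is only a rough, unbenchmarked sketch, assuming tf.data is an acceptable route: stream the file created above into model.fit via tf.data.Dataset.from_generator instead of an in-memory array. The file name, batch size, and lack of labels are placeholders from this toy example.

import h5py
import numpy as np
import tensorflow as tf

def hdf5_batches(path='SO_76299271.h5', batch_rows=4):
    # yield chunk-aligned blocks of rows so each read touches whole chunks only
    with h5py.File(path, 'r') as h5f:
        x_ds = h5f['x']
        for i in range(0, x_ds.shape[0], batch_rows):
            yield x_ds[i:i + batch_rows].astype(np.float32)

ds = tf.data.Dataset.from_generator(
    hdf5_batches,
    output_signature=tf.TensorSpec(shape=(None, 200, 200, 1), dtype=tf.float32),
)
# model.fit(ds, ...) can then stream batches from disk; a real pipeline would
# yield (features, labels) pairs and use a matching output_signature tuple.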
