Unable to create/save/load very large array on disk

Question

For learning purposes I want to create, save, and then load (into tensorflow.keras) a very large int array with a total size on the order of 10^10 elements.

I initially tried NumPy but failed at the creation stage.

Example code:

import numpy as np
x = np.ones((274576, 200, 200, 1), dtype='int')  # ~1.1e10 elements; allocation fails for lack of memory

Then I used dask.array and succeeded in creating the array, but saving it in HDF5 format takes a lot of time and memory.

Are there any alternatives with better speed and space efficiency?

Example code:

import dask.array as da
import h5py

x = da.ones((274576, 200, 200, 1), dtype='float')  # create (lazy dask array)
da.to_hdf5('x.hdf5', {'x': x})                     # save to HDF5
y = h5py.File('x.hdf5', 'r')['x'][:]               # read the whole dataset back
print(y)

Also, fitting dask arrays directly into model.fit gives a warning.

Is there any conversion step that needs to be done to avoid this?

Warning message:

WARNING:tensorflow:Keras is training/fitting/evaluating on array-like data. Keras may not be optimized for this format, so if your input data format is supported by TensorFlow I/O (https://github.com/tensorflow/io) we recommend using that to load a Dataset instead.
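
One common way to avoid this warning is to wrap the on-disk data in a tf.data.Dataset and pass that to model.fit instead of the raw array. The sketch below is illustrative only: it assumes the file and dataset names from the example above ('x.hdf5', 'x'), and the labels are placeholders, since the question does not describe any.

import h5py
import numpy as np
import tensorflow as tf

def hdf5_batches(path='x.hdf5', name='x', batch_size=32):
    # Stream fixed-size slices from the HDF5 file instead of loading it all at once.
    with h5py.File(path, 'r') as f:
        ds = f[name]
        for start in range(0, ds.shape[0], batch_size):
            xb = ds[start:start + batch_size].astype('float32')
            yb = np.zeros(len(xb), dtype='float32')  # placeholder labels for illustration
            yield xb, yb

dataset = tf.data.Dataset.from_generator(
    hdf5_batches,
    output_signature=(
        tf.TensorSpec(shape=(None, 200, 200, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    ),
).prefetch(tf.data.AUTOTUNE)

# model.fit(dataset, epochs=1)  # feed the Dataset instead of the dask array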

Answer 1

Score: 1

You really have two questions: 1) about HDF5 storage and 2) about reading the file with tensorflow model.fit(). My answer addresses the first question.

When creating large datasets with HDF5, it's important to use chunked storage and set the chunks parameter appropriately. It's not clear from your post what values dask sets when you use da.to_hdf5(). Using h5py and chunked storage, I was able to reduce file creation time to 6 minutes for your example. Code is at the end.
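
As an aside (a hedged sketch, not part of the original answer): da.to_hdf5 accepts keyword arguments that are forwarded to h5py's create_dataset, such as chunks and compression, so you could also set those values explicitly rather than relying on the defaults. The chunk shape and compression below are chosen only as an illustration.

import dask.array as da
x = da.ones((274576, 200, 200, 1), dtype='int')
da.to_hdf5('x.hdf5', {'x': x}, chunks=(4, 200, 200, 1), compression='lzf')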

By default, HDF5 dataset storage is contiguous. When you enable chunked storage, data is stored and accessed in fixed-size chunks. Chunking has performance implications. The recommended chunk size should be between 10 KiB and 1 MiB (larger for larger datasets). Also, when any element in a chunk is accessed, the entire chunk is read from disk. So, the "best" chunk shape depends on how you will access the data.
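
As a quick sanity check (my own calculation, assuming 8-byte ints, which is what dtype='int' gives on most 64-bit Linux/macOS builds), the chunk shape used in the code at the end works out to roughly 1.2 MiB per chunk:

import numpy as np
chunk_nbytes = int(np.prod((4, 200, 200, 1))) * np.dtype('int').itemsize
print(chunk_nbytes, chunk_nbytes / 2**20)  # 1280000 bytes, ~1.22 MiB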

Note: this line y=h5py.File('x.hdf5','r')['x'][:] will read the entire dataset into a NumPy array (all 42 GiB of it!). I'm surprised it works. It's better to create an h5py dataset object like this: x_ds=h5py.File('x.hdf5','r')['x']. For many operations, the object behaves like an array. When you need an array object, you can use NumPy slice notation to read only the desired data (for example: x_ds[0:5,:,:,:]).
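
A minimal illustration of that pattern, using the file name from the question (the slice bounds are arbitrary):

import h5py
with h5py.File('x.hdf5', 'r') as f:
    x_ds = f['x']                     # lazy h5py Dataset; nothing read from disk yet
    first_five = x_ds[0:5, :, :, :]   # reads only these 5 "rows"
    print(x_ds.shape, first_five.shape)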

Here is a method to create an HDF5 file with a dataset of shape=(274_576,200,200,1). Note that it uses chunked storage, and data is added in blocks that match the chunk size. The appropriate chunks parameter depends on how you plan to access the data. I set it assuming you will read "rows" of data (on axis 0). Modify my values to benchmark how the resulting file behaves.

import time
import h5py
import numpy as np

chunksize = (c0, c1, c2, c3) = (4, 200, 200, 1)
nrows = 274_576
start = time.time()

with h5py.File('SO_76299271.h5', 'w') as h5f:
    h5f.create_dataset('x', shape=(nrows, 200, 200, 1), dtype='int', chunks=chunksize)  # creates empty chunked dataset of ints

    for i in range(nrows // c0):
        # write one chunk-sized block of random ints per iteration
        arr = np.random.randint(0, 255, size=chunksize)
        h5f['x'][i*c0:(i*c0 + c0)] = arr

print(f'File creation time: {time.time() - start:.1f} s')

with h5py.File('SO_76299271.h5', 'r') as h5f:
    x_ds = h5f['x']  # h5py dataset object; valid only while the file is open
