Mat73 is taking forever to open a MATLAB .mat file
Question
I am using mat73 to open .mat files (about 1 GB each on average) on a remote server. When I run locally, a file loads in under 10 seconds, but over the remote connection the files never finish loading (more than 2 minutes). Any idea why this occurs? Is it a remote connection problem?
I reset the conda env and removed mat73 and its dependencies. I also tried to open the files with just h5py, and it didn't work.
What I've tried:
To test whether it's a network connection speed problem, I ran the speedtest CLI on the cluster and got back 750 Mbit/s down and 850 Mbit/s up; downloading about 10 GB of data took roughly 15 minutes.
When running mat73 and h5py locally, a 2 GB file takes 7 s and 0.5 s respectively. When running over my remote connection in VS Code, the notebook ran for 70+ minutes (I had to stop it; it seemed like it would never finish).
I believe it might be a Jupyter/Python/environment problem. I reinstalled everything and tried Python 3.9 and 3.10. Nothing seems to fix my problem.
I've narrowed my problem down to either mat73 or its h5py dependency. When running h5py.File('my file.mat') or mat73.loadmat('my file.mat'), the call hangs indefinitely and nothing happens. I've tried both of these functions on a very small .mat file (though not saved as MAT v7.3) and it also took a very long time. I believe it might be an issue with the package.
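In case it helps, this is roughly how I am timing the two calls (a minimal sketch; 'my file.mat' stands in for the actual ~1 GB MAT v7.3 file):
import time

import h5py
import mat73

path = 'my file.mat'  # placeholder for the real ~1 GB MAT v7.3 file

# Bare h5py open: only reads file metadata, so it should return almost immediately
t0 = time.perf_counter()
with h5py.File(path, 'r') as f:
    print('top-level keys:', list(f.keys()))
print(f'h5py.File: {time.perf_counter() - t0:.2f} s')

# Full mat73 load: reads every dataset into memory
t0 = time.perf_counter()
data = mat73.loadmat(path)
print(f'mat73.loadmat: {time.perf_counter() - t0:.2f} s')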
Answer 1
Score: 1
When diagnosing problems like this, it's easier (and usually faster) to start with something that works and expand from there. Here is a very simple script that uses h5py to create a small HDF5 file, close it, and reopen it. Run it remotely; it should run in an instant.
import h5py

with h5py.File('SO_75389309.h5', 'w') as h5w:
    l1 = [i for i in range(100)]
    h5w.create_dataset('test1', data=l1)

with h5py.File('SO_75389309.h5') as h5r:
    print(h5r['test1'].shape, h5r['test1'].dtype)
Output should be:
(100,) int32
If it works, keep testing with larger file sizes. Increase the range() to create a larger list (or use a np.array) and create more datasets (e.g., 'test2', 'test3', etc.). The goal is to create a large HDF5 file that replicates the performance bottleneck.
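For example, something along these lines (a rough sketch; the shapes and dataset count are arbitrary assumptions and can be scaled up until the file approaches the 1-2 GB range you are working with):
import h5py
import numpy as np

# Each dataset is ~100 MB of float64 (1250 x 10000 values x 8 bytes); 10 datasets ~= 1 GB total
with h5py.File('SO_75389309_big.h5', 'w') as h5w:
    for i in range(10):
        arr = np.random.random((1250, 10_000))
        h5w.create_dataset(f'test{i}', data=arr)

# Reopen and read every dataset back so the read path is exercised as well
with h5py.File('SO_75389309_big.h5') as h5r:
    for name, dset in h5r.items():
        print(name, dset.shape, dset.dtype, float(dset[:].mean()))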
If that small example does not run quickly, there's something in your remote configuration (either the notebook, package versions, or the virtual instance on the server). That will be harder to diagnose. You said you tried Python 3.9 and 3.10. What package versions are you using? You can get them with:
import h5py
print(h5py.__version__)
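If you want the mat73 and numpy versions in the same report, importlib.metadata (standard library since Python 3.8) should also work:
import importlib.metadata as md

# Print the installed distribution version of each relevant package
for pkg in ('h5py', 'mat73', 'numpy'):
    print(pkg, md.version(pkg))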