Downloading huge data (more than 1 GB) with the requests library and saving it to a file in Python
Question
I am reading huge data from requests and saving it to a file. This works for less than 1 GB of data, but for 1 GB to 5 GB it takes a very long time, I never see the data saved to the file, and I get connection errors.
Piece of Code I tried:
with request.get(url....) as r:
    with open(file, 'wb') as f:
        for chunk in r.iter_content(chunk_size=10000):
            if chunk:
                f.write(chunk)
                f.flush()
Any suggestions to speed up the download and save the data to a file would be helpful. I tried different chunk sizes and commenting out flush, but there was not much improvement.
This works for less than 1 GB of data, but for more than 1 GB it takes a very long time and fails with a connection error from the source we fetch the data from with requests.
Answer 1
Score: 1
I think the best approach is to do a parallel download.
Step 1: pip install pypdl
Step 2: to download the file you could use:
from pypdl import Downloader
dl = Downloader()
dl.start('http://example.com/file.txt', 'file.txt')
by: Jishnu
There are different options in the source Stack Overflow question: https://stackoverflow.com/questions/58571343/downloading-a-large-file-in-parts-using-multiple-parallel-threads
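The linked question covers the same idea with plain requests and threads rather than pypdl. A rough sketch of that approach follows (not from the original answer; download_part, parallel_download, url, dest and num_parts are placeholder names, and it assumes the server reports Content-Length and supports HTTP Range requests):

import requests
from concurrent.futures import ThreadPoolExecutor

def download_part(url, dest, start, end):
    # Fetch bytes [start, end] and write them at the matching offset.
    headers = {"Range": f"bytes={start}-{end}"}
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "r+b") as f:
            f.seek(start)
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)

def parallel_download(url, dest, num_parts=4):
    size = int(requests.head(url, allow_redirects=True, timeout=60).headers["Content-Length"])
    with open(dest, "wb") as f:
        f.truncate(size)  # pre-allocate so each worker can seek to its own offset
    part = size // num_parts
    ranges = [(i * part, size - 1 if i == num_parts - 1 else (i + 1) * part - 1)
              for i in range(num_parts)]
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        futures = [pool.submit(download_part, url, dest, s, e) for s, e in ranges]
        for fut in futures:
            fut.result()  # re-raise any error from a worker

# parallel_download('http://example.com/file.txt', 'file.txt')

Libraries like pypdl wrap this same ranged, multi-connection pattern for you.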
Answer 2
Score: 1
As noted in the comments, you need to pass stream=True to requests.get(), or you'll end up with lots of memory use. You may be doing that already - it's not clear from your question.
if chunk:
This step isn't required - iter_content() won't give you empty chunks.
f.flush()
This is slowing your code down. Each call forces the buffered data to be written out and makes Python finish the previous write before it begins the next one. It's much faster to let as many writes as possible queue up in the buffer.
It's also not required. When the with block exits, the file is closed, which implicitly flushes the remaining writes to the file.
For those reasons, you should delete this line of code.
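Putting that advice together, a minimal sketch of the corrected loop might look like this (the URL, file name and one-megabyte chunk size are placeholders, not values from the question):

import requests

url = "http://example.com/file.bin"  # placeholder URL
file = "file.bin"                    # placeholder output path

# stream=True keeps requests from loading the whole body into memory;
# the if-chunk check and the per-chunk flush() are dropped.
with requests.get(url, stream=True, timeout=60) as r:
    r.raise_for_status()
    with open(file, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)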