Downloading more than 1GB of data with Python's requests library and saving it to a file


Question

I am reading a large amount of data from requests and saving it to a file. This works for less than 1GB of data, but for 1GB to 5GB and more it takes a very long time, I don't see the data saved to the file, and it ends in a connection error.

The piece of code I tried:

import requests

with requests.get(url....) as r:
    with open(file, 'wb') as f:
        for chunk in r.iter_content(chunk_size=10000):
            if chunk:
                f.write(chunk)
                f.flush()

Any suggestions to speed up the download and save it to a file would be helpful. I tried different chunk sizes and commenting out flush, but there was not much improvement.


This works for less than 1GB of data, but for more than 1GB it takes a very long time and the source we fetch the data from with requests returns a connection error.


Answer 1

Score: 1

I think the best approach is to do a parallel download.

Step 1: pip install pypdl

Step 2: to download the file, you can use the following code:

from pypdl import Downloader

dl = Downloader()
dl.start('http://example.com/file.txt', 'file.txt')

by: Jishnu

There are different options in the source Stack Overflow question.

Source: https://stackoverflow.com/questions/58571343/downloading-a-large-file-in-parts-using-multiple-parallel-threads
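
For reference, a similar parallel download can be sketched with plain requests by splitting the file into byte ranges and fetching them in threads. This is only a rough sketch, not the pypdl implementation: it assumes the server returns a Content-Length header and honours HTTP Range requests, and url, output_path, and num_parts are placeholder names.

import requests
from concurrent.futures import ThreadPoolExecutor

def download_parallel(url, output_path, num_parts=4):
    # Total size comes from the Content-Length header (assumed to be present).
    total = int(requests.head(url, allow_redirects=True).headers['Content-Length'])
    part_size = total // num_parts

    # Pre-allocate the output file so every worker can seek to its own offset.
    with open(output_path, 'wb') as f:
        f.truncate(total)

    def fetch_part(index):
        start = index * part_size
        # The last part runs to the end of the file.
        end = total - 1 if index == num_parts - 1 else start + part_size - 1
        headers = {'Range': f'bytes={start}-{end}'}
        with requests.get(url, headers=headers, stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(output_path, 'r+b') as f:
                f.seek(start)
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)

    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(fetch_part, range(num_parts)))

If the server ignores Range headers each worker would receive the full file, so a production version should also check for a 206 Partial Content response.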


Answer 2

Score: 1

As noted in the comments, you need to pass stream=True to requests.get(), or you will end up with a lot of memory use. You may already be doing that - it's not clear from your question.

if chunk:

This step isn't required - iter_content() won't give you empty chunks.

f.flush()

This is slowing your code down. It turns off buffering and tells Python to finish the previous write before it begins the next one. It's much faster to queue up as many writes as possible.

It's also not required. When the with block exits, the file is closed, which implicitly flushes the remaining writes to the file.

For those reasons, you should delete this line of code.
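
Putting those points together, a minimal corrected sketch of the download loop (assuming url and file are defined as in the question; the timeout and the 1 MiB chunk size are illustrative choices, not part of the original answer) would be:

import requests

# stream=True keeps the whole response body out of memory.
with requests.get(url, stream=True, timeout=60) as r:
    r.raise_for_status()
    with open(file, 'wb') as f:
        # Larger chunks mean fewer Python-level loop iterations.
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)  # no flush(); closing the file flushes what remains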

