Downloading huge data (more than 1 GB) with the requests library and saving it to a file in Python

Question

Reading huge data from requests and saving it to a file. This works for less than 1 GB of data, but for 1 GB to 5 GB it takes a very long time, I never see the data saved to the file, and I get connection errors.

Piece of Code I tried:

with requests.get(url....) as r:
    with open(file, 'wb') as f:
        for chunk in r.iter_content(chunk_size=10000):
            if chunk:
                f.write(chunk)
                f.flush()

Any suggestions for speeding up the download and saving it to a file would be helpful. I tried different chunk sizes and commented out the flush, but there was not much improvement.

Again, this works for less than 1 GB of data, but above 1 GB it takes a very long time and the source we fetch the data from with requests eventually returns a connection error.

Answer 1

Score: 1

I think the best approach is to do a parallel download.

Step 1: pip install pypdl

Step 2: to download the file, you could use:

from pypdl import Downloader

dl = Downloader()
dl.start('http://example.com/file.txt', 'file.txt')

by: Jishnu

There are different options in the linked Stack Overflow question.

source: https://stackoverflow.com/questions/58571343/downloading-a-large-file-in-parts-using-multiple-parallel-threads
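
For reference, here is a minimal sketch of one of the approaches discussed there: splitting the file into byte ranges and fetching them on parallel threads with requests. It assumes the server reports a Content-Length and supports HTTP Range requests; the URL, file name, part count, and timeout are placeholder choices, not something from the original question.

import concurrent.futures
import requests

def download_in_parts(url, path, n_parts=4):
    # File size as reported by the server; this needs a Content-Length header.
    size = int(requests.head(url, allow_redirects=True).headers['Content-Length'])
    part_size = size // n_parts

    def fetch(i):
        start = i * part_size
        end = size - 1 if i == n_parts - 1 else start + part_size - 1
        # Fetch only this byte range; the server must honour Range requests.
        r = requests.get(url, headers={'Range': f'bytes={start}-{end}'}, timeout=60)
        r.raise_for_status()
        return i, r.content

    # Pre-size the file, then write each part at its own offset.
    with open(path, 'wb') as f:
        f.truncate(size)
        with concurrent.futures.ThreadPoolExecutor(max_workers=n_parts) as pool:
            for i, data in pool.map(fetch, range(n_parts)):
                f.seek(i * part_size)
                f.write(data)

download_in_parts('http://example.com/file.txt', 'file.txt')

This sketch holds each part in memory before writing it, so for multi-gigabyte files you would want more parts or streamed writes per part; it is only meant to show the range-splitting idea.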

Answer 2

Score: 1

As noted in the comments, you need to pass stream=True to requests.get(), or you'll end up with lots of memory use. You may be doing that already - it's not clear from your question.


if chunk:

This step isn't required - iter_content() won't give you empty chunks.


f.flush()

This is slowing your code down. Calling flush() after every chunk defeats the write buffer: Python has to finish handing the previous chunk to the operating system before it starts the next one. It's much faster to let as many writes as possible queue up in the buffer.

It's also not required. When the with block exits, the file is closed, which implicitly flushes any remaining buffered writes to the file.

For those reasons, you should delete this line of code.
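
Putting these points together, a minimal sketch of the download loop with stream=True, no `if chunk:` check, and no flush (the URL, file name, chunk size, and timeout here are placeholder choices):

import requests

url = 'http://example.com/file.bin'   # placeholder
path = 'file.bin'                     # placeholder

# stream=True keeps requests from loading the whole response body into memory.
with requests.get(url, stream=True, timeout=60) as r:
    r.raise_for_status()
    with open(path, 'wb') as f:
        # No `if chunk:` and no f.flush(); closing the file flushes the buffer.
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)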
