Stream download a very large file over a bad connection


Question

I want to process a very large file (a few terabytes) as a stream. The file is accessible over HTTP at a URL such as:

http://example.com/some-file

This command can do that:

wget -q -O - http://some-host/some-file | process-command

but if the connection is lost, I have to start processing again from the beginning.

wget -c cannot be used, because the file is too large to store.

Is there another command that can stream the remote file while handling the required reconnects internally?


Answer 1

Score: 1


The main idea for solving this problem is to download the file chunk by chunk and record the offset of the last downloaded chunk somewhere. First, we need to choose a chunk size. The closer the chunk size is to the size of the file being downloaded, the better the performance, but the more data is at risk of being lost when the connection drops. My suggestion is around 10 percent of the file size. In my experience, this method reduces the download speed by roughly 25 percent.
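For example (a minimal sketch, not part of the script below, which simply hardcodes a 10 MB chunk), the chunk size could be derived as roughly 10 percent of the size reported by the Content-Length header:

URL="Your-URL"
# Sketch: choose the chunk size as about 10% of the remote file size
content_length=$(curl -sI "$URL" | grep -i 'Content-Length' | awk '{print $2}' | tr -d '\r')
chunk_size=$((content_length / 10))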

#!/bin/bash

URL="Your-URL"

content_length=$(curl -sI "$URL" | grep -i 'Content-Length' | awk '{print $2}' | tr -d '\r')

# Chunk size in bytes
chunk_size=10000000

# Start from byte 0 on the first run; later runs resume from offset.txt
[[ -f offset.txt ]] || echo 0 > offset.txt

while true; do
        offset=$(grep -Eo '[0-9]+' offset.txt)
        next_chunk=$((offset + chunk_size - 1))
        if [[ $next_chunk -ge $content_length ]]
        then
                # Last (usually shorter) chunk: ranges are inclusive, so it ends at content_length - 1
                curl -r "$offset-$((content_length - 1))" --retry 10000 "$URL" > downloaded-chunk
                # In here you can do your processing of the last downloaded chunk
                break
        fi
        curl -r "$offset-$next_chunk" --retry 10000 "$URL" > downloaded-chunk
        # In here you can do your processing of the downloaded chunk
        offset=$((next_chunk + 1))
        echo "Resume offset (start of the next chunk): $offset bytes"
        echo "$offset" > offset.txt
done

The size of the whole file is needed in order to compute the byte ranges passed to curl's -r range option. The last chunk of data usually does not fill a whole chunk; that case is handled by the if condition inside the while loop.
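Since the original goal was to stream the data into process-command rather than keep it on disk, the per-chunk processing placeholder in the loop could, for instance, write each chunk into a single long-running pipeline through a named pipe. This is only a sketch under that assumption; stream.fifo is a hypothetical name and process-command is the consumer from the question:

#!/bin/bash
# Sketch: feed every downloaded chunk into one long-running process-command
# through a named pipe, so the consumer sees the chunks as a single stream.
mkfifo stream.fifo                 # hypothetical pipe name
process-command < stream.fifo &    # consumer; blocks until a writer opens the pipe
exec 3> stream.fifo                # keep one write end open for the whole run

# ... inside the while loop, replace the "> downloaded-chunk" redirection with:
#         curl -r "$offset-$next_chunk" --retry 10000 "$URL" >&3

exec 3>&-                          # after the loop: close the pipe to signal end of stream
wait                               # let process-command finish

Keeping one write end open with exec prevents the consumer from seeing end-of-file between chunks.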

