Stream download very large file over bad connection

Question

I want to process a very large file (a few terabytes) as a stream. The file is accessible over HTTP at a URL such as:

http://example.com/some-file

This command can do that:

wget -q -O - http://some-host/some-file | process-command

but if the connection is lost, I have to start processing from the beginning.

wget -c cannot be used because the file is too large to store.

Is there another command that can stream the remote file, while handling the required reconnects internally?

Answer 1

Score: 1

The main idea is to download the file chunk by chunk and record the offset of the last downloaded chunk somewhere. To do that, we first pick a chunk size. The closer the chunk size is to the size of the file being downloaded, the better the performance, but the more data is at risk of being lost when the connection drops. My suggestion is around 10 percent of the file size. In my experience, the download speed decreases by about 25 percent with this method.

#!/bin/bash

URL="Your-URL"

# Total size of the remote file in bytes, taken from the Content-Length header
content_length=$(curl -sI "$URL" | grep -i 'Content-Length' | awk '{print $2}' | tr -d '\r')

# Chunk size in bytes
chunk_size=10000000

# Start from byte 0 on the first run
[[ -f offset.txt ]] || echo 0 > offset.txt

while true; do
        offset=$(grep -Eo '[0-9]+' offset.txt)
        next_chunk=$((offset + chunk_size - 1))
        if [[ $next_chunk -ge $content_length ]]
        then
                # Last chunk: request everything from offset to the end of the file
                curl -r "$offset-$content_length" --retry 10000 "$URL" > downloaded-chunk
                # In here you can do your process with the last downloaded chunk
                break
        fi
        curl -r "$offset-$next_chunk" --retry 10000 "$URL" > downloaded-chunk
        # In here you can do your process with a downloaded chunk
        offset=$((next_chunk + 1))
        echo "Offset of the next chunk to download: $offset bytes"
        echo "$offset" > offset.txt
done

The size of the whole file is needed to know which range to request when using curl with the range option -r. For example, with offset=0 and chunk_size=10000000, the first request is -r 0-9999999, which is exactly 10,000,000 bytes. The last chunk of data usually does not fill a whole chunk; that case is handled by the if condition inside the while loop.
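
The 10 percent suggestion above could also be computed rather than hardcoded; a one-line sketch, reusing the content_length variable from the script:

# Derive chunk_size as roughly 10% of the remote file instead of a fixed 10 MB
chunk_size=$(( content_length / 10 ))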
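
Since the original goal is to stream the data into process-command rather than keep anything on disk, the same chunked approach can also feed one continuous pipe. The following is only a minimal sketch of that idea, not part of the answer above: process-command and chunk.tmp are placeholder names, and it assumes the server reports Content-Length, honors Range requests, and that GNU coreutils (stat -c) is available.

#!/bin/bash
# Sketch only: stream verified chunks into the consumer instead of storing them

URL="Your-URL"
chunk_size=10000000   # bytes

content_length=$(curl -sI "$URL" | grep -i 'Content-Length' | awk '{print $2}' | tr -d '\r')

offset=0
while (( offset < content_length )); do
        end=$(( offset + chunk_size - 1 ))
        (( end >= content_length )) && end=$(( content_length - 1 ))
        expected=$(( end - offset + 1 ))
        # Re-request the same range until the whole chunk arrives, so that no
        # partial or duplicated bytes ever reach the pipe
        until curl -sf -r "$offset-$end" "$URL" -o chunk.tmp &&
              [[ $(stat -c %s chunk.tmp) -eq $expected ]]; do
                sleep 5
        done
        cat chunk.tmp
        offset=$(( end + 1 ))
done | process-command
rm -f chunk.tmp

Unlike the offset.txt variant above, this sketch only survives dropped connections: if the whole pipeline itself is restarted, processing starts over, because a byte stream piped into process-command cannot be resumed midway.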
