stream download very large file over bad connection
Question
I want to process a very large file (a few terabytes) as a stream. This file is accessible via the HTTP protocol from a URL such as:
http://example.com/some-file
This command can do that:
wget -q -O - http://some-host/some-file | process-command
but if the connection is lost, I must begin the processing from the beginning.
wget -c cannot be used because I can't store the file due to its large size.
Is there another command that can stream the remote file while handling the required reconnects internally?
Answer 1
Score: 1
The main idea of solving this problem is to download the file chunk by chunk and record the offset of the last downloaded chunk somewhere. For this, we first specify the chunk size. The closer the chunk size is to the size of the file being downloaded, the better the performance, but the more data you risk losing when the connection breaks. My suggestion is around 10 percent of the size of the file being downloaded. In my experience, download speed drops by about 25 percent with this method.
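For example, the chunk size could be derived directly from the Content-Length header instead of being hard-coded; a minimal sketch following the 10 percent suggestion above (variable names mirror the script below, and URL is assumed to be set as it is there):

# Sketch: pick a chunk size of roughly 10% of the file size reported by the server.
# Assumes URL is set as in the script below.
content_length=$(curl -sI "$URL" | grep -i 'Content-Length' | awk '{print $2}' | tr -d '\r')
chunk_size=$((content_length / 10))
echo "File is $content_length bytes; using chunks of $chunk_size bytes"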
#!/bin/bash
URL="Your-URL"
# Total file size in bytes, taken from the Content-Length response header
content_length=$(curl -sI "$URL" | grep -i 'Content-Length' | awk '{print $2}' | tr -d '\r')
# Chunk size in bytes
chunk_size=10000000
# Resume from the offset saved in offset.txt, or start at byte 0 on the first run
offset=0
[ -f offset.txt ] && offset=$(grep -Eo '[0-9]+' offset.txt)
while true; do
    next_chunk=$((offset + chunk_size - 1))
    if [[ next_chunk -ge content_length ]]
    then
        # Last (possibly shorter) chunk: request everything up to the end of the file
        curl -r "$offset"-"$content_length" --retry 10000 "$URL" > downloaded-chunk
        # In here you can do your process with the last downloaded chunk
        break
    fi
    curl -r "$offset"-"$next_chunk" --retry 10000 "$URL" > downloaded-chunk
    # In here you can do your process with a downloaded chunk
    offset=$((next_chunk + 1))
    echo "Offset of the next chunk to download: $offset bytes"
    echo "$offset" > offset.txt
done
The size of the whole file is needed so that we know which range to request when using curl with the range option -r. The last chunk of data usually does not fill a whole chunk; this case is handled by the if condition inside the while loop.
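If the goal is to keep feeding the question's process-command, the same loop can write each chunk to stdout instead of a temporary file, so the downstream command sees one continuous stream. A minimal sketch, assuming URL, content_length and chunk_size are set as in the script above and that process-command reads from stdin:

# Sketch: concatenate all chunks on stdout and pipe them once into process-command.
{
    offset=0
    while [ "$offset" -lt "$content_length" ]; do
        next_chunk=$((offset + chunk_size - 1))
        curl -s -r "$offset"-"$next_chunk" --retry 10000 "$URL"
        offset=$((next_chunk + 1))
    done
} | process-command

Note that in this single-pipe form the offset is not persisted to offset.txt; that only helps if process-command itself can also resume from a saved position, otherwise a failure of the pipeline still means starting over.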
Comments