Download zip files via wget

Question

I like to play chess and would like to download the Grandmasters' games from the internet as zip files, starting from Monday 25th Jun 2012 up to today, and then continuously every week, on Monday. The zip files are freely available. They are named with a sequential number, e.g. twic920g.zip through twic1493g.zip; the next week the number increases by 1, to twic1494g.zip. For the first run, this script works.

Here are my questions:

  1. how do I increase the counter by plus 1 every week?
  2. when unpacking, the locally saved zip files are also unpacked again, not only the newly downloaded file. With the cat command the old and new files are merged, so master.pgn contains the games twice.
#!/bin/bash

dir="pgn/zip"

if [[ ! -d $dir ]]; then
    mkdir -p "$dir"
fi

cd "$dir" || exit 1

# Download all PGN files
for i in {920..1493}; do
    wget -nc "https://www.theweekinchess.com/zips/twic${i}g.zip"
    unzip "twic${i}g.zip"
    cat "twic${i}.pgn" >> ../master.pgn
    rm "twic${i}.pgn"
done

Answer 1 (score: 1)


> how do I increase the counter by plus 1 every week?

I think once you've downloaded the historic games you don't need to worry about incrementing a counter: you can get the link for the "current" game by parsing content from <https://theweekinchess.com/zips/>.

A more robust solution would probably require something other than a shell script, but this works:

curl https://theweekinchess.com/zips/ | grep 'twic[0-9]*g.zip' | cut -f2 -d'"'

For example, running that right now produces:

http://www.theweekinchess.com/zips/twic973g.zip

Just run a script to download the latest archive once a week (e.g., using cron).
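A sketch of such a crontab entry (the script path /home/user/fetch_twic.sh is a placeholder for wherever you keep the download script):

```
# m h dom mon dow  command
# run the fetch script every Monday at 06:00
0 6 * * 1  /home/user/fetch_twic.sh
```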


Alternatively, you could write the number of the last successfully downloaded file to a file, and use that as the starting value the next time the script runs:

#!/bin/bash

dir="pgn/zip"

if [[ ! -d $dir ]]; then
	mkdir -p $dir
fi

cd $dir

# figure out number of last successfully fetched game
last_fetched=$(cat last_fetched 2> /dev/null || echo 0)

if (( last_fetched == 0 )); then
	first=920
else
	first=$(( last_fetched + 1 ))
fi

echo "starting with: $first"

# Download all PGN files
for (( i=first; 1; i++ )); do
	# don't download a file if it already exists
	[[ -f "twic${i}g.zip" ]] && continue

	echo "fetching game $i"
	curl -sSfLO "https://www.theweekinchess.com/zips/twic${i}g.zip" || break
	echo "$i" > last_fetched
	unzip -p twic"$i"g.zip >> ../master.pgn
done
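Yet another option is to derive the issue number from the date instead of tracking state in a file. This is only a sketch, and it assumes TWIC really publishes exactly one issue per week with no gaps (twic920 corresponds to Mon 25 Jun 2012, per the question); it also uses GNU date's `-d` option, so the invocation differs on BSD/macOS:

```shell
#!/bin/bash
# Sketch: compute the expected TWIC issue number for a given date,
# assuming one issue per week with no skipped weeks (twic920 = Mon 25 Jun 2012).
# Requires GNU date for the -d option.
twic_issue_for() {
    local base_epoch target_epoch
    base_epoch=$(date -u -d '2012-06-25' +%s)
    target_epoch=$(date -u -d "$1" +%s)
    # 604800 = seconds per week; integer division floors to whole weeks
    echo $(( 920 + (target_epoch - base_epoch) / 604800 ))
}

twic_issue_for "$(date -u +%F)"   # issue number expected for today
```

With the figures from the question, this lines up: the week after twic1493 (late June 2023) yields 1494.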

> when unpacking, the locally saved zip data is also unpacked again and not only the ... downloaded file. With the cat command the old and new files are merged. So the master.pgn has the games twice.

I'm not sure what you're saying here. You're only unpacking the file you've just downloaded, so any existing zip files shouldn't matter.

Instead of appending to master.pgn in every loop iteration, you could leave the unpacked files on disk and completely regenerate master.pgn at the end of the script:

for (( i=first; 1; i++ )); do
	# don't download a file if it already exists
	[[ -f "twic${i}g.zip" ]] && continue

	echo "fetching game $i"
	curl -sSfLO "https://www.theweekinchess.com/zips/twic${i}g.zip" || break
	echo "$i" > last_fetched
	unzip twic"$i"g.zip
done

cat *.pgn > ../master.pgn
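The regenerate-at-the-end variant avoids duplicates because `>` truncates master.pgn on every run, while `>>` keeps appending to whatever is already there. A self-contained sketch with dummy .pgn files (no downloads involved) illustrates the difference:

```shell
#!/bin/bash
# Sketch: why '>' (truncate) avoids the duplicate games that '>>' (append) causes.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# two dummy "archives", one game each
printf 'game A\n' > twic100.pgn
printf 'game B\n' > twic101.pgn

# appending on every run duplicates the existing games:
cat twic*.pgn >> master-append.pgn
cat twic*.pgn >> master-append.pgn   # second run: 4 lines, each game twice

# regenerating truncates first, so repeated runs stay correct:
cat twic*.pgn > master.pgn
cat twic*.pgn > master.pgn           # second run: still 2 lines, one copy each

wc -l master-append.pgn master.pgn
```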

Answer 2 (score: 0)

I propose the following approach, using just wget, to download the most recent ...g.zip file:

wget -nc -r -nd -A g.zip https://theweekinchess.com/zips/

Explanation: I use the recursive download feature of GNU wget, which means wget will traverse the links it finds at the given URL (note that the URL points to a page, not to a particular zip file). The resources found will be downloaded into the current directory (-nd) if they do not already exist (-nc), and only files whose names end with g.zip (-A g.zip) will be kept.

Posted by huangapple on 2023-06-27 19:10:14.
When reposting, please keep the link to this article: https://go.coder-hub.com/76564260.html