wget distinguish header from body in output

Question

In the output of

wget www.google.com --save-headers --output-document - --quiet

how can you tell which lines are the headers and where the body starts (e.g., to tee the different parts into different pipelines)?

Update

# r=$(wget www.google.com --save-headers --output-document - --quiet)
# status=$(echo $r | grep HTTP | awk '{ print $2 }')
# body=$(echo $r | awk '{ if( body ){ print $0 };if( $0 ~ /^$/ ){ body=1 } }')

However, $body is empty.

Update 2

body=$(echo "$r" | awk '{ if( $1 ~ /^[\s\r\n]*$/ ) { b=1 }; if( b ) { print $0 } }')

The fix was the quotes around $r. What a bugger.
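To make the two updates above concrete, here is a minimal sketch of why the first attempt produced an empty $body. It uses a made-up sample response instead of a live request (the sample text is an assumption, not real output from www.google.com). Unquoted, $r is split on whitespace and echo rejoins the words with single spaces, so awk only ever sees one long line. In addition, with CRLF line endings the separator line still contains a carriage return, so /^$/ does not match it even once the quotes are in place.

```shell
# Hypothetical sample response with CRLF line endings, standing in for wget output.
r=$(printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>\r\nhello\r\n</html>\r\n')

# Unquoted expansion: newlines are lost, everything arrives as a single line.
unquoted_lines=$(echo $r | wc -l | tr -d ' ')

# Quoted expansion: every line of the response is preserved.
quoted_lines=$(echo "$r" | wc -l | tr -d ' ')

# Count separator candidates: a strictly empty line vs. a line holding only an optional CR.
strict_blank=$(echo "$r" | awk '/^$/ { n++ } END { print n + 0 }')
cr_blank=$(echo "$r" | awk '/^\r?$/ { n++ } END { print n + 0 }')

echo "unquoted: $unquoted_lines line(s), quoted: $quoted_lines line(s)"
echo "strictly empty: $strict_blank, empty apart from CR: $cr_blank"
```

So both changes in Update 2 matter: the quotes keep the lines apart, and the looser pattern accepts the carriage return on the separator line.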

Answer 1

Score: 1

> how can you tell which lines are the headers and where the body starts

RFC 1945 stipulates that

> The entity body is separated from the headers by a null line (i.e., a
> line with nothing preceding the CRLF).

so in an HTTP response the headers come before the first blank line and the body comes after it. The --save-headers option of GNU wget follows suit:

> Save the headers sent by the HTTP server to the file, preceding the
> actual contents, with an empty line as the separator.

Because CRLF line endings are used, the headers come before the first CRLFCRLF (\r\n\r\n) and the body comes after it. I would use Python for this part as follows. First, download the response to a file named response:

wget www.example.com --save-headers --output-document response --quiet

then create splitter.py as follows:

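# Read the saved response as bytes so the split does not depend on the body's encoding.
with open("response", "rb") as f:
    # Split once, at the first blank line (CRLF CRLF) that ends the header block.
    headers, body = f.read().split(b"\r\n\r\n", 1)
with open("headers", "wb") as f:
    f.write(headers)
    # Restore the CRLF that terminated the last header line.
    f.write(b"\r\n")
with open("body", "wb") as f:
    f.write(body)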

and run it:

python splitter.py

I use binary (b) mode so that it works with any encoding, and I write \r\n after the headers because that is the CRLF ending the last key-value pair. Feel free to use any other tool you are comfortable with to do the split.
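For readers who would rather stay in the shell, the same blank-line rule can be sketched with awk. This is only an illustration under some assumptions: split_response and headers_file are hypothetical names, the body is assumed to be text (awk is not a safe tool for binary payloads), and carriage returns are stripped from the header lines for convenience while the body lines are passed through unchanged.

```shell
# split_response HEADERS_FILE
# Reads a "wget --save-headers" style response on stdin, writes the header
# lines to HEADERS_FILE and the body to stdout so it can be piped onward.
split_response() {
    awk -v headers_file="$1" '
        in_body { print; next }                         # after the separator: body
        /^\r?$/ { in_body = 1; next }                   # first blank line: the separator
                { sub(/\r$/, ""); print > headers_file } # before it: header lines
    '
}

# Possible usage (not run here, since it needs network access):
#   wget www.example.com --save-headers --output-document - --quiet \
#       | split_response headers.txt > body.html
```

Sending the body to stdout rather than to a second file keeps it available for a further pipeline, which is what the question asks for. Only the first blank line is treated as the separator, so blank lines inside the body are preserved.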

Answer 2

Score: 0

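# Fetch headers and body together; cookies are loaded from and saved to /root/cookies.txt.
r=$(wget www.example.com --save-headers --quiet --load-cookies /root/cookies.txt --save-cookies /root/cookies.txt --keep-session-cookies --output-document - 2>/dev/null)

# Take the status code from the status line only, so body lines that mention "HTTP" are ignored.
status=$(echo "$r" | awk '/^HTTP\// { print $2; exit }')

if [ "$status" = "200" ]; then
    # Print everything after the first blank line; the separator line may still carry a CR.
    body=$(echo "$r" | awk 'body { print } /^\r?$/ { body = 1 }')
else
    exit 1
fi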


huangapple
  • Published on February 16, 2023 at 17:54:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75470525.html