How do I save an XML file locally from a website using Python and the requests module?

Question

I'm using Python for a web scraping project, and I came across this URL, which downloads an XML file to my PC.

Is there a way I can access the XML file that's downloaded when you click the link? I'm OK with saving the XML locally if that's the only way, but I have no idea how to do so.

I've tried using the requests module, but I get a byte string when doing so.

import requests

r = requests.get(
    "https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601"
)

print(r.content)

Answer 1

Score: 0

You need to specify the request headers (here, an Accept header) to download the file from this specific site.

Here is how I did it:

import requests

filename = "file.xml"
url = "https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601"
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8'
}

r = requests.get(url, headers=headers)

with open(filename, "wb") as file:
    file.write(r.content)
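
Before writing the response to disk, it can help to confirm that the body really is XML, since the same URL may return something else when the header is missing. The following is a minimal sketch, not part of the original answer: `save_if_well_formed` is a hypothetical helper name, and the sample bytes are a stand-in for `r.content`, not the real document.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def save_if_well_formed(content: bytes, path: str) -> bool:
    """Write content to path only if it parses as XML; return True on success."""
    try:
        ET.fromstring(content)  # raises ParseError if the body is not well-formed XML
    except ET.ParseError:
        return False
    Path(path).write_bytes(content)  # bytes in, bytes out: no re-encoding
    return True


# Stand-in for r.content (hypothetical sample, not the real document).
sample = b'<?xml version="1.0" encoding="UTF-8"?><root><item>1</item></root>'
target = str(Path(tempfile.mkdtemp()) / "file.xml")

saved = save_if_well_formed(sample, target)
# A quoted base64 string is not XML, so nothing is written for it.
rejected = save_if_well_formed(b'"PD94bWwg"', target + ".bad")
```

In the answer's script, the call would be `save_if_well_formed(r.content, filename)`; a `False` result suggests the server did not return XML for that request.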

Answer 2

Score: 0

Given the quoted base64-encoded message you retrieved,
you can strip the " quotes and decode it if you want.

>>> import ast
>>> import requests
>>> from base64 import b64decode
>>>
>>> r = requests.get("https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601")
>>> assert r.text.endswith('RPg=="')
>>>
>>> print(b64decode(ast.literal_eval(r.text)).decode()[:112])
<?xml version="1.0" encoding="UTF-8"?><DOC_ARQ xmlns="urn:fidc">
  <CAB_INFORM>
    <DT_COMPT>04/2023</DT_COMPT>
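
Once decoded, the document can also be read in memory rather than saved. The sketch below is an illustration only: the sample string is a shortened stand-in built from the excerpt above, with the closing tags added by assumption, and `urn:fidc` is the default namespace shown in that excerpt.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the decoded response; the closing tags are assumed.
xml_text = (
    '<DOC_ARQ xmlns="urn:fidc">'
    "<CAB_INFORM><DT_COMPT>04/2023</DT_COMPT></CAB_INFORM>"
    "</DOC_ARQ>"
)

# The document declares a default namespace, so every tag in a path
# has to be qualified with a prefix mapped to that namespace.
ns = {"f": "urn:fidc"}
root = ET.fromstring(xml_text)
period = root.findtext("f:CAB_INFORM/f:DT_COMPT", namespaces=ns)
print(period)
```

Without the namespace mapping, `findtext("CAB_INFORM/DT_COMPT")` would return `None`, which is a common stumbling block with namespaced XML.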

Answer 3

Score: 0

By default (that is, without an Accept header), the response content will be base64 encoded.

Therefore:

import requests
import base64

with requests.get('https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601') as response:
    response.raise_for_status()
    with open('/Volumes/G-Drive/foo.xml', 'w') as output:
        output.write(base64.b64decode(response.content).decode())
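
Since the body can arrive either as base64 text or as plain XML depending on the request headers, a small normalizing step lets the rest of a script ignore the difference. This is a sketch under assumptions: `to_xml_bytes` is a hypothetical helper, and it treats any body starting with `<` as XML and anything else as base64, optionally wrapped in double quotes.

```python
import base64


def to_xml_bytes(body: bytes) -> bytes:
    """Return XML bytes whether body is raw XML or (quoted) base64 of it."""
    stripped = body.strip()
    if stripped.startswith(b"<"):
        return stripped  # already XML, nothing to decode
    # Drop surrounding double quotes, then decode strictly so that
    # unexpected characters raise an error instead of being skipped.
    return base64.b64decode(stripped.strip(b'"'), validate=True)


# Hypothetical sample data standing in for response.content.
xml = b'<?xml version="1.0" encoding="UTF-8"?><root/>'
quoted = b'"' + base64.b64encode(xml) + b'"'
```

The result is bytes, so it is best written with the file opened in `'wb'` mode, which avoids depending on the platform's default text encoding.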

However, if you add an Accept HTTP header, you can simplify the code as follows:

import requests

headers = {
    'Accept': 'text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8'
}

with requests.get('https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601', headers=headers) as response:
    response.raise_for_status()
    with open('/Volumes/G-Drive/foo.xml', 'w') as output:
        output.write(response.text)

If you're concerned about memory consumption (which could theoretically be an issue if the data being downloaded is very large) then you should consider streaming.

import requests

headers = {
    'Accept': 'text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8'
}

with requests.get('https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601', headers=headers, stream=True) as response:
    response.raise_for_status()
    with open('/Volumes/G-Drive/foo.xml', 'wb') as output:
        for chunk in response.iter_content(1024):
            output.write(chunk)
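
The streaming loop can be factored into a helper that works with any object exposing `iter_content`, which also makes it possible to try out without network access. A minimal sketch: `stream_to_file` and `FakeResponse` are hypothetical names, and the 8192-byte chunk size is an arbitrary choice.

```python
import tempfile
from pathlib import Path


def stream_to_file(response, path: str, chunk_size: int = 8192) -> int:
    """Write response.iter_content() chunks to path; return bytes written."""
    written = 0
    with open(path, "wb") as output:
        for chunk in response.iter_content(chunk_size):
            written += output.write(chunk)  # write() returns the byte count
    return written


class FakeResponse:
    """Offline stand-in for a requests response (illustration only)."""

    def __init__(self, data: bytes):
        self._data = data

    def iter_content(self, chunk_size: int):
        # Yield the payload in fixed-size slices, like a streamed body.
        for i in range(0, len(self._data), chunk_size):
            yield self._data[i:i + chunk_size]


payload = b"<root>" + b"x" * 20000 + b"</root>"
dest = str(Path(tempfile.mkdtemp()) / "foo.xml")
total = stream_to_file(FakeResponse(payload), dest)
```

With requests, the same helper would be called as `stream_to_file(response, path)` inside the `with requests.get(..., stream=True) as response:` block, after `raise_for_status()`.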

huangapple
  • Posted on May 25, 2023, 01:00:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76325879.html