May 25, 2023

How do I save an XML file locally from a website using Python and requests module?

Question

I'm using Python for a web scraping project, and I came across this URL, which downloads an XML file to my PC.

Is there a way to access the XML file that's downloaded when you click the link? I'm OK with saving the XML locally if that's the only way, but I have no idea how to do so.

I've tried using the requests module, but all I get back is a byte string.

import requests

r = requests.get(
    "https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601"
)

print(r.content)

Answer 1

Score: 0

You need to send request headers (an Accept header, in this case) to download the file from that specific site.

Here is how I did it:

import requests

filename = "file.xml"
url = "https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601"
# Browser-style Accept header, so the site returns the XML itself.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8'
}

r = requests.get(url, headers=headers)
r.raise_for_status()  # stop on an HTTP error instead of saving an error page

# Write the raw bytes so the file keeps the encoding the server sent.
with open(filename, "wb") as file:
    file.write(r.content)

Answer 2

Score: 0

The response body you retrieved is a base64-encoded message wrapped in double quotes.
You can strip the " quotes and decode it to get the XML:

>>> import ast
>>> from base64 import b64decode
>>> import requests
>>>
>>> r = requests.get("https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601")
>>> assert r.text.endswith('RPg=="')
>>>
>>> # literal_eval parses the quoted string and drops the surrounding quotes.
>>> print(b64decode(ast.literal_eval(r.text)).decode()[:112])
<?xml version="1.0" encoding="UTF-8"?><DOC_ARQ xmlns="urn:fidc">
  <CAB_INFORM>
    <DT_COMPT>04/2023</DT_COMPT>
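
The same decoding step can be wrapped in a small helper and tried without touching the network. This is only a sketch: the name decode_quoted_base64 and the tiny sample payload are made up for illustration, and it assumes the body is always a double-quoted base64 string containing UTF-8 text.

```python
import ast
from base64 import b64decode, b64encode


def decode_quoted_base64(body: str) -> str:
    """Turn a double-quoted base64 string into the decoded text."""
    # literal_eval parses the quoted string literal and drops the quotes.
    encoded = ast.literal_eval(body.strip())
    if not isinstance(encoded, str):
        raise ValueError("expected a quoted string")
    # validate=True rejects anything that is not clean base64.
    return b64decode(encoded, validate=True).decode("utf-8")


# Hypothetical stand-in for r.text; the real payload is much longer.
sample_xml = '<?xml version="1.0"?><root/>'
sample_body = '"' + b64encode(sample_xml.encode("utf-8")).decode("ascii") + '"'

xml_text = decode_quoted_base64(sample_body)
print(xml_text)
```

With a real response you would pass r.text to the helper instead of sample_body.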

Answer 3

Score: 0

By default, this endpoint returns the content base64-encoded.

So decode it before writing it to disk:

import requests
import base64

with requests.get('https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601') as response:
    response.raise_for_status()
    # b64decode returns bytes, so write in binary mode to keep the file's own encoding.
    with open('/Volumes/G-Drive/foo.xml', 'wb') as output:
        output.write(base64.b64decode(response.content))

However, if you add an Accept HTTP header, you can simplify the code as follows:

import requests

headers = {
    'Accept': 'text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8'
}

with requests.get('https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601', headers=headers) as response:
    response.raise_for_status()
    # Write the raw bytes rather than response.text to avoid re-encoding the XML.
    with open('/Volumes/G-Drive/foo.xml', 'wb') as output:
        output.write(response.content)
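
If you are not sure which form the server will send back, one option is to check the body before saving it. The sketch below is an assumption-laden illustration, not part of the original answer: to_xml_bytes and save_xml are hypothetical names, the check simply treats any body that starts with "<" as XML, and everything else is assumed to be double-quoted base64.

```python
import base64
import os
import tempfile


def to_xml_bytes(body: bytes) -> bytes:
    """Return raw XML bytes whether or not the body is base64-wrapped."""
    stripped = body.strip()
    if stripped.startswith(b"<"):
        return stripped  # already XML
    # Otherwise assume quoted base64: drop the quotes, then decode.
    return base64.b64decode(stripped.strip(b'"'), validate=True)


def save_xml(body: bytes, path: str) -> None:
    """Write the XML bytes to path in binary mode."""
    with open(path, "wb") as output:
        output.write(to_xml_bytes(body))


# Made-up sample bodies standing in for response.content.
xml = b'<?xml version="1.0" encoding="UTF-8"?><DOC_ARQ xmlns="urn:fidc"/>'
wrapped = b'"' + base64.b64encode(xml) + b'"'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "foo.xml")
    save_xml(wrapped, path)
    with open(path, "rb") as saved:
        from_wrapped = saved.read()
    save_xml(xml, path)
    with open(path, "rb") as saved:
        from_plain = saved.read()
```

Both calls should leave the same XML on disk, so the rest of the script does not need to care whether the Accept header was honored.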

If you're concerned about memory consumption (which could be an issue if the downloaded data is very large), consider streaming the response to disk in chunks:

import requests

headers = {
    'Accept': 'text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8'
}

with requests.get('https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601', headers=headers, stream=True) as response:
    response.raise_for_status()
    with open('/Volumes/G-Drive/foo.xml', 'wb') as output:
        for chunk in response.iter_content(1024):
            output.write(chunk)
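
Since the question also asks about accessing the XML without necessarily saving it, here is a minimal sketch that parses the bytes in memory with the standard library. The sample document is hypothetical: it reuses the opening lines from the decoded output shown in the second answer and adds closing tags so that it parses. The real file contains many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: opening lines from the decoded output, closing tags added.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?><DOC_ARQ xmlns="urn:fidc">\n'
    b"  <CAB_INFORM>\n"
    b"    <DT_COMPT>04/2023</DT_COMPT>\n"
    b"  </CAB_INFORM>\n"
    b"</DOC_ARQ>"
)

# Parse straight from bytes; with a real response you could pass the XML bytes here.
root = ET.fromstring(sample)

# The document declares a default namespace, so element lookups need a prefix map.
ns = {"f": "urn:fidc"}
period = root.findtext("f:CAB_INFORM/f:DT_COMPT", namespaces=ns)
print(period)
```

The prefix "f" is arbitrary. Only the namespace URI has to match the one declared in the document.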
