How do I save an XML file locally from a website using Python and requests module?
Question
I'm using Python for a web scraping project, and I came across this URL, which downloads an XML file to my PC.
Is there a way to access the XML file that is downloaded when you click the link? I'm OK with saving the XML locally if that's the only way, but I have no idea how to do so.
I've tried using the requests module, but I get a byte string when doing so.
import requests
r = requests.get(
"https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601"
)
print(r.content)
Answer 1
Score: 0
You need to specify the request headers to download from that specific site.
Here is how I did it:
import requests

filename = "file.xml"
url = "https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601"
# Browser-like Accept header that includes application/xml
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8'
}
r = requests.get(url, headers=headers)
# Write the raw bytes so the file keeps the encoding declared in the XML
with open(filename, "wb") as file:
    file.write(r.content)
Answer 2
Score: 0
Given the quoted base64-encoded message you retrieved, you can strip the " quotes and decode it if you want.
>>> import ast
>>> from base64 import b64decode
>>> import requests
>>> r = requests.get("https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601")
>>> assert r.text.endswith('RPg=="')
>>> # literal_eval strips the surrounding quotes; b64decode recovers the XML
>>> print(b64decode(ast.literal_eval(r.text)).decode()[:112])
<?xml version="1.0" encoding="UTF-8"?><DOC_ARQ xmlns="urn:fidc">
<CAB_INFORM>
<DT_COMPT>04/2023</DT_COMPT>
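If the goal is to read values rather than print the text, the decoded bytes can be parsed in memory. The following is a minimal sketch, assuming the body is a quoted base64 string as described above. The sample payload is built locally as a trimmed stand-in for `r.text`, and `parse_quoted_base64_xml` is a hypothetical helper name, not part of requests.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace taken from the decoded output shown above
NS = {"f": "urn:fidc"}


def parse_quoted_base64_xml(text: str) -> ET.Element:
    """Strip the surrounding quotes, base64-decode, and parse the XML."""
    encoded = text.strip().strip('"')
    xml_bytes = base64.b64decode(encoded)
    # Passing bytes lets the parser honour the encoding in the XML declaration
    return ET.fromstring(xml_bytes)


# Trimmed stand-in for r.text (assumption: the real document has many more elements)
sample_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<DOC_ARQ xmlns="urn:fidc"><CAB_INFORM>'
    '<DT_COMPT>04/2023</DT_COMPT>'
    '</CAB_INFORM></DOC_ARQ>'
)
sample_text = '"' + base64.b64encode(sample_xml.encode("utf-8")).decode("ascii") + '"'

root = parse_quoted_base64_xml(sample_text)
dt_compt = root.find("f:CAB_INFORM/f:DT_COMPT", NS).text
print(dt_compt)
```

With a live response, `parse_quoted_base64_xml(r.text)` would take the place of the sample string. Note that every tag lookup needs the `urn:fidc` namespace prefix.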
Answer 3
Score: 0
By default, the response content from this site is base64-encoded, so decode it before writing:
import requests
import base64

with requests.get('https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601') as response:
    response.raise_for_status()
    # Decode the base64 payload to text, then write it out as UTF-8
    with open('/Volumes/G-Drive/foo.xml', 'w', encoding='utf-8') as output:
        output.write(base64.b64decode(response.content).decode())
However, if you add an Accept HTTP header, you can simplify the code as follows:
import requests

headers = {
    'Accept': 'text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8'
}
with requests.get('https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601', headers=headers) as response:
    response.raise_for_status()
    with open('/Volumes/G-Drive/foo.xml', 'w') as output:
        output.write(response.text)
If you're concerned about memory consumption (which could be an issue if the download is very large), consider streaming the response:
import requests

headers = {
    'Accept': 'text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8'
}
with requests.get('https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=465601', headers=headers, stream=True) as response:
    response.raise_for_status()
    # Write the body in 1 KiB chunks instead of loading it all into memory
    with open('/Volumes/G-Drive/foo.xml', 'wb') as output:
        for chunk in response.iter_content(1024):
            output.write(chunk)
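Since the body may arrive either as plain XML (with the Accept header) or as a quoted base64 string (without it), one option is to normalise it before saving. This is only a sketch under that assumption: `body_to_xml_bytes` and `save_xml` are hypothetical helper names, the "starts with `<`" check is a simple heuristic rather than anything the site documents, and the sample bytes stand in for `response.content`.

```python
import base64
import tempfile
from pathlib import Path


def body_to_xml_bytes(body: bytes) -> bytes:
    """Return XML bytes whether the body is raw XML or quoted base64."""
    stripped = body.strip()
    if stripped.startswith(b"<"):
        # Already XML (heuristic: XML documents start with '<')
        return stripped
    # Otherwise assume a quoted base64 payload: drop the quotes and decode
    return base64.b64decode(stripped.strip(b'"'))


def save_xml(body: bytes, path: Path) -> int:
    """Write the normalised XML to path and return the number of bytes written."""
    return path.write_bytes(body_to_xml_bytes(body))


# Stand-ins for response.content in the two cases (assumed sample data)
raw = b'<?xml version="1.0" encoding="UTF-8"?><DOC_ARQ xmlns="urn:fidc"/>'
quoted = b'"' + base64.b64encode(raw) + b'"'

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "foo.xml"
    save_xml(quoted, out)
    saved = out.read_bytes()

print(saved == raw)
```

With a live request, `save_xml(response.content, Path("foo.xml"))` would replace the sample call. This keeps the whole body in memory, so it does not combine with the streaming variant.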