UnicodeEncodeError: 'ascii' codec can't encode character '\xe3' in position 7: ordinal not in range(128)

Question
Some days ago, I accepted this answer to a question of mine as correct, but after a while I noticed that for some URLs I got the following error:
2023-05-29 19:22:20 [scrapy.core.scraper] ERROR: Spider error processing <POST https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx/ExibirPDF> (referer: https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx?NumeroProtocoloEntrega=1106380)
Traceback (most recent call last):
  File "/home/higo/anaconda3/lib/python3.9/base64.py", line 37, in _bytes_from_decode_data
    return s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xe3' in position 7: ordinal not in range(128)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/higo/anaconda3/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/higo/Documentos/Doutorado/Artigo/scrape_fatos/scrape_fatos/spiders/fatos.py", line 63, in download_pdf
    pdf = base64.b64decode(b64)
  File "/home/higo/anaconda3/lib/python3.9/base64.py", line 80, in b64decode
    s = _bytes_from_decode_data(s)
  File "/home/higo/anaconda3/lib/python3.9/base64.py", line 39, in _bytes_from_decode_data
    raise ValueError('string argument should contain only ASCII characters')
ValueError: string argument should contain only ASCII characters
This seemed strange to me, all the more so because the URL mentioned in the error message works normally.
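The failure can be reproduced in isolation: when `base64.b64decode` receives a `str`, it first encodes it as ASCII, so any accented character in the payload (here `'\xe3'`, i.e. `'ã'`) aborts the decode before any base64 processing happens. A minimal sketch, using a hypothetical tainted payload:

```python
import base64

# Hypothetical payload: valid base64 ("SGVsbG8=" decodes to b"Hello")
# polluted with a stray non-ASCII character, as in the traceback above.
tainted = "SGVsbG8ã="

try:
    base64.b64decode(tainted)
except ValueError as e:
    # The str input is first encoded as ASCII inside _bytes_from_decode_data;
    # the resulting UnicodeEncodeError is re-raised as this ValueError.
    print(e)  # string argument should contain only ASCII characters
```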
I tried changing the download_pdf method:
def download_pdf(self, response, protocol_num):
    json_data = response.json()
    b64 = json_data.get('d')
    if b64:
        # Filter out non-ASCII characters
        filtered_b64 = re.sub(r'[^A-Za-z0-9+/=]', '', b64)

        pdf = base64.b64decode(filtered_b64)
        filename = f'{protocol_num}.pdf'
        p = os.path.join(self.base_dir, filename)
        if not os.path.isdir(self.base_dir):
            os.mkdir(self.base_dir)
        with open(p, 'wb') as f:
            f.write(pdf)
        self.log(f"Saved {filename} in {self.base_dir}")
    else:
        self.log("Couldn't download pdf", logging.ERROR)
but without success: with this change, all the saved PDFs simply came out corrupted.
After a few small changes, my full code is as follows:
import base64
import logging
import os
import re
from urllib.parse import unquote
import scrapy
class FatosSpider(scrapy.Spider):
    name = 'fatos'
    allowed_domains = ['cvm.gov.br']
    with open("urls.txt", "rt") as f:
        start_urls = [url.strip() for url in f.readlines()]
    base_dir = './pdf_downloads'
    def parse(self, response):
        id_ = self.get_parameter_by_name("ID", response.url)
        if id_:
            numeroProtocolo = id_
            codInstituicao = 2
        else:
            numeroProtocolo = self.get_parameter_by_name("NumeroProtocoloEntrega", response.url)
            codInstituicao = 1
        dataValue = "{ codigoInstituicao: '" + str(codInstituicao) + "', numeroProtocolo: '" + str(numeroProtocolo) + "'"
        token = response.xpath('//*[@id="hdnTokenB3"]/@value').get(default='')
        versaoCaptcha = ''
        if response.xpath('//*[@id="hdnHabilitaCaptcha"]/@value').get(default='') == 'S':
            if not token:
                versaoCaptcha = 'V3'
        payload = dataValue + ", token: '" + token + "', versaoCaptcha: '" + versaoCaptcha + "'}"
        url = 'https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx/ExibirPDF'
        headers = {
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "application/json; charset=utf-8",
            "DNT": "1",
            "Host": "www.rad.cvm.gov.br",
            "Origin": "https://www.rad.cvm.gov.br",
            "Pragma": "no-cache",
            "Referer": f"https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx?NumeroProtocoloEntrega={numeroProtocolo}",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-origin",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
            "X-Requested-With": "XMLHttpRequest"
        }
        yield scrapy.Request(url=url, headers=headers, body=payload, method='POST', callback=self.download_pdf, cb_kwargs={'protocol_num': numeroProtocolo})
    
    def download_pdf(self, response, protocol_num):
        json_data = response.json()
        b64 = json_data.get('d')
        if b64:
            pdf = base64.b64decode(b64)
            filename = f'{protocol_num}.pdf'
            p = os.path.join(self.base_dir, filename)
            if not os.path.isdir(self.base_dir):
                os.mkdir(self.base_dir)
            with open(p, 'wb') as f:
                f.write(pdf)
            self.log(f"Saved {filename} in {self.base_dir}")
        else:
            self.log("Couldn't download pdf", logging.ERROR)
    @staticmethod
    def get_parameter_by_name(name, url):
        name = name.replace('[', '\\[').replace(']', '\\]')
        results = re.search(r"[?&]" + name + r"(=([^&#]*)|&|#|$)", url)
        if not results:
            return None
        if len(results.groups()) < 2 or not results[2]:
            return ''
        return unquote(results[2])
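As an aside, the hand-rolled `get_parameter_by_name` above can also be expressed with the standard library's URL parsing. A minimal sketch (passing `keep_blank_values=True` so that present-but-empty parameters still come back, mirroring the regex version's `''` result):

```python
from urllib.parse import urlparse, parse_qs

def get_parameter_by_name(name, url):
    # parse_qs returns {name: [value, ...]}; keep_blank_values=True keeps
    # present-but-empty parameters instead of dropping them.
    params = parse_qs(urlparse(url).query, keep_blank_values=True)
    values = params.get(name)
    return values[0] if values else None

url = ("https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx"
       "?NumeroProtocoloEntrega=1106380")
print(get_parameter_by_name("NumeroProtocoloEntrega", url))  # 1106380
print(get_parameter_by_name("ID", url))  # None
```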
How could this situation be resolved?
Answer 1
Score: 1
I just found the solution, based on this answer in another thread. The following line
pdf = base64.b64decode(b64)
was changed to
pdf = base64.b64decode(bytes(b64, 'latin-1'))
Now, all the PDF files are correctly downloaded.
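Why this works: `bytes(b64, 'latin-1')` maps every code point below 256 to a single byte without complaint, and `base64.b64decode` (with its default `validate=False`) then silently discards any byte outside the base64 alphabet before decoding. A small sketch with a hypothetical tainted payload:

```python
import base64

# "SGVsbG8=" is base64 for b"Hello"; 'ã' simulates the stray character
tainted = "SGVsbG8ã="

# latin-1 maps 'ã' to the single byte 0xE3 instead of raising, and
# b64decode (validate=False by default) discards non-alphabet bytes.
decoded = base64.b64decode(bytes(tainted, 'latin-1'))
print(decoded)  # b'Hello'
```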