2023-05-30 02:42:32


UnicodeEncodeError: 'ascii' codec can't encode character '\xe3' in position 7: ordinal not in range(128)

Question


A few days ago I accepted this answer to one of my questions as correct, but after a while I noticed that, for some URLs, I get the following error:

2023-05-29 19:22:20 [scrapy.core.scraper] ERROR: Spider error processing <POST https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx/ExibirPDF> (referer: https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx?NumeroProtocoloEntrega=1106380)
Traceback (most recent call last):
  File "/home/higo/anaconda3/lib/python3.9/base64.py", line 37, in _bytes_from_decode_data
    return s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xe3' in position 7: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/higo/anaconda3/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/higo/Documentos/Doutorado/Artigo/scrape_fatos/scrape_fatos/spiders/fatos.py", line 63, in download_pdf
    pdf = base64.b64decode(b64)
  File "/home/higo/anaconda3/lib/python3.9/base64.py", line 80, in b64decode
    s = _bytes_from_decode_data(s)
  File "/home/higo/anaconda3/lib/python3.9/base64.py", line 39, in _bytes_from_decode_data
    raise ValueError('string argument should contain only ASCII characters')
ValueError: string argument should contain only ASCII characters

This seemed strange to me, all the more so because the URL mentioned in the error message works normally.
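The traceback can be reproduced outside Scrapy. The following is a minimal sketch, assuming the `d` field arrived as a `str` containing at least one non-ASCII character; the `sample` string is made up for illustration and is not taken from the server response.

```python
import base64

# Hypothetical stand-in for json_data.get('d'): '\xe3' sits at index 7.
sample = "QUJDREV\xe3GRw=="

error_message = None
try:
    base64.b64decode(sample)
except ValueError as exc:
    # For a str argument, base64 first calls s.encode('ascii'); the
    # UnicodeEncodeError raised there is replaced by this ValueError.
    error_message = str(exc)
    original_error = exc.__context__

print(error_message)
print(type(original_error).__name__)
```

So the failure depends on the contents of the response body, not on whether the URL itself can be reached.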

I tried changing the download_pdf method:

def download_pdf(self, response, protocol_num):
    json_data = response.json()
    b64 = json_data.get('d')

    if b64:
        # Filter out non-ASCII characters
        filtered_b64 = re.sub(r'[^A-Za-z0-9+/=]', '', b64)

        pdf = base64.b64decode(filtered_b64)
        filename = f'{protocol_num}.pdf'
        p = os.path.join(self.base_dir, filename)

        if not os.path.isdir(self.base_dir):
            os.mkdir(self.base_dir)

        with open(p, 'wb') as f:
            f.write(pdf)

        self.log(f"Saved {filename} in {self.base_dir}")
    else:
        self.log("Couldn't download pdf", logging.ERROR)

But this did not work: with this change, all the saved PDFs were corrupted.
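One way to catch a corrupted result before it is written to disk is to decode strictly and check the file signature. This is only a sketch: `decode_pdf_strict` is a hypothetical helper that is not part of the spider, and it assumes the server sends the Base64 text without line breaks (otherwise they would have to be stripped first).

```python
import base64
import binascii


def decode_pdf_strict(b64_text):
    """Return the decoded PDF bytes, or None if the text is not a Base64 PDF."""
    try:
        # validate=True rejects characters outside the Base64 alphabet
        # instead of silently discarding them.
        data = base64.b64decode(b64_text, validate=True)
    except (ValueError, binascii.Error):
        return None
    # PDF files begin with the "%PDF-" signature.
    return data if data.startswith(b"%PDF-") else None


good = base64.b64encode(b"%PDF-1.4 demo").decode("ascii")
print(decode_pdf_strict(good) is not None)
print(decode_pdf_strict("N\xe3o") is None)
```

A `None` result could then be logged together with the protocol number, so the offending responses can be inspected by hand.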

After a few small changes, my full code is as follows:

import base64
import logging
import os
import re
from urllib.parse import unquote
import scrapy


class FatosSpider(scrapy.Spider):
    name = 'fatos'
    allowed_domains = ['cvm.gov.br']
    with open("urls.txt", "rt") as f:
        start_urls = [url.strip() for url in f.readlines()]
    base_dir = './pdf_downloads'

    def parse(self, response):
        id_ = self.get_parameter_by_name("ID", response.url)

        if id_:
            numeroProtocolo = id_
            codInstituicao = 2
        else:
            numeroProtocolo = self.get_parameter_by_name("NumeroProtocoloEntrega", response.url)
            codInstituicao = 1

        dataValue = "{ codigoInstituicao: '" + str(codInstituicao) + "', numeroProtocolo: '" + str(numeroProtocolo) + "'"
        token = response.xpath('//*[@id="hdnTokenB3"]/@value').get(default='')

        versaoCaptcha = ''
        if response.xpath('//*[@id="hdnHabilitaCaptcha"]/@value').get(default='') == 'S':
            if not token:
                versaoCaptcha = 'V3'

        payload = dataValue + ", token: '" + token + "', versaoCaptcha: '" + versaoCaptcha + "'}"

        url = 'https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx/ExibirPDF'
        headers = {
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "application/json; charset=utf-8",
            "DNT": "1",
            "Host": "www.rad.cvm.gov.br",
            "Origin": "https://www.rad.cvm.gov.br",
            "Pragma": "no-cache",
            "Referer": f"https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx?NumeroProtocoloEntrega={numeroProtocolo}",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-origin",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
            "X-Requested-With": "XMLHttpRequest"
        }

        yield scrapy.Request(url=url, headers=headers, body=payload, method='POST', callback=self.download_pdf, cb_kwargs={'protocol_num': numeroProtocolo})

    def download_pdf(self, response, protocol_num):
        json_data = response.json()
        b64 = json_data.get('d')

        if b64:
            pdf = base64.b64decode(b64)
            filename = f'{protocol_num}.pdf'
            p = os.path.join(self.base_dir, filename)

            if not os.path.isdir(self.base_dir):
                os.mkdir(self.base_dir)

            with open(p, 'wb') as f:
                f.write(pdf)

            self.log(f"Saved {filename} in {self.base_dir}")
        else:
            self.log("Couldn't download pdf", logging.ERROR)

    @staticmethod
    def get_parameter_by_name(name, url):
        name = name.replace('[', '\\[').replace(']', '\\]')

        results = re.search(r"[?&]" + name + r"(=([^&#]*)|&|#|$)", url)
        if not results:
            return None
        if len(results.groups()) < 2 or not results[2]:
            return ''

        return unquote(results[2])

How could this situation be resolved?

Answer 1

Score: 1

I just found the solution, based on this answer in another thread. The following line

pdf = base64.b64decode(b64)

is modified to

pdf = base64.b64decode(bytes(b64, 'latin-1'))

Now, all the PDF files are correctly downloaded.
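A small sketch of why this change avoids the exception, using a made-up payload with one stray character inserted (the real contents of the `d` field are not shown in this thread, so `dirty` is purely illustrative):

```python
import base64

clean = base64.b64encode(b"%PDF-1.7 demo").decode("ascii")
# Hypothetical damage: a single non-ASCII character in the middle of the text.
dirty = clean[:7] + "\xe3" + clean[7:]

# latin-1 maps every code point from U+0000 to U+00FF to exactly one byte,
# so the encoding step succeeds where 'ascii' failed.
raw = bytes(dirty, "latin-1")

# With the default validate=False, bytes outside the Base64 alphabet are
# discarded before decoding.
decoded = base64.b64decode(raw)
print(decoded)
```

Note that the stray bytes are dropped silently rather than reported, and that a character above U+00FF would still make `bytes(b64, 'latin-1')` raise `UnicodeEncodeError`.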
