UnicodeEncodeError: 'ascii' codec can't encode character '\xe3' in position 7: ordinal not in range(128)

Some days ago, I accepted this answer to a question of mine as correct, but after a while I noticed that for some URLs I got the following error:

```
2023-05-29 19:22:20 [scrapy.core.scraper] ERROR: Spider error processing <POST https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx/ExibirPDF> (referer: https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx?NumeroProtocoloEntrega=1106380)
Traceback (most recent call last):
  File "/home/higo/anaconda3/lib/python3.9/base64.py", line 37, in _bytes_from_decode_data
    return s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xe3' in position 7: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/higo/anaconda3/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/higo/Documentos/Doutorado/Artigo/scrape_fatos/scrape_fatos/spiders/fatos.py", line 63, in download_pdf
    pdf = base64.b64decode(b64)
  File "/home/higo/anaconda3/lib/python3.9/base64.py", line 80, in b64decode
    s = _bytes_from_decode_data(s)
  File "/home/higo/anaconda3/lib/python3.9/base64.py", line 39, in _bytes_from_decode_data
    raise ValueError('string argument should contain only ASCII characters')
ValueError: string argument should contain only ASCII characters
```

It seemed strange to me, all the more so because the URL mentioned in the error message works normally.
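The exception chain in the traceback can be reproduced in isolation: when `base64.b64decode` receives a `str`, it first encodes it as ASCII, and any non-ASCII character (the string below is a made-up stand-in containing `'ã'`, i.e. `'\xe3'`) triggers exactly this pair of errors:

```python
import base64

# b64decode() calls s.encode('ascii') on str input; that raises a
# UnicodeEncodeError on 'ã' (U+00E3), which is re-raised as a ValueError.
try:
    base64.b64decode("Informa\xe7\xe3o")  # contains non-ASCII characters
except ValueError as exc:
    print(exc)  # string argument should contain only ASCII characters
```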

I tried changing the download_pdf method:

```python
def download_pdf(self, response, protocol_num):
    json_data = response.json()
    b64 = json_data.get('d')
    if b64:
        # Filter out non-ASCII characters
        filtered_b64 = re.sub(r'[^A-Za-z0-9+/=]', '', b64)
        pdf = base64.b64decode(filtered_b64)
        filename = f'{protocol_num}.pdf'
        p = os.path.join(self.base_dir, filename)
        if not os.path.isdir(self.base_dir):
            os.mkdir(self.base_dir)
        with open(p, 'wb') as f:
            f.write(pdf)
        self.log(f"Saved (unknown) in {self.base_dir}")
    else:
        self.log("Couldn't download pdf", logging.ERROR)
```

but I was not successful: with this change, all of the saved PDFs were simply corrupted.
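One way to catch such corruption early, rather than discovering it when opening the files, is to sanity-check the decoded bytes before writing them. This is a minimal sketch of my own (not part of the original spider): every well-formed PDF begins with the magic bytes `%PDF-`.

```python
def looks_like_pdf(data: bytes) -> bool:
    # Cheap sanity check: a valid PDF file starts with the "%PDF-" header
    # (followed by a version number such as 1.4). This does not validate
    # the rest of the file, but it reliably flags garbage output.
    return data.startswith(b"%PDF-")

print(looks_like_pdf(b"%PDF-1.4\n..."))    # True
print(looks_like_pdf(b"\x00\x11garbage"))  # False
```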

After some small changes, my full code is as follows:

```python
import base64
import logging
import os
import re
from urllib.parse import unquote

import scrapy


class FatosSpider(scrapy.Spider):
    name = 'fatos'
    allowed_domains = ['cvm.gov.br']
    with open("urls.txt", "rt") as f:
        start_urls = [url.strip() for url in f.readlines()]
    base_dir = './pdf_downloads'

    def parse(self, response):
        id_ = self.get_parameter_by_name("ID", response.url)
        if id_:
            numeroProtocolo = id_
            codInstituicao = 2
        else:
            numeroProtocolo = self.get_parameter_by_name("NumeroProtocoloEntrega", response.url)
            codInstituicao = 1
        dataValue = "{ codigoInstituicao: '" + str(codInstituicao) + "', numeroProtocolo: '" + str(numeroProtocolo) + "'"
        token = response.xpath('//*[@id="hdnTokenB3"]/@value').get(default='')
        versaoCaptcha = ''
        if response.xpath('//*[@id="hdnHabilitaCaptcha"]/@value').get(default='') == 'S':
            if not token:
                versaoCaptcha = 'V3'
        payload = dataValue + ", token: '" + token + "', versaoCaptcha: '" + versaoCaptcha + "'}"
        url = 'https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx/ExibirPDF'
        headers = {
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "application/json; charset=utf-8",
            "DNT": "1",
            "Host": "www.rad.cvm.gov.br",
            "Origin": "https://www.rad.cvm.gov.br",
            "Pragma": "no-cache",
            "Referer": f"https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx?NumeroProtocoloEntrega={numeroProtocolo}",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-origin",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
            "X-Requested-With": "XMLHttpRequest"
        }
        yield scrapy.Request(url=url, headers=headers, body=payload, method='POST', callback=self.download_pdf, cb_kwargs={'protocol_num': numeroProtocolo})

    def download_pdf(self, response, protocol_num):
        json_data = response.json()
        b64 = json_data.get('d')
        if b64:
            pdf = base64.b64decode(b64)
            filename = f'{protocol_num}.pdf'
            p = os.path.join(self.base_dir, filename)
            if not os.path.isdir(self.base_dir):
                os.mkdir(self.base_dir)
            with open(p, 'wb') as f:
                f.write(pdf)
            self.log(f"Saved (unknown) in {self.base_dir}")
        else:
            self.log("Couldn't download pdf", logging.ERROR)

    @staticmethod
    def get_parameter_by_name(name, url):
        name = name.replace('[', '\\[').replace(']', '\\]')
        results = re.search(r"[?&]" + name + r"(=([^&#]*)|&|#|$)", url)
        if not results:
            return None
        if len(results.groups()) < 2 or not results[2]:
            return ''
        return unquote(results[2])
```
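As a side note, the `get_parameter_by_name` helper extracts a query-string parameter with a regex; its behaviour can be checked in isolation (the function body below is copied from the spider, just lifted out of the class):

```python
import re
from urllib.parse import unquote

def get_parameter_by_name(name, url):
    # Same logic as the static method in FatosSpider.
    name = name.replace('[', '\\[').replace(']', '\\]')
    results = re.search(r"[?&]" + name + r"(=([^&#]*)|&|#|$)", url)
    if not results:
        return None
    if len(results.groups()) < 2 or not results[2]:
        return ''
    return unquote(results[2])

url = ("https://www.rad.cvm.gov.br/ENET/frmExibirArquivoIPEExterno.aspx"
       "?NumeroProtocoloEntrega=1106380")
print(get_parameter_by_name("NumeroProtocoloEntrega", url))  # 1106380
print(get_parameter_by_name("ID", url))                      # None
```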

How could this situation be resolved?

Answer 1 (score: 1)
I just found the solution, based on this answer in another thread. The following line

```python
pdf = base64.b64decode(b64)
```

is modified to

```python
pdf = base64.b64decode(bytes(b64, 'latin-1'))
```

Now all the PDF files are downloaded correctly.
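Why this works (my reading, not part of the original answer): `bytes(b64, 'latin-1')` maps every code point 0–255 to a single byte, so the conversion never fails the way the implicit ASCII encode does; and in its default non-validating mode, `base64.b64decode` discards bytes that are not in the base64 alphabet, so a stray `'\xe3'` in the payload is ignored instead of aborting the decode. A tiny demonstration with a hypothetical polluted value:

```python
import base64

payload = "QUJD\xe3"  # hypothetical 'd' value: valid base64 ("QUJD" -> b"ABC")
                      # polluted by one stray non-ASCII character

# Passing the str directly fails because of the implicit ASCII encode:
try:
    base64.b64decode(payload)
except ValueError as exc:
    print(exc)  # string argument should contain only ASCII characters

# Encoding as latin-1 first succeeds; the 0xE3 byte is outside the base64
# alphabet and is silently discarded by the default non-validating decoder.
print(base64.b64decode(bytes(payload, 'latin-1')))  # b'ABC'
```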

huangapple
  • Published on 2023-05-30 02:42:32
  • When reposting, please retain the link to this article: https://go.coder-hub.com/76359681.html