June 5, 2023, 19:57:57

UnicodeDecodeError: 'charmap' when using BeautifulSoup

Question

Hi, I'm working through the Udemy boot camp 100 Days of Code. I'm currently on the web scraping lesson that uses BeautifulSoup, but I can't finish it because I keep getting an error that I don't understand and don't know how to fix, even though the code is very simple. Here is my Python code:

from bs4 import BeautifulSoup

with open("website.html") as file:
    html_doc = file.read()

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.name)

Here is the error:

Traceback (most recent call last):
  File "C:\Users\xarss\Desktop0 days of python\Webdev_projects\Websrapingproyect\main.py", line 12, in <module>
    html_doc = file.read()
  File "C:\Users\xarss\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 281: character maps to <undefined>

I have already tried reinstalling the BeautifulSoup package and using other HTML files, but the problem persists.

<!DOCTYPE html>
<html>

<head>
	<meta charset="utf-8">
	<title>Angela's Personal Site</title>
</head>

<body>
	<h1 id="name">Angela Yu</h1>
	<p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>
	<p>I am an iOS and Web Developer. I ❤️ coffee and motorcycles.</p>
	<hr>
	<h3 class="heading">Books and Teaching</h3>
	<ul>
		<li>The Complete iOS App Development Bootcamp</li>
		<li>The Complete Web Development Bootcamp</li>
		<li>100 Days of Code - The Complete Python Bootcamp</li>
	</ul>
	<hr>
	<h3 class="heading">Other Pages</h3>
	<a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>
	<a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>
</body>

</html>

Answer 1

Score: 0

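This error occurs because the file is not encoded in cp1252. When no encoding is passed, open falls back to the platform's locale encoding, which is cp1252 on this Windows machine (note the cp1252.py frame in the traceback).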
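A minimal sketch of why byte 0x9d in particular trips the decoder. It assumes the offending byte comes from the heart character in the page body (U+2764, which UTF-8 encodes as E2 9D A4); that is a likely source, not something the traceback confirms.

```python
# U+2764 (the heart in the page's "I ❤️ coffee" line) as UTF-8 bytes.
heart_bytes = "\u2764".encode("utf-8")
print(heart_bytes)  # b'\xe2\x9d\xa4'

# cp1252 assigns no character to byte 0x9d, so strict decoding fails there.
try:
    heart_bytes.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc
    print(exc)

# Decoding with the encoding the bytes were written in round-trips cleanly.
decoded = heart_bytes.decode("utf-8")
print(decoded == "\u2764")  # True
```

The same bytes are valid UTF-8 and invalid cp1252, which is why the file itself is fine and only the encoding passed to open needs to change.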

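You need to find out which encoding the file actually uses and pass it when opening the file.

In this case, as line 5 of the HTML file shows, the file is encoded in UTF-8: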

<meta charset="utf-8">
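When the declaration is not obvious from reading the file, a rough sketch like the following can pull it out of the raw bytes. The helper name declared_charset and its regular expression are illustrative assumptions, not part of BeautifulSoup, and a declared charset is only a hint, since the file may have been saved in a different encoding.

```python
import re
from typing import Optional


def declared_charset(head: bytes) -> Optional[str]:
    """Return the charset named in a <meta ... charset=...> tag, or None."""
    match = re.search(rb'<meta[^>]*charset=["\']?([\w-]+)', head, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


# The first lines of the HTML file from the question, as raw bytes.
head = b'<!DOCTYPE html>\n<html>\n\n<head>\n\t<meta charset="utf-8">\n'
print(declared_charset(head))  # utf-8
```

With a real file, the bytes could come from opening it in binary mode, for example open("website.html", "rb").read(1024), so that no decoding happens before the charset is known.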

Here is the updated code:

from bs4 import BeautifulSoup

with open("website.html", encoding="utf-8") as file:
    soup = BeautifulSoup(file, 'html.parser')
    print(soup.title.string)

Hope this helps. If this solves your issue, don't forget to accept the answer.

Answer 2

Score: 0

This is a common error when opening a file without knowing its encoding.

One of the following approaches may work:

with open("website.html", errors="ignore") as file:

with open("website.html", errors='replace') as file:

with open("website.html", 'rb') as file:
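These three options behave quite differently, which the sketch below tries to show. It writes a small hypothetical sample to a temporary file instead of using website.html, and it passes encoding="cp1252" explicitly so that the Windows default from the question is reproduced on any platform; both are assumptions made for the demonstration.

```python
import os
import tempfile

# Hypothetical stand-in for website.html, saved as UTF-8.
sample = "<p>I \u2764 coffee</p>"
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    # errors="ignore": the undecodable byte is dropped; the rest is mis-decoded.
    with open(path, encoding="cp1252", errors="ignore") as file:
        ignored = file.read()

    # errors="replace": the undecodable byte becomes U+FFFD.
    with open(path, encoding="cp1252", errors="replace") as file:
        replaced = file.read()

    # "rb": no decoding takes place; the raw bytes come back unchanged.
    with open(path, "rb") as file:
        raw = file.read()
finally:
    os.remove(path)

print(repr(ignored))
print(repr(replaced))
print(raw.decode("utf-8") == sample)
```

The first two options silence the exception but leave garbled text in place of the heart, so they suit cases where the affected characters do not matter. Reading in binary mode keeps the data intact, and BeautifulSoup accepts bytes and will try to detect the encoding itself.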
