UnicodeDecodeError: 'charmap' when using BeautifulSoup

Question

Hi, I'm working through the 100 Days of Code boot camp on Udemy. I'm currently on the web scraping lesson using BeautifulSoup, but I haven't been able to finish it because I keep getting a UnicodeDecodeError that I can't explain or fix, even though the code is very simple. Here is my Python code:

from bs4 import BeautifulSoup

with open("website.html") as file:
    html_doc = file.read()

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.name)

Here is the error:

Traceback (most recent call last):
  File "C:\Users\xarss\Desktop0 days of python\Webdev_projects\Websrapingproyect\main.py", line 12, in <module>
    html_doc = file.read()
  File "C:\Users\xarss\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 281: character maps to <undefined>

I have already tried reinstalling the Beautiful Soup package and using other HTML files, but the problem persists.

Here is the HTML file:

<!DOCTYPE html>
<html>

<head>
	<meta charset="utf-8">
	<title>Angela's Personal Site</title>
</head>

<body>
	<h1 id="name">Angela Yu</h1>
	<p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>
	<p>I am an iOS and Web Developer. I ❤️ coffee and motorcycles.</p>
	<hr>
	<h3 class="heading">Books and Teaching</h3>
	<ul>
		<li>The Complete iOS App Development Bootcamp</li>
		<li>The Complete Web Development Bootcamp</li>
		<li>100 Days of Code - The Complete Python Bootcamp</li>
	</ul>
	<hr>
	<h3 class="heading">Other Pages</h3>
	<a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>
	<a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>
</body>

</html>

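Answer 1

Score: 0

This error occurs because the file is not encoded in cp1252, which is the encoding open() used by default here. When no encoding argument is given, open() takes its default from the system locale, and the traceback shows that on this machine it is cp1252.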
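
As a quick check of that default, the following sketch (standard library only; the variable name is just illustrative) prints the encoding that open() falls back to in text mode. On a Windows setup like the asker's it would be expected to report cp1252, but the value depends on the system locale, so treat the output as machine-specific.

```python
import locale

# Encoding that open() falls back to in text mode when encoding= is omitted.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints anything other than the encoding the file was saved in, reading non-ASCII characters without an explicit encoding argument is likely to fail or produce garbled text.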

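You need to find out which encoding the file actually uses and then specify it when opening the file.

In this case, as you can see on line 5 of the HTML, the file declares utf-8:

<meta charset="utf-8">

Here is the updated code: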

from bs4 import BeautifulSoup

with open("website.html", encoding="utf-8") as file:
    soup = BeautifulSoup(file, 'html.parser')
    print(soup.title.string)
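
To see why byte 0x9d in particular triggers the error, here is a small sketch that does not touch the file at all. It assumes the offending character is the heart in the page text, which is a guess based on the HTML shown in the question. The UTF-8 encoding of that character contains the byte 0x9d, which cp1252 leaves undefined.

```python
# UTF-8 bytes of the heart character (U+2764) that appears in the page text.
heart_bytes = "\u2764".encode("utf-8")
print(heart_bytes)  # b'\xe2\x9d\xa4'

# Decoding those bytes as cp1252 fails on 0x9d, which cp1252 does not define.
decode_error = None
try:
    heart_bytes.decode("cp1252")
except UnicodeDecodeError as exc:
    decode_error = str(exc)
print(decode_error)

# Decoding with the encoding the bytes were written in succeeds.
decoded = heart_bytes.decode("utf-8")
print(ascii(decoded))  # '\u2764'
```

The message captured in decode_error has the same shape as the one in the traceback, only with a different position because the sample is shorter than the file.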

Hope this helps. Don't forget to accept an answer if your issue is solved.

Answer 2

Score: 0

This is a common error when opening a file whose encoding we don't know.

One of the methods below may work.

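with open("website.html", errors="ignore") as file:  # drop undecodable bytes
    html_doc = file.read()

with open("website.html", errors="replace") as file:  # substitute a placeholder character
    html_doc = file.read()

with open("website.html", "rb") as file:  # read raw bytes, no decoding
    html_doc = file.read()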
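
These options are not equivalent, and the first two only hide the exception while still decoding with the wrong default encoding. The sketch below compares them on in-memory bytes rather than on website.html; the sample text is made up for illustration. With binary mode nothing is decoded at open time, so the encoding still has to be chosen afterwards (BeautifulSoup also accepts bytes and attempts to detect the encoding itself).

```python
# Sample UTF-8 bytes standing in for the file contents (made-up text).
raw = "I \u2764 coffee".encode("utf-8")

# errors="ignore": the undecodable byte is dropped, the rest is misread.
ignored = raw.decode("cp1252", errors="ignore")

# errors="replace": the undecodable byte becomes U+FFFD.
replaced = raw.decode("cp1252", errors="replace")

# Binary mode skips decoding; here the bytes are decoded explicitly as UTF-8.
correct = raw.decode("utf-8")

print(ascii(ignored))
print(ascii(replaced))
print(ascii(correct))
```

Only the last variant recovers the original text, which is why specifying the real encoding is usually preferable to suppressing the error.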

huangapple
  • Posted on June 5, 2023, 19:57:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76406192.html