UnicodeDecodeError: 'charmap' when using BeautifulSoup
Question
Hi, I'm working through the Udemy bootcamp 100 Days of Code. I'm currently on the web scraping lesson that uses BeautifulSoup, but I haven't been able to complete it because I keep getting an error. I don't know why it happens or how to fix it, since the code is very simple. Here is my Python code:
from bs4 import BeautifulSoup

with open("website.html") as file:
    html_doc = file.read()

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.name)
Here is the error:
Traceback (most recent call last):
  File "C:\Users\xarss\Desktop0 days of python\Webdev_projects\Websrapingproyect\main.py", line 12, in <module>
    html_doc = file.read()
  File "C:\Users\xarss\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 281: character maps to <undefined>
I have already tried reinstalling the Beautiful Soup package and using other HTML files, but the problem persists.
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Angela's Personal Site</title>
</head>
<body>
  <h1 id="name">Angela Yu</h1>
  <p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>
  <p>I am an iOS and Web Developer. I ❤️ coffee and motorcycles.</p>
  <hr>
  <h3 class="heading">Books and Teaching</h3>
  <ul>
    <li>The Complete iOS App Development Bootcamp</li>
    <li>The Complete Web Development Bootcamp</li>
    <li>100 Days of Code - The Complete Python Bootcamp</li>
  </ul>
  <hr>
  <h3 class="heading">Other Pages</h3>
  <a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>
  <a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>
</body>
</html>
Answer 1
Score: 0
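This error occurs because the file is not encoded in cp1252, which is the encoding open uses by default here. When no encoding is given, open falls back to the platform's locale encoding, and on your Windows setup that is cp1252.
You need to find out which encoding the file actually uses and then specify it when opening the file.
In this case the file is encoded in UTF-8, as the <meta> tag in its <head> shows: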
<meta charset="utf-8">
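If you want to check the declared encoding programmatically rather than by eye, here is a minimal sketch using only the standard library. The helper name `sniff_meta_charset` and the 1024-byte limit are my own assumptions for illustration, not part of BeautifulSoup or Python. Keep in mind that the tag only states what the author declared; the file could in principle have been saved in a different encoding.

```python
import re
from typing import Optional

# Hypothetical helper (not part of BeautifulSoup or the standard library):
# look for a charset declaration inside a <meta> tag near the top of the file.
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def sniff_meta_charset(raw: bytes, limit: int = 1024) -> Optional[str]:
    """Return the charset named in a <meta> tag, or None if none is found."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


# Stand-in for the first bytes of website.html, read with open(..., "rb").
sample = b'<!DOCTYPE html>\n<html>\n<head>\n<meta charset="utf-8">\n</head>\n</html>'
declared = sniff_meta_charset(sample)
print(declared)
```

Reading the file in binary mode first avoids the decoding step entirely, so the lookup itself cannot raise UnicodeDecodeError.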
Here is the updated code:
from bs4 import BeautifulSoup

# Pass the encoding explicitly so the locale default (cp1252) is not used.
with open("website.html", encoding="utf-8") as file:
    soup = BeautifulSoup(file, 'html.parser')

print(soup.title.string)
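To see where byte 0x9d likely comes from, here is a small sketch. It assumes the heart in the page is the character U+2764, which I have not verified against your actual file. In UTF-8 that character is stored as three bytes, and the middle one is 0x9d, which has no mapping in cp1252.

```python
heart = "\u2764"                    # assumed to be the heart character in website.html
utf8_bytes = heart.encode("utf-8")  # three bytes in UTF-8
print(utf8_bytes.hex(" "))          # e2 9d a4

# Decoding those bytes as cp1252 fails because 0x9d is undefined there.
try:
    utf8_bytes.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError as exc:
    cp1252_ok = False
    print(exc)

# Decoding them as UTF-8 gives the original character back.
decoded = utf8_bytes.decode("utf-8")
print(decoded == heart)
```

This is the same failure the traceback reports, only reproduced on three bytes instead of the whole file.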
Hope this helps. Don't forget to accept an answer if your issue is solved.
Answer 2
Score: 0
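This is a common error when opening a file without knowing its encoding.
One of the options below may work. Note that the first two only suppress the error: errors="ignore" drops the bytes that cannot be decoded and errors="replace" substitutes a replacement character, so neither recovers the original text. Opening in binary mode ('rb') returns raw bytes, which BeautifulSoup accepts and decodes itself.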
with open("website.html", errors="ignore") as file:
with open("website.html", errors='replace') as file:
with open("website.html", 'rb') as file:
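A rough sketch of how the three options differ, using a temporary file as a stand-in for website.html (the file content is made up for illustration). cp1252 is passed explicitly so the result does not depend on the platform default.

```python
import os
import tempfile

# Write a small UTF-8 file standing in for website.html (hypothetical content).
text = "I \u2764 coffee"
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    with open(path, encoding="cp1252", errors="ignore") as file:
        ignored = file.read()    # the undecodable byte is silently dropped
    with open(path, encoding="cp1252", errors="replace") as file:
        replaced = file.read()   # the undecodable byte becomes U+FFFD
    with open(path, "rb") as file:
        raw = file.read()        # raw bytes, no decoding attempted
    with open(path, encoding="utf-8") as file:
        correct = file.read()    # the right encoding recovers the text
finally:
    os.remove(path)

print(repr(ignored))
print(repr(replaced))
print(raw.decode("utf-8") == correct == text)
```

With ignore and replace the heart is lost and the surrounding bytes turn into unrelated characters, so those options are mostly useful when the damaged characters do not matter. Reading bytes or naming the correct encoding keeps the text intact.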