Website giving wrong HTML code when scraping

Question

While trying to scrape nutritional data from a website (https://www.realcanadiansuperstore.ca), I noticed that no matter what I tried, I couldn't get the proper HTML code from the site, so I couldn't get the nutritional information. I tried almost everything I've seen online, but I still can't get it to work. The only thing that worked was manually downloading the code for every page.

from urllib.request import Request, urlopen

fp = Request(
    url='https://www.realcanadiansuperstore.ca/honey-nut-cheerios-breakfast-cereal-family-size-wh/p/21103495_EA',
    headers={'User-Agent': 'Mozilla/5.0'}
)

# Open the connection once; the with-block closes it when done
with urlopen(fp) as response:
    mybytes = response.read()

mystr = mybytes.decode("utf8")

print(mystr)

This is the code I tried, but it always returns the wrong HTML code. I've already figured out how to find the nutritional values in the proper code; I just need to figure out how to get that code in an efficient way. Please don't judge if the answer is obvious. I just started learning this a few days ago haha.

Answer 1

Score: 2

I'm going to write this as an answer, in the hopes that we can use it as a duplicate closer in the future, since this has become one of the most frequently asked Python questions.

Web Pages

There are two parts to displaying a web page. First is the HTML that gets sent from the web server to the browser. This is what you see if you do "View Source" within your browser, it's what you get if you use curl or wget to fetch the page, and it's what you get if you read the page with an HTTP client such as urllib or requests, as you are doing. The browser interprets this HTML to produce a tree of user interface objects, which is called the "DOM" or "Document Object Model", and the DOM is rendered on the screen.

However, in many cases, this HTML is only partially responsible for what you see on the screen. Most websites use JavaScript to build parts of the screen image on the fly, and some pages use JavaScript to build everything. The JavaScript code modifies the DOM by adding, editing, or deleting components. The DOM exists only in the browser's memory as a tree of objects. You see this DOM tree when you use the Developer Tools in your browser.
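To make the difference concrete, here is a minimal sketch using only the standard library. The page in RAW_HTML is a made-up example, not the real site: the server sends an empty container and a script that would fill it in. A plain HTML parser never runs that script, so the container stays empty.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container, and a script
# fills it in only when a browser actually executes it.
RAW_HTML = """
<html><body>
  <div id="nutrition"></div>
  <script>
    document.getElementById("nutrition").textContent = "filled in by JavaScript";
  </script>
</body></html>
"""


class TextInElement(HTMLParser):
    """Collect the text found inside the element with a given id.

    Simplification: assumes every tag inside the target has a closing tag
    (no bare void tags such as <br>).
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_text(html, element_id):
    """Return the text inside element_id as seen WITHOUT running any scripts."""
    parser = TextInElement(element_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


# The script's text exists in the download, but the div it targets is empty.
print(repr(static_text(RAW_HTML, "nutrition")))  # prints ''
```

A quick check along these lines (is the text I want inside the element in the downloaded HTML?) is usually enough to tell whether a page builds its content with JavaScript.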

If you are seeing components in your Developer Tools that are not present in the "View Source" HTML, then the only way you can get those components is to use a real browser, driven by a package like requests-html or Selenium. Those packages run a browser like Chrome or Firefox, which interprets the JavaScript code, and they provide APIs through which you can walk through the objects in the DOM.
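As a rough sketch of the Selenium route, assuming Selenium 4 and a local Chrome installation: the function name rendered_texts and its parameters are made up for illustration, and the CSS selector for the nutrition block is something you would have to find yourself in Developer Tools.

```python
def rendered_texts(url, css_selector, timeout=15):
    """Return the text of every element matching css_selector in the live DOM.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Imported here so the sketch can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until JavaScript has added at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        # Query the DOM through Selenium's API rather than the raw HTML.
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return [element.text for element in elements]
    finally:
        driver.quit()  # always shut the browser down
```

You would call it with the product URL from the question and whatever selector matches the nutrition table on that page. The explicit wait matters because the elements may not exist yet at the moment the page finishes loading.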

Some people try to use requests-html and just read the "text" attribute, but that only gives you the original HTML. The browser does not actually maintain a modified set of HTML, so you can't simply download the final page you see and search it with BeautifulSoup. As a rule, BeautifulSoup is not useful if you need to query the live DOM, because there's no HTML for it to parse.
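For requests-html, the point above might look like the sketch below. It assumes the requests-html package is installed (it downloads a copy of Chromium the first time render() is called); the function name and the selector parameter are hypothetical.

```python
def texts_after_render(url, css_selector, timeout=20):
    """Fetch url, run its JavaScript, then query the rendered result.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Imported here so the sketch can be loaded without requests-html installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        # response.text at this point is only the original HTML from the server.
        # render() loads the page in headless Chromium and runs its JavaScript.
        response.html.render(timeout=timeout)
        # find() now searches the rendered content, not the original download.
        return [element.text for element in response.html.find(css_selector)]
    finally:
        session.close()
```

The key step is the call to render(): without it, both response.text and response.html reflect only what the server sent.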

huangapple
  • Posted on July 6, 2023, 10:35:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76625149.html