Website cannot be scraped? Not giving full source code


Question


Whenever I print the response, I get a very short result, as if it's not able to get the full information from the website.

import requests
from bs4 import BeautifulSoup
import time

# User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}

# Send an HTTP request to the webpage
response = requests.get('https://order.mikunisushi.com/menu/mikuni-folsom', headers=headers)

# Parse the HTML content of the webpage
soup = BeautifulSoup(response.text, 'html.parser')

# Find the product information element on the webpage
product_info = soup.find('div', class_='product__info')

if product_info:
  # Extract the product name and price from the element
  name = product_info.find('h1').text
  price = product_info.find('span', class_='price').text

  print(f'Product name: {name}')
  print(f'Product price: {price}')
else:
  print('Product information not found')

I thought they might be blocking web scraping, so I added a User-Agent header, but nothing has worked. Is there no way to pull information from the website?

Answer 1

Score: 1


Turn JavaScript off in your web browser and go to the page. That's what requests sees. requests is not a web browser; it's a library that makes HTTP requests. An HTTP request asks a web server for information (often, but not always, in the form of an HTML page) and gives that information back to you.

Many sites nowadays use JavaScript to render part or all of their content. Your browser fetches the initial HTML page in the usual way, and then that page runs code in your browser to produce the page you see. This usually happens fast enough that you don't notice, although with a keen eye you can catch it when a page starts off as a progress bar or blurred placeholder text and then renders a moment later.

What does this mean for you? When you query the server using requests, curl, or another basic HTTP tool, you get the original HTML along with the code that produces the page a human user would see. But you don't run that code. As an analogy, think about having a delivery service bring you furniture. They go to the store, pick up a box containing all of the pieces of the furniture, and leave that on your doorstep. But you still have to go to the trouble of building it, or else you've just got a box full of wood and an Allen wrench.
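One rough way to check whether you received the "box of parts" is to compare how many script tags the raw HTML contains with how much text sits outside them. The sketch below uses only the standard library. The names PageSummary, summarize, and SAMPLE_SHELL are made up for illustration, and the sample markup is a hypothetical stand-in, not the actual response from the site in the question.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Tally <script> tags and text outside <script>/<style> in raw HTML."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and pure whitespace between tags.
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def summarize(html):
    """Return (number of script tags, characters of text outside scripts)."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return parser.script_count, parser.visible_chars


# Hypothetical shell page: an empty mount point plus the code that fills it.
SAMPLE_SHELL = """<!DOCTYPE html>
<html>
  <head><script src="/static/app.js"></script></head>
  <body><div id="root"></div><script>window.__STATE__ = {};</script></body>
</html>"""

scripts, visible = summarize(SAMPLE_SHELL)
print(f"script tags: {scripts}, visible characters: {visible}")
```

Calling summarize(response.text) on the response from the question would give the same two numbers for the real page. Several script tags and almost no visible text suggest that the content is built in the browser, although this is only a heuristic.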

The solution is to use a web driver rather than an HTTP request library. A web driver automates a real web browser, so embedded scripts run just as they would for a human visitor. The most popular web driver for Python is Selenium WebDriver. Because Selenium drives an actual browser, it should be able to see the "true" page.
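As a minimal sketch of that approach, the function below loads a page in headless Chrome, waits for an element to appear, and returns the rendered HTML. It assumes Selenium 4 and a local Chrome installation. The name fetch_rendered_html and its parameters are made up for illustration, and the div.product__info selector is taken from the question without checking that it exists on the rendered page.

```python
def fetch_rendered_html(url, wait_css, timeout=15):
    """Load url in headless Chrome and return the HTML after scripts have run.

    wait_css is a CSS selector for an element that only exists once the
    page has rendered; the call raises TimeoutException if it never appears.
    """
    # Imported inside the function so this file can be loaded for inspection
    # on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JavaScript-rendered element is present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


# Usage (needs Selenium, Chrome, and network access; the selector is the
# unverified one from the question):
# html = fetch_rendered_html(
#     "https://order.mikunisushi.com/menu/mikuni-folsom",
#     "div.product__info",
# )
# soup = BeautifulSoup(html, "html.parser")
```

The returned string can be passed to BeautifulSoup exactly as response.text was in the question, so the parsing code does not need to change. If the wait times out, the selector is probably wrong for the rendered page and should be checked in the browser's developer tools.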

huangapple
  • Posted on January 6, 2023, 11:50:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75026755.html