Website cannot be scraped? requests is not returning the full source code







Whenever I print the response, I get a very short result, as if it's not able to get the full information from the website.

import requests
from bs4 import BeautifulSoup
import time

# User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}

# Send an HTTP request to the webpage
response = requests.get('', headers=headers)

# Parse the HTML content of the webpage
soup = BeautifulSoup(response.text, 'html.parser')

# Find the product information element on the webpage
product_info = soup.find('div', class_='product__info')

if product_info:
  # Extract the product name and price from the element
  name = product_info.find('h1').text
  price = product_info.find('span', class_='price').text

  print(f'Product name: {name}')
  print(f'Product price: {price}')
else:
  print('Product information not found')

I thought maybe they are blocking web scraping, so I tried sending headers, but nothing has worked. Is there no way to pull information from the website?


Score: 1






Turn JavaScript off in your web browser and go to the page. That's what requests sees. requests is not a web browser; it's a library that makes HTTP requests. An HTTP request asks a web server for information (often, but not always, in the form of an HTML page) and gives that information back to you.

Many sites nowadays use JavaScript to render part or all of their content. That means that your browser fetches the initial HTML page via the usual means, and then that page runs some code in your browser to produce the page you see. This usually happens so fast that you don't even notice (if you've got a keen eye, you'll see it when a page starts off as a progress bar or one of those blurred placeholder-text blocks and then renders a moment later).

What does this mean for you? When you query the server using requests or curl or some other basic HTTP request tool, you get the original HTML, as well as the code that produces the page a human user would see. But you don't run that code. As an analogy, think about having a delivery service bring you furniture. They go to the store, pick up a box containing all of the pieces of the furniture, and leave that on your doorstep. But you still have to go to the trouble of building it, or else you've just got a box full of wood and an Allen wrench.
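
To see this in miniature, here is a rough sketch using a made-up HTML string (a stand-in, not the actual site from the question). The div with class product__info only exists inside a script that a browser would run. BeautifulSoup parses the markup but never executes the script, so the lookup comes back empty, which is the same symptom as in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what the server sends back: an empty container
# plus a script that a real browser would run to fill it in.
RAW_HTML = """
<html>
  <body>
    <div id="app"></div>
    <script>
      document.getElementById("app").innerHTML =
        '<div class="product__info"><h1>Widget</h1></div>';
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# The parser treats the <script> body as plain text and never executes it,
# so the element the script would create is not in the parsed tree.
product_info = soup.find("div", class_="product__info")
script_count = len(soup.find_all("script"))

print(product_info)  # None
print(script_count)  # 1
```

If the real page behaves like this, checking whether 'product__info' appears anywhere in response.text is a quick way to confirm that the content is being built by a script.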

The solution is to use a web driver rather than an HTTP request library. A web driver controls a real web browser, so it looks and acts just like one, including running embedded scripts. The most popular web driver for Python is the Selenium WebDriver. Selenium is designed to behave as much like a "real" web browser as is reasonably possible, so it should be able to see the "true" page.
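
As a minimal sketch of that approach, assuming Selenium 4 and a local Chrome install: the function names fetch_rendered_html and extract_product are made up for illustration, and the selectors are taken from the question's code, so they may not match the real page. The first function lets the browser run the page's scripts and returns the resulting HTML; the second parses it with BeautifulSoup as before.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=10):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the script-generated element exists
        # (selector assumed from the question's code).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.product__info"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def extract_product(html):
    """Return (name, price) from rendered HTML, or None if either is missing."""
    soup = BeautifulSoup(html, "html.parser")
    product_info = soup.find("div", class_="product__info")
    if product_info is None:
        return None
    name = product_info.find("h1")
    price = product_info.find("span", class_="price")
    if name is None or price is None:
        return None
    return name.get_text(strip=True), price.get_text(strip=True)
```

Calling extract_product(fetch_rendered_html(url)) with the page's URL (which is not shown in the question) should then return the name and price, provided the selectors match. Keeping the parsing separate from the browser step also means the parsing can be checked against a saved copy of the page.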

  • Posted on January 6, 2023, 11:50:29



