No data found when webscraping with Python?

Question


I'm fairly new to coding, and I'm supposed to parse Yelp reviews so I can analyze the data using pandas. I've been trying to automate the whole process with Selenium and BeautifulSoup, and I got past the Chrome/ChromeDriver issues by running it on my local machine. It technically "works" now, but no data is displayed in the output. I feel like I've tried everything; can someone please tell me what I'm doing wrong?

I suspect it could be an issue with the HTML tag classes for the URL in the code, but I'm not sure which ones to use. It also seems odd that this particular business page has only 47 reviews, yet the generated CSV file has 1384 rows.
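One likely explanation for the 47-reviews-vs-1384-rows mismatch is that Yelp reuses generated utility classes like `border-color--default__09f24__NPAKY` on many unrelated `div`s, so `find_all` on that class matches far more elements than there are reviews. Here is a minimal offline sketch of the effect; the HTML, the shortened class name, and the `data-review-id` attribute are invented for illustration (inspect the live page in devtools for the real markup):

```python
from bs4 import BeautifulSoup

# Toy page: the utility class appears on a wrapper AND on real reviews,
# mimicking how Yelp reuses generated utility classes everywhere.
html = """
<ul>
  <li><div class="border-color--default">wrapper junk</div></li>
  <li><div class="border-color--default" data-review-id="r1">Great gelato!</div></li>
  <li><div class="border-color--default" data-review-id="r2">Too sweet.</div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Matching on the shared utility class over-matches:
by_class = soup.find_all("div", class_="border-color--default")
print(len(by_class))  # 3 -- includes the non-review wrapper

# Scoping to an attribute that only reviews carry narrows it down:
by_attr = soup.select("div[data-review-id]")
print(len(by_attr))   # 2 -- one element per actual review
```

Printing `len(reviews)` right after the `find_all` call in the script below would confirm whether the selector is over-matching in the same way.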

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import os

# Set the path to the ChromeDriver executable
chromedriver_path = "C:\\Users\\5mxz2\\Downloads\\chromedriver_win32\\chromedriver"

# Set the path to the Chrome binary
chrome_binary_path = "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"  # Update this with the correct path to your Chrome binary

# Set the URL of the Yelp page you want to scrape
url = "https://www.yelp.com/biz/gelati-celesti-virginia-beach-2"

# Set the options for Chrome
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")  # Run Chrome in headless mode, comment this line if you want to see the browser window
chrome_options.binary_location = chrome_binary_path

# Create the ChromeDriver service
service = Service(chromedriver_path)

# Create the ChromeDriver instance
driver = webdriver.Chrome(service=service, options=chrome_options)

# Load the Yelp page
driver.get(url)

# Wait for the reviews to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".border-color--default__09f24__NPAKY")))

# Extract the page source and pass it to BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")

# Find all review elements on the page
reviews = soup.find_all("div", class_="border-color--default__09f24__NPAKY")

# Create empty lists to store the extracted data
review_texts = []
ratings = []
dates = []

# Iterate over each review element
for review in reviews:
    # Extract the review text
    review_text_element = review.find("div", class_="margin-b2__09f24__CEMjT.border-color--default__09f24__NPAKY")
    review_text = review_text_element.get_text() if review_text_element else ""
    review_texts.append(review_text.strip())

    # Extract the rating
    rating_element = review.find("div", class_="five-stars__09f24__mBKym.five-stars--regular__09f24__DgBNj.display--inline-block__09f24__fEDiJ.border-color--default__09f24__NPAKY")
    rating = rating_element.get("aria-label") if rating_element else ""
    ratings.append(rating)

    # Extract the date
    date_element = review.find("span", class_="css-chan6m")
    date = date_element.get_text() if date_element else ""
    dates.append(date.strip())

# Create a DataFrame from the extracted data
data = {
    "Review Text": review_texts,
    "Rating": ratings,
    "Date": dates
}
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

# Get the current working directory
path = os.getcwd()

# Save the DataFrame as a CSV file
csv_path = os.path.join(path, "yelp_reviews.csv")
df.to_csv(csv_path, index=False)

# Close the ChromeDriver instance
driver.quit()

Here are some additional screenshots. I also just noticed that some information was printed in the Date column of the CSV file, but the values seem randomly placed and not all of them are actually dates.

(Screenshots omitted.)

Answer 1

Score: 0


I have rewritten the code to do the same thing using requests, since Selenium has unnecessary overhead here.

from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

restaurant_url = 'https://www.yelp.com/biz/gelati-celesti-virginia-beach-2'
headers = {
    'host': 'www.yelp.com'
}

restaurant_page = bs(requests.get(restaurant_url, headers=headers).text, 'lxml')
biz_id = restaurant_page.find('meta', {'name': 'yelp-biz-id'}).get('content')
review_count = int(restaurant_page.find('a', {'href': '#reviews'}).text.split(' ')[0])

data = []

for review_page in range(0, review_count, 10):  # 10 reviews per page
    review_api_url = f'https://www.yelp.com/biz/{biz_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={review_page}'

    for review in requests.get(review_api_url, headers=headers).json()['reviews']:
        data.append({
            'Review Text': review['comment']['text'],
            'Rating': review['rating'],
            'Date': review['localizedDate']
        })
        print(data[-1])

pd.DataFrame(data).to_csv('Yelp Review.csv', index=None)

In this code, I am getting the business id (biz-id) and the total number of reviews from the restaurant page, then using them with Yelp's review-feed API to fetch all the reviews, saving them in a CSV at the end.
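The paging step can be sketched in isolation. `page_offsets` is a hypothetical helper name (not from the code above), but it shows the arithmetic that `range(0, review_count, 10)` relies on when building the `start=` query parameter:

```python
def page_offsets(review_count, page_size=10):
    """Offsets for the `start=` query parameter, one per page of reviews."""
    return list(range(0, review_count, page_size))

# 47 reviews at 10 per page means 5 requests, starting at these offsets:
print(page_offsets(47))  # [0, 10, 20, 30, 40]
```

The last page simply returns fewer than `page_size` reviews, so no special-case handling is needed for a partial final page.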

Sample output of the saved CSV (screenshot omitted).

huangapple
  • Posted on 2023-06-30 01:47:17
  • Please keep this link when reposting: https://go.coder-hub.com/76583472.html