How to correctly get the links from a webpage using Selenium and BeautifulSoup?
Question

Sorry, but this question may not have a simple one-line answer: it touches on the site's structure and the technical details of web scraping, and there may be more than one problem in this code. If you want to find the recipe links on the https://www.foodnetwork.com/recipes page and scrape the title, ingredients, and instructions from each recipe page, finally writing them to an output file named "recipe_output.txt", you will need to go through the code carefully to track the problems down.
Here are some suggestions on the issues you mentioned:
- Check the page structure: first, make sure the site's structure has not changed in a way that stops your code from working. A change in the site's structure can break your selectors.
- Check the selectors: when looking for the recipe links you used the tag a with class_="m-MediaBlock__a-HeadlineText". Use your browser's developer tools to inspect the page source and confirm whether that selector actually matches the link elements. If the site's structure has changed, you may need to update the selector.
- Wait for the page to load: make sure the page has fully loaded before you try to find the links. You can use WebDriver's wait facilities to ensure that all elements on the page have loaded.
- Exception handling: if an exception occurs while scraping, for example a page-load timeout or an element that cannot be found, add appropriate exception handling so the whole script does not crash.
- Output-file issues: make sure the output file path is correct and that you have permission to write to it. Add some debug output to check whether the file writes succeed.
- Headless browser settings: your code runs Chrome in headless mode; make sure that suits your task. Some sites detect headless browsers and take anti-scraping measures.
- Logging and debugging: add logging and debug statements so you can follow the script's execution and see whether each step runs as expected.
In the end this code will probably need repeated debugging and testing to verify that it is correct. You can step through it, checking that each step behaves as expected, to pin down where the problem is. I hope these suggestions help.
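The exception-handling suggestion above can be sketched as a small wrapper; this is a hypothetical helper (the names scrape_all and fetch are not from the original code), shown only to illustrate the pattern of recording failures instead of crashing:

```python
# Sketch of per-URL exception handling: one bad page is recorded
# instead of killing the whole run. `fetch` stands in for whatever
# per-recipe scraping function you write.
def scrape_all(urls, fetch):
    results, failures = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:  # e.g. TimeoutException, NoSuchElementException
            failures.append((url, repr(exc)))
    return results, failures
```

With Selenium you would typically catch the specific exception types from selenium.common.exceptions rather than a bare Exception.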
English:
Sorry if this isn't phrased correctly, but basically I'm trying to find the recipe links from https://www.foodnetwork.com/recipes and scrape the title, ingredients and instructions from each recipe page, then write it into an output file called "recipe_output.txt". Here is how I've attempted it:
Set up the webdriver:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # Run the browser in headless mode (without a GUI)
driver = webdriver.Chrome(options=options)
Define the URL of the recipe page:
listing_url = "https://www.foodnetwork.com/recipes"
Load the page with the webdriver:
driver.get(listing_url)
Get the page's content:
html_content = driver.execute_script("return document.documentElement.outerHTML")
Parse the HTML content with BeautifulSoup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "html.parser")
Here is where the issue is, when I try to find the recipe links:
recipe_links = soup.find_all("a", class_="m-MediaBlock__a-HeadlineText")
print("Found {} recipe links".format(len(recipe_links)))
Every time I try to execute, I get the output "Found 0 recipe links" and I am unsure why.
Here are how the elements for the links to each recipe on the recipes page are formatted:
<h3 class="m-MediaBlock__a-Headline">
<a href="//www.foodnetwork.com/recipes/food-network-kitchen/salad-stuffed-peppers-9970168">
<span class="m-MediaBlock__a-HeadlineText">Salad-Stuffed Peppers</span>
</a>
</h3>
I thought I would find all recipes with the "a" tag and the "m-MediaBlock__a-HeadlineText" class in order to get the links, but that is not right.
I've tried other things like using the "a" tag with the "m-MediaBlock__a-Headline" class or the "span" tag with the "m-MediaBlock__a-HeadlineText" class, but none of that works either. Please, if anyone could help me figure out what I'm doing wrong I would very much appreciate it.
Here is the rest of the code for reference:
import re

with open("recipe_output.txt", mode="w", encoding="utf-8") as file:
    for link in recipe_links:
        recipe_url = "https:" + link["href"]
        driver.get(recipe_url)
        recipe_html_content = driver.execute_script("return document.documentElement.outerHTML")
        recipe_soup = BeautifulSoup(recipe_html_content, "html.parser")

        title_element = recipe_soup.find("span", class_="o-AssetTitle__a-HeadlineText")
        if title_element is not None:
            recipe_title = title_element.text.strip()
            file.write("Title: " + recipe_title + "\n")

        ingredient_elements = recipe_soup.find_all("p", class_="o-Ingredients__a-Ingredient")
        ingredients = []
        for ingredient_element in ingredient_elements:
            ingredient_name_element = ingredient_element.find("span", class_="o-Ingredients__a-Ingredient--CheckboxLabel")
            if ingredient_name_element is not None:
                ingredient_name = ingredient_name_element.text.strip()
                # Clean the ingredient name
                ingredient_name = re.sub(r"\s*\xa0\s*", " ", ingredient_name)  # Remove "\xa0" characters
                if ingredient_name != "Deselect All":
                    ingredients.append(ingredient_name)

        file.write("Ingredients:\n")
        for ingredient in ingredients:
            file.write("- {}\n".format(ingredient))

        instructions_elements = recipe_soup.find_all("li", class_="o-Method__m-Step")
        instructions = [re.sub(r"\s*\xa0\s*", " ", instruction.text.strip()) for instruction in instructions_elements]

        file.write("Instructions:\n")
        for i, instruction in enumerate(instructions, start=1):
            file.write("{}. {}\n".format(i, instruction))

        file.write("----------\n")

# Quit the webdriver
driver.quit()
I should also add that nothing is being written into the output file.
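As a side note on the cleanup step in the code above: the re.sub(r"\s*\xa0\s*", " ", ...) call collapses non-breaking spaces (\xa0), which are common in scraped HTML, together with any surrounding whitespace into single spaces. A standalone sketch with made-up input:

```python
import re

# "\xa0" is a non-breaking space, frequently found in scraped recipe text
raw = "1\xa0cup\xa0 sugar"
cleaned = re.sub(r"\s*\xa0\s*", " ", raw)
print(cleaned)  # 1 cup sugar
```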
Answer 1
Score: 1
That class is applied not to the "a" tag, but to a "span" tag. I changed one line in your code and got 64 links (which I think is what you're looking for):
recipe_links = soup.find_all("span", class_="m-MediaBlock__a-HeadlineText")
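One caveat worth adding to this answer (an editorial note, not part of the original): find_all("span", ...) returns the span elements, which carry no href attribute; the URL lives on the enclosing a tag, so the later link["href"] lookup needs to step up to the parent. Against the markup shown in the question, a minimal sketch of two ways to get the hrefs:

```python
from bs4 import BeautifulSoup

# The structure of one listing entry, copied from the question
html = """
<h3 class="m-MediaBlock__a-Headline">
  <a href="//www.foodnetwork.com/recipes/food-network-kitchen/salad-stuffed-peppers-9970168">
    <span class="m-MediaBlock__a-HeadlineText">Salad-Stuffed Peppers</span>
  </a>
</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: find the spans, then step up to the enclosing <a> for the href
spans = soup.find_all("span", class_="m-MediaBlock__a-HeadlineText")
hrefs = [s.find_parent("a")["href"] for s in spans]

# Option 2: a CSS selector that targets the <a> inside the headline directly
anchors = soup.select("h3.m-MediaBlock__a-Headline a")
same_hrefs = [a["href"] for a in anchors]
```

Either way the hrefs come back protocol-relative (starting with //), so the "https:" prefix added in the question's loop is still needed.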