How to correctly get the links from a webpage using Selenium and BeautifulSoup?

Question

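This question may not have a simple answer, because it depends on the structure of the site and on the details of the scraping technique, and the code may contain more than one problem. If the goal is to find the recipe links on https://www.foodnetwork.com/recipes, scrape the title, ingredients and instructions from each recipe page, and write them to an output file called "recipe_output.txt", the code needs to be checked carefully.

Here are some suggestions for the problem described:

  1. Check the page structure: First, make sure the structure of the site has not changed in a way that stops the code from working. Changes to the site's markup can break your selectors.

  2. Check the selector: To find the recipe links, the code looks for the "a" tag with class_="m-MediaBlock__a-HeadlineText". Use the browser's developer tools to inspect the page source and confirm that this selector actually matches the link elements. If the markup differs from what the code expects, update the selector.

  3. Wait for loading to finish: Make sure the page has fully loaded before looking for the links. WebDriver's wait functionality can be used to wait until the elements you need are present.

  4. Exception handling: If something fails during scraping, such as a page load timeout or a missing element, handle the exception so that the whole script does not crash.

  5. Output file: Make sure the output file path is correct and that you have permission to write to it. Adding some debug output helps confirm whether the writes succeed.

  6. Headless browser setting: The code runs Chrome in headless mode, so make sure that suits the task. Some sites detect headless browsers and apply anti-scraping measures.

  7. Logging and debugging: Add logging and debug statements to follow the script's execution and check whether each step behaves as expected.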

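Ultimately, this code will probably need repeated debugging and testing before it works correctly. Step through it and check that each stage produces what you expect in order to locate the problem.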
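
The exception-handling and logging suggestions could look roughly like the following. It is a sketch under stated assumptions: `scrape_all`, `scrape_one` and the logger name `recipe_scraper` are hypothetical names, and `scrape_one` stands for whatever function fetches and parses a single recipe page.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("recipe_scraper")  # hypothetical logger name


def scrape_all(urls, scrape_one):
    """Call scrape_one(url) for each URL, logging failures instead of crashing.

    Both names are hypothetical; scrape_one is any callable that returns the
    scraped data for one page or raises an exception on failure.
    """
    results = []
    for url in urls:
        try:
            results.append(scrape_one(url))
            log.info("scraped %s", url)
        except Exception:
            # Record the traceback, then carry on with the next URL.
            log.exception("failed to scrape %s", url)
    return results
```

Logging the number of URLs before the loop starts would also make a "0 links found" situation visible immediately, before any recipe page is requested.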


Sorry if this isn't phrased correctly, but basically I'm trying to find the recipe links from https://www.foodnetwork.com/recipes and scrape the title, ingredients and instructions from each recipe page, then write it into an output file called "recipe_output.txt". Here is how I've attempted it:

Set up the webdriver:

import re
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # Run the browser in headless mode (without a GUI)
driver = webdriver.Chrome(options=options)

Define the URL of the recipe page:

listing_url = "https://www.foodnetwork.com/recipes"

Load the page with the webdriver:

driver.get(listing_url)

Get the page's content:

html_content = driver.execute_script("return document.documentElement.outerHTML")

Parse the HTML content with BeautifulSoup:

soup = BeautifulSoup(html_content, "html.parser")

The issue appears when I try to find the recipe links:

recipe_links = soup.find_all("a", class_="m-MediaBlock__a-HeadlineText")
print("Found {} recipe links".format(len(recipe_links)))

Every time I try to execute, I get the output "Found 0 recipe links" and I am unsure why.

Here is how the elements for the links to each recipe on the recipes page are formatted:

<h3 class="m-MediaBlock__a-Headline">
    <a href="//www.foodnetwork.com/recipes/food-network-kitchen/salad-stuffed-peppers-9970168">
      <span class="m-MediaBlock__a-HeadlineText">Salad-Stuffed Peppers</span>
      
    </a>
  </h3>

I thought I would find all recipes with the "a" tag and the "m-MediaBlock__a-HeadlineText" class in order to get the links, but that is not right.

I've tried other things like using the "a" tag with the "m-MediaBlock__a-Headline" class or the "span" tag with the "m-MediaBlock__a-HeadlineText" class, but none of that works either. Please, if anyone could help me figure out what I'm doing wrong I would very much appreciate it.

Here is the rest of the code for reference:

with open("recipe_output.txt", mode="w", encoding="utf-8") as file:
    for link in recipe_links:
        recipe_url = "https:" + link["href"]
        driver.get(recipe_url)
        recipe_html_content = driver.execute_script("return document.documentElement.outerHTML")
        recipe_soup = BeautifulSoup(recipe_html_content, "html.parser")
        
        title_element = recipe_soup.find("span", class_="o-AssetTitle__a-HeadlineText")
        if title_element is not None:
            recipe_title = title_element.text.strip()
            file.write("Title: " + recipe_title + "\n")  # newline keeps the title on its own line
            
        ingredient_elements = recipe_soup.find_all("p", class_="o-Ingredients__a-Ingredient")
        ingredients = []
        for ingredient_element in ingredient_elements:
            ingredient_name_element = ingredient_element.find("span", class_="o-Ingredients__a-Ingredient--CheckboxLabel")
            if ingredient_name_element is not None:
                ingredient_name = ingredient_name_element.text.strip()
                
                # Clean the ingredient name
                ingredient_name = re.sub(r"\s*\xa0\s*", " ", ingredient_name)  # Replace non-breaking spaces (\xa0) with a regular space
                if ingredient_name != "Deselect All":
                    ingredients.append(ingredient_name)
                    
        file.write("Ingredients:\n")
        for ingredient in ingredients:
            file.write("- {}\n".format(ingredient))

        instructions_elements = recipe_soup.find_all("li", class_="o-Method__m-Step")
        instructions = [re.sub(r"\s*\xa0\s*", " ", instruction.text.strip()) for instruction in instructions_elements]
        file.write("Instructions:\n")
        for i, instruction in enumerate(instructions, start=1):
            file.write("{}. {}\n".format(i, instruction))
            
        file.write("----------\n")

# Quit the webdriver
driver.quit()

I should also add that nothing is being written into the output file.

Answer 1

Score: 1


That class is applied not to the "a" tag, but to a "span" tag. I changed one line in your code and got 64 links (which I think is what you're looking for):

recipe_links = soup.find_all("span", class_="m-MediaBlock__a-HeadlineText")
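
Note that the matched "span" elements carry no "href" attribute. In the markup shown in the question, the URL sits on the enclosing "a" tag, so `link["href"]` in the rest of the code would raise a KeyError on a span. Below is a minimal sketch of recovering the URLs, assuming the page still uses that markup. The helper name `extract_recipe_urls` and the inline `SAMPLE_HTML` are illustrative and not part of the original code.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup copied from the question, used here in place of the live page.
SAMPLE_HTML = """
<h3 class="m-MediaBlock__a-Headline">
  <a href="//www.foodnetwork.com/recipes/food-network-kitchen/salad-stuffed-peppers-9970168">
    <span class="m-MediaBlock__a-HeadlineText">Salad-Stuffed Peppers</span>
  </a>
</h3>
"""


def extract_recipe_urls(html, base_url="https://www.foodnetwork.com/recipes"):
    """Return absolute recipe URLs found via the headline spans (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for span in soup.find_all("span", class_="m-MediaBlock__a-HeadlineText"):
        # The href lives on the <a> that wraps the span, not on the span itself.
        anchor = span.find_parent("a")
        if anchor is not None and anchor.has_attr("href"):
            # urljoin resolves protocol-relative hrefs ("//host/path") against base_url.
            urls.append(urljoin(base_url, anchor["href"]))
    return urls


print(extract_recipe_urls(SAMPLE_HTML))
```

With this approach, the loop in the question could iterate over the returned list and pass each entry straight to `driver.get`, instead of building the URL with `"https:" + link["href"]`.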

huangapple
  • Published on July 17, 2023 at 12:08:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76701436.html