July 17, 2023, 12:08:29

How to correctly get the links from a webpage using Selenium and BeautifulSoup?

Question

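Sorry, this question may not be resolvable with a single short answer, because it depends on the structure of the website and on the technical details of web scraping, and the code may contain more than one problem. If you want to find the recipe links on https://www.foodnetwork.com/recipes, scrape the title, ingredients, and instructions from each recipe page, and write them to an output file named "recipe_output.txt", you will need to go through the code carefully.

Here are some suggestions for the problem you describe:

Check the page structure: First, make sure the structure of the website has not changed in a way that stops your code from working. A change in the markup can break your selectors.
Check the selectors: When looking for recipe links, you combine the a tag with class_="m-MediaBlock__a-HeadlineText". Use the browser's developer tools to inspect the page source and confirm that this selector actually matches the link elements. If the markup has changed, you may need to update the selector.
Wait for the page to load: Make sure the page has finished loading before you look for links. WebDriver's wait functionality lets you block until the elements you need are present.
Handle exceptions: If something goes wrong during scraping, such as a page load timeout or a missing element, catch the exception so that one failure does not crash the whole script.
Check the output file: Make sure the output path is correct and that you have permission to write to it. Some debug output will show whether the writes succeed.
Headless browser settings: Your code runs Chrome in headless mode, so confirm that this suits the task. Some websites detect headless browsers and apply anti-scraping measures.
Logging and debugging: Add logging and debug statements so you can follow the script's execution and see whether each step behaves as expected.

In the end, this code will probably need repeated debugging and testing before it works correctly. Step through it and check each stage against what you expect in order to locate the problem. Hopefully these suggestions help.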

Original question (English):

Sorry if this isn't phrased correctly, but basically I'm trying to find the recipe links from https://www.foodnetwork.com/recipes and scrape the title, ingredients and instructions from each recipe page, then write it into an output file called "recipe_output.txt". Here is how I've attempted it:

Set up the webdriver:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # Run the browser in headless mode (without a GUI)
driver = webdriver.Chrome(options=options)

Define the URL of the recipe page:

listing_url = "https://www.foodnetwork.com/recipes"

Load the page with the webdriver:

driver.get(listing_url)

Get the page's content:

html_content = driver.execute_script("return document.documentElement.outerHTML")

Parse the HTML content with BeautifulSoup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

The issue appears here, when I try to find the recipe links:

recipe_links = soup.find_all("a", class_="m-MediaBlock__a-HeadlineText")
print("Found {} recipe links".format(len(recipe_links)))

Every time I try to execute, I get the output "Found 0 recipe links" and I am unsure why.

Here is how the elements for the links to each recipe on the recipes page are formatted:

<h3 class="m-MediaBlock__a-Headline">
  <a href="//www.foodnetwork.com/recipes/food-network-kitchen/salad-stuffed-peppers-9970168">
    <span class="m-MediaBlock__a-HeadlineText">Salad-Stuffed Peppers</span>
  </a>
</h3>

I thought I would find all recipes with the "a" tag and the "m-MediaBlock__a-HeadlineText" class in order to get the links, but that is not right.

I've tried other things like using the "a" tag with the "m-MediaBlock__a-Headline" class or the "span" tag with the "m-MediaBlock__a-HeadlineText" class, but none of that works either. Please, if anyone could help me figure out what I'm doing wrong I would very much appreciate it.

Here is the rest of the code for reference:

import re

with open("recipe_output.txt", mode="w", encoding="utf-8") as file:
    for link in recipe_links:
        recipe_url = "https:" + link["href"]
        driver.get(recipe_url)
        recipe_html_content = driver.execute_script("return document.documentElement.outerHTML")
        recipe_soup = BeautifulSoup(recipe_html_content, "html.parser")

        title_element = recipe_soup.find("span", class_="o-AssetTitle__a-HeadlineText")
        if title_element is not None:
            recipe_title = title_element.text.strip()
            file.write("Title: " + recipe_title + "\n")

        ingredient_elements = recipe_soup.find_all("p", class_="o-Ingredients__a-Ingredient")
        ingredients = []
        for ingredient_element in ingredient_elements:
            ingredient_name_element = ingredient_element.find("span", class_="o-Ingredients__a-Ingredient--CheckboxLabel")
            if ingredient_name_element is not None:
                ingredient_name = ingredient_name_element.text.strip()

                # Clean the ingredient name: replace non-breaking spaces (\xa0) with regular spaces
                ingredient_name = re.sub(r"\s*\xa0\s*", " ", ingredient_name)
                if ingredient_name != "Deselect All":
                    ingredients.append(ingredient_name)

        file.write("Ingredients:\n")
        for ingredient in ingredients:
            file.write("- {}\n".format(ingredient))

        instructions_elements = recipe_soup.find_all("li", class_="o-Method__m-Step")
        instructions = [re.sub(r"\s*\xa0\s*", " ", instruction.text.strip()) for instruction in instructions_elements]
        file.write("Instructions:\n")
        for i, instruction in enumerate(instructions, start=1):
            file.write("{}. {}\n".format(i, instruction))

        file.write("----------\n")

# Quit the webdriver
driver.quit()

I should also add that nothing is being written into the output file.

Answer 1

Score: 1

That class is applied not to the "a" tag, but to a "span" tag. I changed one line in your code and got 64 links (which I think is what you're looking for):

recipe_links = soup.find_all("span", class_="m-MediaBlock__a-HeadlineText")

Note that these matches are span elements nested inside the a elements, so the href has to be read from each span's parent a tag, not from the span itself.
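To illustrate the point on the markup quoted in the question, here is a small sketch that runs against that single snippet rather than the live page. The names SAMPLE_HTML, LISTING_URL, and extract_recipe_urls are hypothetical, introduced only for this example. It shows that the original selector matches nothing, that the span selector does match, and one way to walk from each span up to its enclosing a tag to recover an absolute URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

LISTING_URL = "https://www.foodnetwork.com/recipes"

# Markup copied from the question; the real page repeats this block per recipe.
SAMPLE_HTML = """
<h3 class="m-MediaBlock__a-Headline">
  <a href="//www.foodnetwork.com/recipes/food-network-kitchen/salad-stuffed-peppers-9970168">
    <span class="m-MediaBlock__a-HeadlineText">Salad-Stuffed Peppers</span>
  </a>
</h3>
"""


def extract_recipe_urls(html, base_url=LISTING_URL):
    """Return absolute recipe URLs found via the headline spans (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for span in soup.find_all("span", class_="m-MediaBlock__a-HeadlineText"):
        # The class sits on the span; the href sits on the enclosing <a>.
        anchor = span.find_parent("a")
        if anchor is not None and anchor.get("href"):
            # urljoin resolves protocol-relative hrefs ("//host/path") using the base scheme.
            urls.append(urljoin(base_url, anchor["href"]))
    return urls


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# The selector from the question matches nothing: no <a> carries that class.
print(len(soup.find_all("a", class_="m-MediaBlock__a-HeadlineText")))  # prints 0
print(extract_recipe_urls(SAMPLE_HTML))
```

In the loop from the question, iterating over the URLs returned by a helper like this would stand in for building recipe_url from link["href"], since a span has no href attribute of its own. Whether the live page still uses these class names is not something this sketch can confirm.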
