June 6, 2023 11:21:00

Web scraping returns incorrect results

Question

When I scrape the website, most of the output is correct, but it also contains many blank rows and some incorrect data.

import scrapy


class AudibleSpider(scrapy.Spider):
    name = 'audible'
    allowed_domains = ['www.audible.com']
    start_urls = ['https://www.audible.com/search/']

    def parse(self, response):
        # Getting the box that contains all the info we want (title, author, length)
        product_container = response.xpath('//div[@class="adbl-impression-container "]//ul')
        # Looping through each product listed in the product_container box
        for product in product_container:
            book_title = product.xpath('.//h3[contains(@class, "bc-heading")]/a/text()').get()
            book_author = product.xpath('.//li[contains(@class, "authorLabel")]/span/a/text()').getall()
            book_length = product.xpath('.//li[contains(@class, "runtimeLabel")]/span/text()').get()
            # Return data extracted
            yield {
                'title': book_title,
                'author': book_author,
                'length': book_length,
            }

        pagination = response.xpath('//ul[contains(@class, "pagingElements")]')
        next_page_url = pagination.xpath('.//span[contains(@class, "nextButton")]/a/@href').get()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)

**Expected: a title, author, and length for each audiobook on every page.**

**Actual result:**
[1]: https://i.stack.imgur.com/st2lm.png

Answer 1

Score: 1

If you use a more specific selector for your product containers you will achieve the results you are looking for.

For example:

    def parse(self, response):
        # Getting the box that contains all the info we want (title, author, length)
        product_container = response.xpath('//*[@class="adbl-impression-container "]//ul//div[contains(@class, "bc-col-responsive")]/span//ul')
        # Looping through each product listed in the product_container box
        for product in product_container:
            book_title = product.xpath('.//h3[contains(@class, "bc-heading")]/a/text()').get()
            book_author = product.xpath('.//li[contains(@class, "authorLabel")]/span/a/text()').getall()
            book_length = product.xpath('.//li[contains(@class, "runtimeLabel")]/span/text()').get()
            # Return data extracted
            yield {
                'title': book_title,
                'author': book_author,
                'length': book_length,
            }

        pagination = response.xpath('//ul[contains(@class, "pagingElements")]')
        next_page_url = pagination.xpath('.//span[contains(@class, "nextButton")]/a/@href').get()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)
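
To see why a broad `//ul` container selector can produce blank and incorrect rows, here is a minimal sketch. It makes several assumptions: the markup is made up (it is not Audible's real HTML), the class names `container`, `results`, `badges`, and `info` are hypothetical, and it uses the standard library's `xml.etree.ElementTree` instead of Scrapy so it runs without a crawl. ElementTree supports only a subset of XPath, so the predicates use exact class matches rather than `contains()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, NOT Audible's real HTML: an outer results list wraps
# each product, and one product also carries an unrelated "badges" list.
PAGE = """
<body>
  <div class="container">
    <ul class="results">
      <li>
        <ul class="badges"><li>New</li></ul>
        <div class="info"><span>
          <ul>
            <li><h3><a>Book A</a></h3></li>
            <li class="authorLabel"><span><a>Author A</a></span></li>
            <li class="runtimeLabel"><span>Length: 1 hr</span></li>
          </ul>
        </span></div>
      </li>
      <li>
        <div class="info"><span>
          <ul>
            <li><h3><a>Book B</a></h3></li>
            <li class="authorLabel"><span><a>Author B</a></span></li>
            <li class="runtimeLabel"><span>Length: 2 hrs</span></li>
          </ul>
        </span></div>
      </li>
    </ul>
  </div>
</body>
"""


def extract(ul):
    """Pull title, authors, and length from one candidate container."""
    title = ul.find(".//h3/a")
    length = ul.find(".//li[@class='runtimeLabel']/span")
    return {
        "title": title.text if title is not None else None,
        "author": [a.text for a in ul.findall(".//li[@class='authorLabel']/span/a")],
        "length": length.text if length is not None else None,
    }


root = ET.fromstring(PAGE)

# Broad selector: every <ul> under the container, including wrapper and badge lists.
broad = [extract(ul) for ul in root.findall(".//div[@class='container']//ul")]

# Narrow selector: only the <ul> nested under each product's info block.
narrow = [extract(ul) for ul in root.findall(".//div[@class='info']/span//ul")]

for label, rows in (("broad", broad), ("narrow", narrow)):
    print(label, len(rows))
    for row in rows:
        print("  ", row)
```

In this made-up page the broad selector yields four rows for two books: the outer wrapper list reports the first book's title with both authors merged, and the badges list yields an empty row. The narrow selector yields exactly one row per book. Whether Audible's real markup nests lists in the same way is an assumption here, so it is worth checking the page source before relying on any particular selector.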
