Web scraping returns incorrect results

Question


When I scrape the website, the total number of results is correct, but many of them are blank and some contain incorrect data.

import scrapy

class AudibleSpider(scrapy.Spider):
    name = 'audible'
    allowed_domains = ['www.audible.com']
    start_urls = ['https://www.audible.com/search/']

    def parse(self, response):
        # Getting the box that contains all the info we want (title, author, length)
        product_container = response.xpath('//div[@class="adbl-impression-container "]//ul')

        # Looping through each product listed in the product_container box
        for product in product_container:
            book_title = product.xpath('.//h3[contains(@class, "bc-heading")]/a/text()').get()
            book_author = product.xpath('.//li[contains(@class, "authorLabel")]/span/a/text()').getall()
            book_length = product.xpath('.//li[contains(@class, "runtimeLabel")]/span/text()').get()

            # Return data extracted
            yield {
                'title': book_title,
                'author': book_author,
                'length': book_length,
            }
            

        pagination = response.xpath('//ul[contains(@class, "pagingElements")]')
        next_page_url = pagination.xpath('.//span[contains(@class, "nextButton")]/a/@href').get()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)

I expect the result to contain the title, author, and length of every audiobook on every page.

The actual result is shown in this screenshot: https://i.stack.imgur.com/st2lm.png

Answer 1

Score: 1


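If you use a more specific selector for your product containers, you will get the results you are looking for. The original XPath ends in `//ul`, which matches every `<ul>` nested anywhere inside the container, not just the one that holds each product's metadata. That is the likely source of the blank and repeated rows.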

For example:

import scrapy


class AudibleSpider(scrapy.Spider):
    name = 'audible'
    allowed_domains = ['www.audible.com']
    start_urls = ['https://www.audible.com/search/']

    def parse(self, response):
        # Select only the <ul> that holds each product's metadata, rather than
        # every <ul> nested inside the impression container
        product_container = response.xpath(
            '//*[@class="adbl-impression-container "]//ul'
            '//div[contains(@class, "bc-col-responsive")]/span//ul'
        )

        # Loop through each product's metadata list
        for product in product_container:
            book_title = product.xpath('.//h3[contains(@class, "bc-heading")]/a/text()').get()
            book_author = product.xpath('.//li[contains(@class, "authorLabel")]/span/a/text()').getall()
            book_length = product.xpath('.//li[contains(@class, "runtimeLabel")]/span/text()').get()

            # Yield the extracted data
            yield {
                'title': book_title,
                'author': book_author,
                'length': book_length,
            }

        # Follow the "next page" link, if there is one
        pagination = response.xpath('//ul[contains(@class, "pagingElements")]')
        next_page_url = pagination.xpath('.//span[contains(@class, "nextButton")]/a/@href').get()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)
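
To see why the broad selector produces blank and repeated rows, here is a minimal sketch that runs without Scrapy. It uses the standard library's `xml.etree.ElementTree` on a hypothetical, heavily simplified fragment; the real Audible markup is more complex, and the `titles` helper and the "Book A" data are made up for illustration. ElementTree's XPath subset has no `contains()`, so the sketch matches exact class values instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not Audible's real HTML):
# one product whose metadata <ul> sits beside an unrelated nested <ul>.
HTML = """
<body>
  <div class="adbl-impression-container ">
    <ul>
      <li>
        <div class="bc-col-responsive">
          <span>
            <ul>
              <li><h3 class="bc-heading"><a>Book A</a></h3></li>
              <li class="runtimeLabel"><span>Length: 5 hrs</span></li>
            </ul>
          </span>
        </div>
        <ul>
          <li>Add to cart</li>
        </ul>
      </li>
    </ul>
  </div>
</body>
"""

root = ET.fromstring(HTML)


def titles(uls):
    """Return the first title found under each <ul>, or None if there is none."""
    found = []
    for ul in uls:
        link = ul.find(".//h3[@class='bc-heading']/a")
        found.append(link.text if link is not None else None)
    return found


# Broad selector, as in the question: every <ul> under the container.
broad = root.findall(".//div[@class='adbl-impression-container ']//ul")

# Narrow selector, as in the answer: only the metadata <ul>.
narrow = root.findall(
    ".//div[@class='adbl-impression-container ']//ul"
    "//div[@class='bc-col-responsive']/span//ul"
)

print(titles(broad))   # ['Book A', 'Book A', None]
print(titles(narrow))  # ['Book A']
```

The broad selector yields three items for a single product: the outer list and the metadata list both resolve to the same title, and the unrelated list yields nothing. The narrow selector yields exactly one item per product.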

Posted by huangapple on June 6, 2023 at 11:21:00
Source: https://go.coder-hub.com/76411253.html