Web scraping returns incorrect results

Question

When I scrape the website, the total number of results is correct, but many entries are blank and some of the data is incorrect.

    import scrapy


    class AudibleSpider(scrapy.Spider):
        name = 'audible'
        allowed_domains = ['www.audible.com']
        start_urls = ['https://www.audible.com/search/']

        def parse(self, response):
            # Getting the box that contains all the info we want (title, author, length)
            product_container = response.xpath('//div[@class="adbl-impression-container "]//ul')

            # Looping through each product listed in the product_container box
            for product in product_container:
                book_title = product.xpath('.//h3[contains(@class, "bc-heading")]/a/text()').get()
                book_author = product.xpath('.//li[contains(@class, "authorLabel")]/span/a/text()').getall()
                book_length = product.xpath('.//li[contains(@class, "runtimeLabel")]/span/text()').get()

                # Return data extracted
                yield {
                    'title': book_title,
                    'author': book_author,
                    'length': book_length,
                }

            # Follow the "next page" link, if there is one
            pagination = response.xpath('//ul[contains(@class, "pagingElements")]')
            next_page_url = pagination.xpath('.//span[contains(@class, "nextButton")]/a/@href').get()
            if next_page_url:
                yield response.follow(url=next_page_url, callback=self.parse)

**Expected:** the title, author, and length of every audiobook on every page.

**Result:**
[1]: https://i.stack.imgur.com/st2lm.png

Answer 1

Score: 1

If you use a more specific selector for your product containers, you will get the results you are looking for. The trailing `//ul` in the original XPath matches every `ul` inside the container, not only the one that holds a single product's details, which is the likely source of the blank and incorrect rows.

For example:

    def parse(self, response):
        # Getting the box that contains all the info we want (title, author, length)
        product_container = response.xpath('//*[@class="adbl-impression-container "]//ul//div[contains(@class, "bc-col-responsive")]/span//ul')

        # Looping through each product listed in the product_container box
        for product in product_container:
            book_title = product.xpath('.//h3[contains(@class, "bc-heading")]/a/text()').get()
            book_author = product.xpath('.//li[contains(@class, "authorLabel")]/span/a/text()').getall()
            book_length = product.xpath('.//li[contains(@class, "runtimeLabel")]/span/text()').get()

            # Return data extracted
            yield {
                'title': book_title,
                'author': book_author,
                'length': book_length,
            }

        # Follow the "next page" link, if there is one
        pagination = response.xpath('//ul[contains(@class, "pagingElements")]')
        next_page_url = pagination.xpath('.//span[contains(@class, "nextButton")]/a/@href').get()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)
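
To see why a broad `//ul` selector can yield blank and mismatched rows, here is a minimal sketch on made-up markup. It is not Audible's real HTML: the `container`, `product`, `title`, `author`, and `badge` class names are assumptions for illustration. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy so it runs anywhere; ElementTree supports only a subset of XPath, so the class checks are exact matches rather than `contains()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real page):
# one outer <ul> lists the products, and each product has a nested <ul>
# with its details. The first product also has an unrelated nested <ul>.
SAMPLE = """
<div class="container">
  <ul>
    <li>
      <div class="product">
        <span>
          <ul>
            <li class="title">Book A</li>
            <li class="author">Author A</li>
          </ul>
        </span>
        <ul>
          <li class="badge">Bestseller</li>
        </ul>
      </div>
    </li>
    <li>
      <div class="product">
        <span>
          <ul>
            <li class="title">Book B</li>
            <li class="author">Author B</li>
          </ul>
        </span>
      </div>
    </li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)


def titles(path):
    """Return the first title found under each <ul> matched by `path`."""
    return [ul.findtext(".//li[@class='title']") for ul in root.findall(path)]


# Broad: every <ul> under the container, including the outer list
# and the unrelated badge list.
broad = titles(".//ul")

# Specific: only the <ul> nested directly inside each product's <span>.
specific = titles(".//div[@class='product']/span/ul")

print(broad)     # ['Book A', 'Book A', None, 'Book B']
print(specific)  # ['Book A', 'Book B']
```

With the broad path, the outer list repeats the first product's title and the badge list produces an empty row. The narrower path returns exactly one row per product, which is the same idea as the more specific selector in the answer.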

huangapple
  • Published on June 6, 2023, 11:21:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76411253.html