Web scraping returns incorrect results
Question
When I scrape the website, the total count is correct, but many of the rows are blank and some contain incorrect data.
import scrapy


class AudibleSpider(scrapy.Spider):
    name = 'audible'
    allowed_domains = ['www.audible.com']
    start_urls = ['https://www.audible.com/search/']

    def parse(self, response):
        # Getting the box that contains all the info we want (title, author, length)
        product_container = response.xpath('//div[@class="adbl-impression-container "]//ul')

        # Looping through each product listed in the product_container box
        for product in product_container:
            book_title = product.xpath('.//h3[contains(@class, "bc-heading")]/a/text()').get()
            book_author = product.xpath('.//li[contains(@class, "authorLabel")]/span/a/text()').getall()
            book_length = product.xpath('.//li[contains(@class, "runtimeLabel")]/span/text()').get()

            # Return data extracted
            yield {
                'title': book_title,
                'author': book_author,
                'length': book_length,
            }

        pagination = response.xpath('//ul[contains(@class, "pagingElements")]')
        next_page_url = pagination.xpath('.//span[contains(@class, "nextButton")]/a/@href').get()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)
**Expected:** a title, author, and length for each audiobook on every page.

**Result:**
[1]: https://i.stack.imgur.com/st2lm.png
Answer 1
Score: 1
If you use a more specific selector for your product containers, you will get the results you are looking for.
For example:
def parse(self, response):
    # Getting the box that contains all the info we want (title, author, length)
    product_container = response.xpath('//*[@class="adbl-impression-container "]//ul//div[contains(@class, "bc-col-responsive")]/span//ul')

    # Looping through each product listed in the product_container box
    for product in product_container:
        book_title = product.xpath('.//h3[contains(@class, "bc-heading")]/a/text()').get()
        book_author = product.xpath('.//li[contains(@class, "authorLabel")]/span/a/text()').getall()
        book_length = product.xpath('.//li[contains(@class, "runtimeLabel")]/span/text()').get()

        # Return data extracted
        yield {
            'title': book_title,
            'author': book_author,
            'length': book_length,
        }

    pagination = response.xpath('//ul[contains(@class, "pagingElements")]')
    next_page_url = pagination.xpath('.//span[contains(@class, "nextButton")]/a/@href').get()
    if next_page_url:
        yield response.follow(url=next_page_url, callback=self.parse)
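
To illustrate why a broad `//ul` container selector can yield blank and merged rows, here is a minimal sketch. The markup is invented for illustration and is not Audible's actual HTML, and the `actions` list is a hypothetical stand-in for any list on the page that holds no book data. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy so that it runs without extra packages. ElementTree supports only a subset of XPath, so exact class matches stand in for `contains()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real page):
# one outer <ul> wraps all products, each product keeps its details in a
# nested <ul> under div/span, and an unrelated "actions" <ul> sits beside it.
PRODUCT = """
<li>
  <div class="bc-col-responsive"><span>
    <ul>
      <li><h3 class="bc-heading"><a>{title}</a></h3></li>
      <li class="authorLabel"><span><a>{author}</a></span></li>
      <li class="runtimeLabel"><span>{length}</span></li>
    </ul>
  </span></div>
  <ul class="actions"><li>Add to cart</li></ul>
</li>
"""

HTML = (
    '<div class="adbl-impression-container "><ul>'
    + PRODUCT.format(title="Book One", author="Author A", length="Length: 5 hrs")
    + PRODUCT.format(title="Book Two", author="Author B", length="Length: 7 hrs")
    + "</ul></div>"
)


def extract(node):
    """Mirror the spider's fields: first title, all authors, first length."""
    title = node.find(".//h3[@class='bc-heading']/a")
    authors = node.findall(".//li[@class='authorLabel']/span/a")
    length = node.find(".//li[@class='runtimeLabel']/span")
    return {
        "title": title.text if title is not None else None,
        "author": [a.text for a in authors],
        "length": length.text if length is not None else None,
    }


root = ET.fromstring(HTML)

# Broad selector: every <ul> below the container, nested or not.
broad = [extract(ul) for ul in root.findall(".//ul")]

# Specific selector: only the <ul> that holds one product's details.
specific = [
    extract(ul)
    for ul in root.findall(".//div[@class='bc-col-responsive']/span/ul")
]

print(len(broad), "records from the broad selector")
print(len(specific), "records from the specific selector")
for record in broad:
    print(record)
```

In this toy fragment the broad selector also matches the outer list, which merges the authors of every book into one record, and the `actions` lists, which contain no title, author, or length. The narrower selector returns one record per book.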