Not getting all reviews while scraping reviews on a Flipkart product?

Question


I am new to data science and learning how to do web scraping.
I am trying to scrape the reviews of a particular product, but I am not able to scrape all of them.

My code now runs with no errors. At first it raised the error below, and then I changed the code.

Old Code:

```python
def perform_scraping(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, "html.parser")
    all_div = soup.find_all("div", class_="_1AtVbE col-12-12")
    del all_div[0]
    for review in range(1, len(all_div)):
        rating = all_div[review].find("div", class_="_3LWZlK _1BLPMq").text
        review_title = all_div[review].find("p", "_2-N8zT").text
        all_reviews.append((rating, review_title))

all_reviews = []
for page_number in range(1, 2):
    i = page_number
    link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={i}"
    perform_scraping(link)
all_reviews
```

Error:

```
AttributeError                            Traceback (most recent call last)
Cell In[3], line 15
     13 i = page_number
     14 link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={i}"
---> 15 perform_scraping(link)
     16 all_reviews

Cell In[3], line 7, in perform_scraping(link)
      5 del all_div[0]
      6 for review in range(1, len(all_div)):
----> 7 rating = all_div[review].find("div", class_="_3LWZlK _1BLPMq").text
      8 review_title = all_div[review].find("p", "_2-N8zT").text
      9 all_reviews.append((rating, review_title))

AttributeError: 'NoneType' object has no attribute 'text'
```
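From what I understand, the error happens because find() returns None when no matching element exists, and None has no .text attribute. A minimal stand-alone reproduction (no scraping involved):

```python
# bs4's find() returns None when no element matches the selector;
# calling .text on that None raises the AttributeError seen above.
elem = None  # what find() returns for a review card without a rating div
try:
    elem.text
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'text'
```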

I am sharing the updated code below; please help.

Updated Code:

```python
def perform_scraping(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, "html.parser")
    all_div = soup.find_all("div", class_="_1AtVbE col-12-12")
    del all_div[0]
    all_reviews = []
    for review in range(0, len(all_div)):
        rating_elem = all_div[review].find("div", class_="_3LWZlK _1BLPMq")
        review_title_elem = all_div[review].find("p", class_="_2-N8zT")
        if rating_elem is not None and review_title_elem is not None:
            rating = rating_elem.text
            review_title = review_title_elem.text
            all_reviews.append((rating, review_title))
    return all_reviews

all_reviews = []
for page_number in range(1, 2):
    link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={page_number}"
    all_reviews.extend(perform_scraping(link))
print(all_reviews)
```

Please help.

Thanks in advance.

I hope to hear from you soon.

Answer 1

Score: 0


Taking your example, you aren't traversing all the review pages. Your loop runs over range(1, 2), and since the upper bound of range is exclusive, it only visits page 1 — page 2, which you also want to scrape, is never requested.
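The exclusive upper bound is easy to check directly:

```python
# range() excludes its stop value, so range(1, 2) covers page 1 only,
# while range(1, 3) covers pages 1 and 2.
print(list(range(1, 2)))  # [1]
print(list(range(1, 3)))  # [1, 2]
```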

```python
def perform_scraping(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, "html.parser")
    all_div = soup.find_all("div", class_="_1AtVbE col-12-12")
    del all_div[0]
    all_reviews = []
    for review in range(0, len(all_div)):
        rating_elem = all_div[review].find("div", class_="_3LWZlK _1BLPMq")
        review_title_elem = all_div[review].find("p", class_="_2-N8zT")
        if rating_elem is not None and review_title_elem is not None:
            rating = rating_elem.text
            review_title = review_title_elem.text
            all_reviews.append((rating, review_title))
    return all_reviews

all_reviews = []
for page_number in range(1, 3):  # pages 1 and 2
    link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={page_number}"
    all_reviews.extend(perform_scraping(link))
print(all_reviews)
```

This is your code, updated to cover both pages.

I have also made the code dynamic with respect to the number of pages and cleaned it up a bit.

```python
from bs4 import BeautifulSoup
import re
import requests

def perform_scraping(link):
    soup = BeautifulSoup(requests.get(link).text, 'lxml')
    reviews = []
    # Each '_27M-vq' div is one review card.
    for review in soup.find_all('div', {'class': '_27M-vq'}):
        rating = review.find('div', {'class': re.compile('_3LWZlK .*_1BLPMq')})
        review_title = review.find('p', {'class': '_2-N8zT'})
        if rating and review_title:
            reviews.append((rating.text, review_title.text))
    return reviews

all_reviews = []
product_url = 'https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34'
response = requests.get(product_url)
soup = BeautifulSoup(response.text, "html.parser")

# Read the total page count from the "Page X of N" pagination label, if present.
pagination = soup.find('div', {'class': '_2MImiq _1Qnn1K'})
page_label = pagination.find('span', string=re.compile(r'Page.* of (\d+)')) if pagination else None
if page_label:
    pages = int(page_label.text.strip().split(' ')[-1])
    for page in range(1, pages + 1):
        review_page = f"{product_url}?page={page}"
        all_reviews.extend(perform_scraping(review_page))
else:
    all_reviews.extend(perform_scraping(product_url))
print(all_reviews)
```

Both snippets produce the same result for the given example, but the latter is dynamic and works for any number of review pages.
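The page-count detection hinges on parsing Flipkart's "Page X of N" pagination label; that step can be sketched in isolation (the helper name and sample labels below are made up for illustration):

```python
import re

def parse_page_count(label):
    """Extract N from a 'Page X of N' pagination label; default to 1 page."""
    match = re.search(r'Page\s*\d+\s*of\s*(\d+)', label)
    return int(match.group(1)) if match else 1

print(parse_page_count("Page 1 of 2"))    # 2
print(parse_page_count("no pagination"))  # 1
```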

Output (first code):

```
[('4', 'Worth the money'), ('5', 'Classy product'), ('5', 'Wonderful'), ('5', 'Great product'), ('4', 'Really Nice'), ('5', 'Just wow!'), ('5', 'Highly recommended.'), ('5', 'Brilliant'), ('5', 'Terrific')]
```

Output (second code):

```
[('4', 'Worth the money'), ('5', 'Classy product'), ('2', 'Moderate'), ('1', 'Very poor'), ('5', 'Wonderful'), ('5', 'Great product'), ('1', 'Waste of money!'), ('4', 'Really Nice'), ('5', 'Just wow!'), ('1', 'Useless product'), ('5', 'Highly recommended.'), ('5', 'Brilliant'), ('5', 'Terrific'), ('1', 'Utterly Disappointed')]
```

huangapple
  • Published on 2023-07-31 20:31:22
  • Please retain this link when reposting: https://go.coder-hub.com/76803660.html