Not getting all reviews while scraping reviews in flipkart product?

Question
I am new to data science and learning how to do web scraping.
I am trying to scrape the reviews of a particular product, but I am not able to scrape all of them.
My code now runs without any error; originally it raised the error below, and then I changed the code.
Old Code:

def perform_scraping(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, "html.parser")
    all_div = soup.find_all("div", class_="_1AtVbE col-12-12")
    del all_div[0]
    for review in range(1, len(all_div)):
        rating = all_div[review].find("div", class_="_3LWZlK _1BLPMq").text
        review_title = all_div[review].find("p", "_2-N8zT").text
        all_reviews.append((rating, review_title))

all_reviews = []
for page_number in range(1, 2):
    i = page_number
    link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={i}"
    perform_scraping(link)
all_reviews
Error:

AttributeError                            Traceback (most recent call last)
Cell In[3], line 15
     13 i = page_number
     14 link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={i}"
---> 15 perform_scraping(link)
     16 all_reviews

Cell In[3], line 7, in perform_scraping(link)
      5 del all_div[0]
      6 for review in range(1, len(all_div)):
----> 7     rating = all_div[review].find("div", class_="_3LWZlK _1BLPMq").text
      8     review_title = all_div[review].find("p", "_2-N8zT").text
      9     all_reviews.append((rating, review_title))

AttributeError: 'NoneType' object has no attribute 'text'
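The root cause of this traceback is that BeautifulSoup's find() returns None when no element matches the selector, and .text is then called on that None. A minimal sketch of the failure mode, with the find() result stubbed as None rather than fetched from a live page:

```python
# find() returns None when no element matches the given class; calling
# .text on that None raises the AttributeError shown in the traceback.
elem = None  # stand-in for soup.find("div", class_="_3LWZlK _1BLPMq") on a miss
try:
    elem.text
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'text'
```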
I am sharing the updated code below; please help.

Updated Code:
def perform_scraping(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, "html.parser")
    all_div = soup.find_all("div", class_="_1AtVbE col-12-12")
    del all_div[0]
    all_reviews = []
    for review in range(0, len(all_div)):
        rating_elem = all_div[review].find("div", class_="_3LWZlK _1BLPMq")
        review_title_elem = all_div[review].find("p", class_="_2-N8zT")
        if rating_elem is not None and review_title_elem is not None:
            rating = rating_elem.text
            review_title = review_title_elem.text
            all_reviews.append((rating, review_title))
    return all_reviews

all_reviews = []
for page_number in range(1, 2):
    link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={page_number}"
    all_reviews.extend(perform_scraping(link))
print(all_reviews)
Please help.
Thanks in advance.
I hope to hear from you soon.
Answer 1

Score: 0
Taking your example: you aren't traversing all of the review pages. Your loop runs over range(1, 2), where the stop value 2 is exclusive, so it only fetches page 1 even though you want to scrape page 2 as well.
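The off-by-one is easy to see in isolation, since range() never yields its stop value:

```python
# range(start, stop) excludes stop, so range(1, 2) covers only page 1.
print(list(range(1, 2)))  # [1]
# To scrape pages 1 and 2, the stop value must be 3.
print(list(range(1, 3)))  # [1, 2]
```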
def perform_scraping(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, "html.parser")
    all_div = soup.find_all("div", class_="_1AtVbE col-12-12")
    del all_div[0]
    all_reviews = []
    for review in range(0, len(all_div)):
        rating_elem = all_div[review].find("div", class_="_3LWZlK _1BLPMq")
        review_title_elem = all_div[review].find("p", class_="_2-N8zT")
        if rating_elem is not None and review_title_elem is not None:
            rating = rating_elem.text
            review_title = review_title_elem.text
            all_reviews.append((rating, review_title))
    return all_reviews

all_reviews = []
for page_number in range(1, 3):
    link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={page_number}"
    all_reviews.extend(perform_scraping(link))
print(all_reviews)
This is your updated code for the same.
I have also made the code dynamic with respect to the number of pages and cleaned it up a bit.
from bs4 import BeautifulSoup
import re
import requests

def perform_scraping(link):
    soup = BeautifulSoup(requests.get(link).text, 'lxml')
    reviews = []
    for review in soup.find_all('div', {'class': '_27M-vq'}):
        rating = review.find('div', {'class': re.compile('_3LWZlK .*_1BLPMq')})
        review_title = review.find('p', {'class': '_2-N8zT'})
        if rating and review_title:
            reviews.append((rating.text, review_title.text))
    return reviews

all_reviews = []
product_url = 'https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34'
response = requests.get(product_url)
soup = BeautifulSoup(response.text, "html.parser")

# Guard each find() so a missing pagination block cannot raise AttributeError.
pagination = soup.find('div', {'class': '_2MImiq _1Qnn1K'})
page_info = pagination.find('span', string=re.compile(r'Page.* of (\d+)')) if pagination else None
if page_info:
    pages = int(page_info.text.strip().split(' ')[-1])
    for page in range(1, pages + 1):
        review_page = f"{product_url}?page={page}"
        all_reviews.extend(perform_scraping(review_page))
else:
    all_reviews.extend(perform_scraping(product_url))
print(all_reviews)
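The page count above is pulled out of a pagination label along the lines of "Page 1 of 25" (the exact wording is an assumption about Flipkart's markup at the time). Instead of splitting the text on spaces and taking the last token, the regex capture group can be used directly, which is a little more robust:

```python
import re

# Hypothetical pagination label, mimicking what the page renders.
label = "Page 1 of 25"
# Capture the total page count; allow commas in case large counts
# are formatted like "1,186" (an assumption, not verified).
match = re.search(r"Page\s+\d+\s+of\s+([\d,]+)", label)
pages = int(match.group(1).replace(",", "")) if match else 1
print(pages)  # 25
```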
Both of the snippets work the same for the given example, but the latter is dynamic and can work for any number of pages.
Output (First code):
[('4', 'Worth the money'), ('5', 'Classy product'), ('5', 'Wonderful'), ('5', 'Great product'), ('4', 'Really Nice'), ('5', 'Just wow!'), ('5', 'Highly recommended.'), ('5', 'Brilliant'), ('5', 'Terrific')]
Output (second code):
[('4', 'Worth the money'), ('5', 'Classy product'), ('2', 'Moderate'), ('1', 'Very poor'), ('5', 'Wonderful'), ('5', 'Great product'), ('1', 'Waste of money!'), ('4', 'Really Nice'), ('5', 'Just wow!'), ('1', 'Useless product'), ('5', 'Highly recommended.'), ('5', 'Brilliant'), ('5', 'Terrific'), ('1', 'Utterly Disappointed')]