Not getting all reviews while scraping reviews on a Flipkart product?

Question


I am new to data science and learning how to do web scraping.
I am trying to scrape the reviews of a particular product, but I am not able to scrape all of them.

My code now runs with no errors. At first it raised the error below, and then I changed the code.

Old Code:

```python
def perform_scraping(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, "html.parser")
    all_div = soup.find_all("div", class_="_1AtVbE col-12-12")
    del all_div[0]
    for review in range(1, len(all_div)):
        rating = all_div[review].find("div", class_="_3LWZlK _1BLPMq").text
        review_title = all_div[review].find("p", "_2-N8zT").text
        all_reviews.append((rating, review_title))

all_reviews = []
for page_number in range(1, 2):
    i = page_number
    link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={i}"
    perform_scraping(link)
all_reviews
```

Error:

```
AttributeError                            Traceback (most recent call last)
Cell In[3], line 15
     13 i = page_number
     14 link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={i}"
---> 15 perform_scraping(link)
     16 all_reviews

Cell In[3], line 7, in perform_scraping(link)
      5 del all_div[0]
      6 for review in range(1, len(all_div)):
----> 7 rating = all_div[review].find("div", class_="_3LWZlK _1BLPMq").text
      8 review_title = all_div[review].find("p", "_2-N8zT").text
      9 all_reviews.append((rating, review_title))

AttributeError: 'NoneType' object has no attribute 'text'
```
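From what I understand, the error happens because find() returns None when no matching element exists, and None has no .text attribute. A minimal stand-alone reproduction (no scraping involved):

```python
# bs4's find() returns None when no element matches the selector;
# calling .text on that None raises the AttributeError seen above.
elem = None  # what find() returns for a review card without a rating div
try:
    elem.text
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'text'
```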

I am sharing the updated code below; please help.

Updated Code:

```python
def perform_scraping(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, "html.parser")
    all_div = soup.find_all("div", class_="_1AtVbE col-12-12")
    del all_div[0]
    all_reviews = []
    for review in range(0, len(all_div)):
        rating_elem = all_div[review].find("div", class_="_3LWZlK _1BLPMq")
        review_title_elem = all_div[review].find("p", class_="_2-N8zT")
        if rating_elem is not None and review_title_elem is not None:
            rating = rating_elem.text
            review_title = review_title_elem.text
            all_reviews.append((rating, review_title))
    return all_reviews

all_reviews = []
for page_number in range(1, 2):
    link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={page_number}"
    all_reviews.extend(perform_scraping(link))
print(all_reviews)
```

Please help.

Thanks in advance.

I hope to hear from you soon.

Answer 1

Score: 0


Taking your example, you aren't traversing all the review pages. Your loop runs over range(1, 2), and since the upper bound of range is exclusive, it only visits page 1 — page 2, which you also want to scrape, is never requested.
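The exclusive upper bound is easy to check directly:

```python
# range() excludes its stop value, so range(1, 2) covers page 1 only,
# while range(1, 3) covers pages 1 and 2.
print(list(range(1, 2)))  # [1]
print(list(range(1, 3)))  # [1, 2]
```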

```python
def perform_scraping(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, "html.parser")
    all_div = soup.find_all("div", class_="_1AtVbE col-12-12")
    del all_div[0]
    all_reviews = []
    for review in range(0, len(all_div)):
        rating_elem = all_div[review].find("div", class_="_3LWZlK _1BLPMq")
        review_title_elem = all_div[review].find("p", class_="_2-N8zT")
        if rating_elem is not None and review_title_elem is not None:
            rating = rating_elem.text
            review_title = review_title_elem.text
            all_reviews.append((rating, review_title))
    return all_reviews

all_reviews = []
for page_number in range(1, 3):  # pages 1 and 2
    link = f"https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34?pid=BAAGM82GUHTQ3KFZ&lid=LSTBAAGM82GUHTQ3KFZVCWDQ9&marketplace=FLIPKART&page={page_number}"
    all_reviews.extend(perform_scraping(link))
print(all_reviews)
```

This is your code, updated to cover both pages.

I have also made the code dynamic with respect to the number of pages and cleaned it up a bit.

```python
from bs4 import BeautifulSoup
import re
import requests

def perform_scraping(link):
    soup = BeautifulSoup(requests.get(link).text, 'lxml')
    reviews = []
    # Each '_27M-vq' div is one review card.
    for review in soup.find_all('div', {'class': '_27M-vq'}):
        rating = review.find('div', {'class': re.compile('_3LWZlK .*_1BLPMq')})
        review_title = review.find('p', {'class': '_2-N8zT'})
        if rating and review_title:
            reviews.append((rating.text, review_title.text))
    return reviews

all_reviews = []
product_url = 'https://www.flipkart.com/prowl-tiger-shroff-push-up-board-upper-body-workout-push-up-bar/product-reviews/itm0487671f4df34'
response = requests.get(product_url)
soup = BeautifulSoup(response.text, "html.parser")

# Read the total page count from the "Page X of N" pagination label, if present.
pagination = soup.find('div', {'class': '_2MImiq _1Qnn1K'})
page_label = pagination.find('span', string=re.compile(r'Page.* of (\d+)')) if pagination else None
if page_label:
    pages = int(page_label.text.strip().split(' ')[-1])
    for page in range(1, pages + 1):
        review_page = f"{product_url}?page={page}"
        all_reviews.extend(perform_scraping(review_page))
else:
    all_reviews.extend(perform_scraping(product_url))
print(all_reviews)
```

Both snippets produce the same result for the given example, but the latter is dynamic and works for any number of review pages.
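The page-count detection hinges on parsing Flipkart's "Page X of N" pagination label; that step can be sketched in isolation (the helper name and sample labels below are made up for illustration):

```python
import re

def parse_page_count(label):
    """Extract N from a 'Page X of N' pagination label; default to 1 page."""
    match = re.search(r'Page\s*\d+\s*of\s*(\d+)', label)
    return int(match.group(1)) if match else 1

print(parse_page_count("Page 1 of 2"))    # 2
print(parse_page_count("no pagination"))  # 1
```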

Output (first code):

```
[('4', 'Worth the money'), ('5', 'Classy product'), ('5', 'Wonderful'), ('5', 'Great product'), ('4', 'Really Nice'), ('5', 'Just wow!'), ('5', 'Highly recommended.'), ('5', 'Brilliant'), ('5', 'Terrific')]
```

Output (second code):

```
[('4', 'Worth the money'), ('5', 'Classy product'), ('2', 'Moderate'), ('1', 'Very poor'), ('5', 'Wonderful'), ('5', 'Great product'), ('1', 'Waste of money!'), ('4', 'Really Nice'), ('5', 'Just wow!'), ('1', 'Useless product'), ('5', 'Highly recommended.'), ('5', 'Brilliant'), ('5', 'Terrific'), ('1', 'Utterly Disappointed')]
```

huangapple
  • Published on 2023-07-31 20:31:22
  • Please retain this link when reposting: https://go.coder-hub.com/76803660.html