How to scrape a web page and get all the data

Question

I am scraping reviews from the Play Store in order to study NLP, but my script only returns the first batch of reviews.

I noticed that when you click "Show all reviews", the classes of the review elements in the HTML stay the same, and so does the URL, so in theory all the data should be returned.

Can anyone help me?

import requests
from bs4 import BeautifulSoup


def text_or_none(tag):
    # Return the stripped text of a tag, or None if the tag was not found.
    return tag.get_text(strip=True) if tag else None


def scrape_playstore_simpler_reviews(url):
    reviews_list = []
    response = requests.get(url)

    if response.status_code != 200:
        print('Falha ao fazer requisição HTTP')
        return reviews_list

    soup = BeautifulSoup(response.content, 'html.parser')

    # The app title is the <span> inside the <h1 class="Fd93Bb"> heading.
    title_tag = soup.find('h1', class_='Fd93Bb')
    app_title = text_or_none(title_tag.find('span') if title_tag else None)

    # Only the reviews present in the initial HTML response are found here.
    for review in soup.find_all('div', class_='EGFGHd'):
        star_tag = review.find('div', class_='iXRFPc')

        review_data = {
            'Título do aplicativo': app_title,
            'Nome do usuário': text_or_none(review.find('div', class_='X5PpBb')),
            'Texto do comentário': text_or_none(review.find('div', class_='h3YV2d')),
            'Qtd apoio ao comentário': text_or_none(review.find('div', class_='AJTPZc')),
            'Data da postagem': text_or_none(review.find('span', class_='bp9Aid')),
            # The star rating is stored in the aria-label attribute, not in the text.
            'Qtd estrelas': star_tag.get('aria-label') if star_tag else None,
        }

        reviews_list.append(review_data)
        print(review_data)

    return reviews_list


url = 'https://play.google.com/store/apps/details?id=ru.zengalt.simpler&hl=pt_BR'
reviews = scrape_playstore_simpler_reviews(url)

I expected to get all of the app's reviews, not only the first few.

Answer 1

Score: 1

Most websites now use AJAX to dynamically generate content. To get the same content that you see when viewing the page in a browser you'll need to use something like Selenium WebDriver. Sadly, the days of easily scraped static content are long past.
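
As a rough illustration of the Selenium approach, here is a minimal sketch rather than a tested scraper. It assumes Selenium 4 with a Chrome driver is installed. The function names, the `open_all_xpath` parameter (a locator for the "Show all reviews" button that you would have to find by inspecting the page), and the default `.RHo1pe` selector (borrowed from another answer in this thread) are all assumptions and may not match the live page.

```python
import time


def scroll_until_stable(count_items, scroll_once, max_rounds=50, patience=2, pause=1.0):
    """Call scroll_once() until count_items() stops growing; return the final count."""
    last = count_items()
    stalled = 0
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)  # give the page's JavaScript time to load more items
        current = count_items()
        if current > last:
            last, stalled = current, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return last


def collect_review_texts(url, open_all_xpath, item_css=".RHo1pe"):
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # open_all_xpath is a caller-supplied, hypothetical locator for the
        # "Show all reviews" button; inspect the page to find the real one.
        driver.find_element(By.XPATH, open_all_xpath).click()
        time.sleep(2)

        def items():
            return driver.find_elements(By.CSS_SELECTOR, item_css)

        def scroll_once():
            found = items()
            if found:
                # Scrolling the last loaded review into view should trigger the next batch.
                driver.execute_script("arguments[0].scrollIntoView();", found[-1])

        scroll_until_stable(lambda: len(items()), scroll_once)
        return [element.text for element in items()]
    finally:
        driver.quit()
```

The scrolling loop is kept separate from the browser code so it can be tried with stand-in functions before pointing it at a real page. The `patience` argument lets the loop tolerate a slow batch before it gives up.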

Answer 2

Score: 1

It seems that the issue with your web scraping script lies in how you're trying to extract the reviews from the Google Play Store page. The reason you're only getting the first reviews is that the page uses dynamic loading, and the reviews are loaded as you scroll down.

To scrape all the reviews, you'll need to simulate scrolling or use the Play Store API to get all the reviews. Directly scraping the page won't work since the initial response doesn't contain all the reviews.

Here's a general outline of how you can scrape all reviews using the Play Store API:

  1. Use the Play Store API to fetch the reviews data.
  2. Parse the JSON response to extract the necessary information.
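
The two steps above can be sketched roughly as follows. The answer does not name a specific API, so everything about the payload here is an assumption: the `reviews` and `nextToken` keys, the per-review field names, and the `fetch_page` callable are hypothetical placeholders for whatever service you end up using. The stand-in pages let the sketch run offline.

```python
import json


def parse_reviews_page(raw_json):
    """Extract the fields of interest from one page of a (hypothetical) JSON payload."""
    payload = json.loads(raw_json)
    rows = [
        {
            "user_name": item.get("userName"),
            "text": item.get("content"),
            "stars": item.get("score"),
        }
        for item in payload.get("reviews", [])
    ]
    return rows, payload.get("nextToken")


def fetch_all_reviews(fetch_page, max_pages=1000):
    """Request pages until the service stops returning a continuation token."""
    all_rows, token = [], None
    for _ in range(max_pages):
        rows, token = parse_reviews_page(fetch_page(token))
        all_rows.extend(rows)
        if not token or not rows:
            break
    return all_rows


# Stand-in for a real HTTP call so the sketch runs offline.
FAKE_PAGES = {
    None: '{"reviews": [{"userName": "A", "content": "Good", "score": 5}], "nextToken": "p2"}',
    "p2": '{"reviews": [{"userName": "B", "content": "Bad", "score": 1}], "nextToken": null}',
}

reviews = fetch_all_reviews(FAKE_PAGES.get)
print(len(reviews))
```

In real use, `fetch_page` would wrap an HTTP request that sends the continuation token back to the service. Third-party packages such as google-play-scraper on PyPI appear to implement this kind of loop already, so check their documentation before writing your own.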

Answer 3

Score: 0

If you open the browser's developer tools, press Ctrl+Shift+C, and hover over the reviews inside the popup, you can see that their elements differ from those of the reviews shown outside the popup.

I think you should learn some CSS selectors, like the one below, so you can try out the IDs and classes of the elements.

Try running this in the console after opening the popup:

document.querySelectorAll('.RHo1pe')

It should return all 40 reviews.
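
The same class lookup can also be checked outside the browser. This is a minimal sketch, assuming you have saved the page's HTML from the developer tools after opening the popup. It uses only Python's standard library, and the sample markup is made up for illustration; only the `RHo1pe` and `EGFGHd` class names come from this thread.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_by_class(html, class_name):
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Made-up markup standing in for HTML saved from the browser with the popup open.
sample = '<div class="RHo1pe a">one</div><div class="RHo1pe">two</div><div class="EGFGHd">x</div>'
print(count_by_class(sample, "RHo1pe"))
```

If the count for the saved HTML is much higher than the count for the HTML returned by `requests.get`, that suggests the missing reviews are being added by JavaScript after the initial response.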

huangapple
  • Posted on July 18, 2023, at 07:33:31
  • If you repost this article, please keep this link: https://go.coder-hub.com/76708691.html