2023-07-18 07:33:31

How to scrape a web page and get all the data

Question

I am scraping reviews from the Play Store in order to study NLP, but my script only returns the first batch of reviews.

I noticed that after clicking "Show all reviews", the classes of the comment elements in the HTML stay the same, and so does the URL, so in theory all the data should be returned.

Can anyone help me?

import requests
from bs4 import BeautifulSoup

def text_of(tag):
    # Return the stripped text of a tag, or None when the tag was not found
    return tag.get_text(strip=True) if tag is not None else None

def scrape_playstore_simpler_reviews(url):
    reviews_list = []
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # App title: the <span> inside the <h1 class="Fd93Bb"> heading
        title_tag = soup.find('h1', class_='Fd93Bb')
        app_title = text_of(title_tag.find('span')) if title_tag is not None else None

        # Only the reviews present in the initial HTML response are found here
        reviews = soup.find_all('div', class_='EGFGHd')

        for review in reviews:
            user_name = text_of(review.find('div', class_='X5PpBb'))
            comment_text = text_of(review.find('div', class_='h3YV2d'))
            helpful_count = text_of(review.find('div', class_='AJTPZc'))
            post_date = text_of(review.find('span', class_='bp9Aid'))

            # The star rating is stored in the aria-label attribute
            star_tag = review.find('div', class_='iXRFPc')
            star_rating = star_tag.attrs.get('aria-label') if star_tag is not None else None

            review_data = {
                'Título do aplicativo': app_title,
                'Nome do usuário': user_name,
                'Texto do comentário': comment_text,
                'Qtd apoio ao comentário': helpful_count,
                'Data da postagem': post_date,
                'Qtd estrelas': star_rating
            }

            reviews_list.append(review_data)
            print(review_data)
    else:
        print('Falha ao fazer requisição HTTP')

    return reviews_list

url = 'https://play.google.com/store/apps/details?id=ru.zengalt.simpler&hl=pt_BR'
reviews = scrape_playstore_simpler_reviews(url)

I expected to get all of the app's reviews, not just the first batch.

Answer 1

Score: 1

Most websites now use AJAX to dynamically generate content. To get the same content that you see when viewing the page in a browser you'll need to use something like Selenium WebDriver. Sadly, the days of easily scraped static content are long past.
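
A minimal sketch of the browser-automation approach, assuming Selenium 4 and a Chrome driver are installed. The names `load_all_items` and `scrape_reviews_with_selenium`, the `.RHo1pe` selector (taken from another answer on this page), and the pause and round limits are illustrative assumptions. The step that opens the "Show all reviews" dialog is left as a comment because its selector is not given here. The scroll loop is kept separate from Selenium so it can be tried without a browser.

```python
import time
from typing import Callable, List


def load_all_items(count_items: Callable[[], int],
                   scroll_once: Callable[[], None],
                   max_rounds: int = 50,
                   pause: float = 1.0) -> int:
    """Scroll until the item count stops growing; return the final count."""
    previous = count_items()
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)  # give the page time to fetch the next batch
        current = count_items()
        if current <= previous:
            break  # nothing new appeared, so assume the end was reached
        previous = current
    return previous


def scrape_reviews_with_selenium(url: str, selector: str = ".RHo1pe") -> List[str]:
    """Hypothetical wiring; assumes more reviews load as the list is scrolled."""
    # Imported here so load_all_items works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumption: open the "Show all reviews" dialog here if it is needed.

        def count_items() -> int:
            return len(driver.find_elements(By.CSS_SELECTOR, selector))

        def scroll_once() -> None:
            items = driver.find_elements(By.CSS_SELECTOR, selector)
            if items:
                # Bring the last loaded review into view to prompt the next batch
                driver.execute_script("arguments[0].scrollIntoView();", items[-1])

        load_all_items(count_items, scroll_once)
        return [el.text for el in driver.find_elements(By.CSS_SELECTOR, selector)]
    finally:
        driver.quit()
```

Stopping when the count no longer grows is a heuristic: a slow network can make a batch arrive after the pause, so a longer `pause` or a retry may be needed in practice.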

Answer 2

Score: 1

It seems that the issue with your web scraping script lies in how you're trying to extract the reviews from the Google Play Store page. The reason you're only getting the first reviews is that the page uses dynamic loading, and the reviews are loaded as you scroll down.

To scrape all the reviews, you'll need to simulate scrolling or use the Play Store API to get all the reviews. Directly scraping the page won't work since the initial response doesn't contain all the reviews.

Here's a general outline of how you can scrape all reviews using the Play Store API:

1. Use the Play Store API to fetch the reviews data.
2. Parse the JSON response to extract the necessary information.
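
The outline above can be sketched as a pagination loop. The answer does not say which API it means, so everything specific here is hypothetical: the `fetch_page` callable stands in for whatever HTTP request the chosen API requires, and the JSON field names (`reviews`, `next_token`, `user`, `text`, `stars`) are invented for illustration only.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple


def parse_page(raw_json: str) -> Tuple[List[Dict[str, object]], Optional[str]]:
    """Extract the fields of interest and the next-page token from one page."""
    data = json.loads(raw_json)
    reviews = [
        {
            "user": item.get("user"),    # hypothetical field name
            "text": item.get("text"),    # hypothetical field name
            "stars": item.get("stars"),  # hypothetical field name
        }
        for item in data.get("reviews", [])
    ]
    return reviews, data.get("next_token")


def fetch_all_reviews(fetch_page: Callable[[Optional[str]], str],
                      max_pages: int = 100) -> List[Dict[str, object]]:
    """Call fetch_page(token) repeatedly until no next-page token comes back."""
    all_reviews: List[Dict[str, object]] = []
    token: Optional[str] = None  # None requests the first page
    for _ in range(max_pages):
        reviews, token = parse_page(fetch_page(token))
        all_reviews.extend(reviews)
        if not token:
            break  # last page reached
    return all_reviews
```

Passing the fetcher in as a function keeps the loop independent of any particular endpoint, and `max_pages` guards against an API that never stops returning tokens.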

Answer 3

Score: 0

If you open the browser's developer tools, press Ctrl+Shift+C, and hover the mouse over the comments inside the popup, you can see that they have a different class than the comments on the page outside the popup.

It helps to learn a few CSS selectors, like the one below, so you can test the ids and classes of the elements.

Try this in the console after opening the popup:

document.querySelectorAll('.RHo1pe')

It should return all 40 comments.
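
To run the same check from Python instead of the browser console, here is a stdlib-only sketch that counts the elements carrying a given class in an HTML string, for example page source saved after opening the popup. The names `ClassCounter` and `count_elements_with_class` are assumptions made for this example. Since the question already uses BeautifulSoup, `soup.select('.RHo1pe')` would be the closer equivalent of `querySelectorAll`.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a target class."""

    def __init__(self, target_class: str) -> None:
        super().__init__()
        self.target_class = target_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.count += 1


def count_elements_with_class(html: str, target_class: str) -> int:
    """Return how many elements in the HTML string carry target_class."""
    parser = ClassCounter(target_class)
    parser.feed(html)
    parser.close()
    return parser.count
```

Comparing the count for the HTML returned by `requests.get` with the count for source saved from the browser shows how many reviews the initial response is missing.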
