2023-07-11 02:54:01

Trying to scrape data out of `<div>` elements with specific class name

Question

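I am trying to scrape data from the following sports statistics page: https://www.sofascore.com/tournament/football/france/ligue-1/34#42273 (season 22/23).
I want to scrape the "Standings" table using the `rvest` package. I already know that I can read the contained HTML text based on certain class names, and I would like to do exactly that on this example page about the French "Ligue 1" on sofascore.com. The class name of the `<div>` element I want to scrape is "sc-hLBbgP sc-eDvSVe gjJmZQ fRddxb sc-526d246a-0 evdGB" (screenshot below). I tried to specify it in the `html_elements()` function, but it does not work.
![Screenshot](https://i.stack.imgur.com/Satjg.png)
My current code: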
```R
library(rvest)

URL <- "https://www.sofascore.com/tournament/football/france/ligue-1/34#42273"
HTML <- read_html(URL)

HTML %>%
  html_elements("div") %>%
  html_elements(".sc-hLBbgP.sc-eDvSVe.gjJmZQ.fRddxb.sc-526d246a-0.evdGB")
```

**The result:**

> {xml_nodeset (0)}

Up to the `<div>` step the HTML can still be read, but as soon as the class name is added the result is an empty node set. What am I doing wrong that keeps `html_elements()` from getting the text information in this `<div>` node?
# Answer 1
**Score**: 2
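Using your code, I ran a bit more to see what information I could find.
```r
# writeClipboard() is Windows-only; it copies the page text to the clipboard
rvest::html_text(HTML) %>% writeClipboard()
```

If you paste the result into a text editor, you will find that your HTML elements do not exist in it. There are two ways you can try to deal with this:

1) You can use a session (`?rvest::session`) with a user agent (`?user_agent`) to tell rvest to act like Chrome (search for how to use a user agent and you will find instructions).
2) You can skip all that and note that the data is pulled from an API, which you can connect to directly. Under the Network tab of the browser inspector you can see all of the network calls made when loading the page. Several of them are API calls with the data you need. The Headers tab shows that the URL for one of them is https://api.sofascore.com/api/v1/unique-tournament/34/season/42273/standings/total, which gives you the data for that table as JSON without loading any HTML at all. You can look through this tab until you find the JSON with all the data you need.

![Screenshot](https://i.stack.imgur.com/D8N8K.png)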
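As a rough sketch of the first option, the helpers below turn the space-separated class attribute from the question into a compound CSS selector, check whether a class name occurs in the downloaded source at all, and open an rvest session with a browser-like user agent. The function names (`class_selector`, `class_in_source`, `open_browser_like_session`) and the user-agent string are illustrative assumptions, not part of rvest. Note that `read_html()` does not execute JavaScript, so if the table is built in the browser after the page loads, a different user agent alone may still return an empty node set.

```r
# Turn a space-separated class attribute into a compound CSS class selector,
# e.g. "a b c" -> ".a.b.c" (all classes must be present on the same element).
class_selector <- function(class_attr) {
  classes <- strsplit(trimws(class_attr), "\\s+")[[1]]
  paste0(".", classes, collapse = "")
}

# TRUE if the class name occurs anywhere in the raw HTML text that was downloaded.
class_in_source <- function(html_text, class_name) {
  grepl(class_name, html_text, fixed = TRUE)
}

# Hypothetical helper: open an rvest session that sends a browser-like
# User-Agent header. The default string is a placeholder, not a recommended value.
open_browser_like_session <- function(
    url,
    ua = "Mozilla/5.0 (X11; Linux x86_64) Chrome/114.0 Safari/537.36") {
  rvest::session(url, httr::user_agent(ua))
}

# Build the selector for the <div> from the question
sel <- class_selector("sc-hLBbgP sc-eDvSVe gjJmZQ fRddxb sc-526d246a-0 evdGB")
sel

# Usage (needs network access and the rvest and httr packages, so left commented out):
# s <- open_browser_like_session(URL)
# page <- rvest::read_html(s)
# class_in_source(as.character(page), "sc-526d246a-0")
# rvest::html_elements(page, sel)
```

If `class_in_source()` returns `FALSE` for the session's page as well, the element is most likely not part of the served HTML, and the API route in option 2 is the more direct path.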


# Answer 2
**Score**: 2
First, note that if your activity triggers any anti-scraping measures, they will just ban your IP.

Extending Adam's answer, this is what collecting and parsing the "Standings" table data might look like:

``` r
library(rvest)
library(jsonlite)
library(dplyr)
library(purrr)
library(tidyr)

# we'll first extract season id(s) from the page source
# (also matches the hash in url, #42273)
url_ <- "https://www.sofascore.com/tournament/football/france/ligue-1/34#42273"
read_html(url_) %>%
  html_element("script#__NEXT_DATA__") %>%
  html_text() %>%
  fromJSON() %>%
  pluck("props", "pageProps", "seasons") %>%
  as_tibble() %>%
  print(n = 3)
#> # A tibble: 27 × 5
#>   name          year  editor    id seasonCoverageInfo
#>   <chr>         <chr> <lgl>  <int> <df[,0]>          
#> 1 Ligue 1 23/24 23/24 FALSE  52571                   
#> 2 Ligue 1 22/23 22/23 FALSE  42273                   
#> 3 Ligue 1 21/22 21/22 FALSE  37167                   
#> # ℹ 24 more rows

# we can then use those values to construct the API call that was used to fill the Standings table:
season_id <- "42273"
api_call <- paste0("https://api.sofascore.com/api/v1/unique-tournament/34/season/", season_id, "/standings/total")

# fetch / parse JSON / table data from nested list / re-shape / re-structure / drop some columns:
fromJSON(api_call, simplifyVector = FALSE) %>%
  pluck("standings", 1, "rows") %>%
  tibble(rows = . ) %>%
  unnest_wider(rows) %>%
  hoist(team, name = "name", sname = "shortName") %>%
  select(!where(is.list), -descriptions, -id)
```

Result:

```
#> # A tibble: 20 × 10
#>    name        sname position matches  wins scoresFor scoresAgainst losses draws
#>    <chr>       <chr>    <int>   <int> <int>     <int>         <int>  <int> <int>
#>  1 Paris Sain… PSG          1      38    27        89            40      7     4
#>  2 Lens        Lens         2      38    25        68            29      4     9
#>  3 Olympique … Mars…        3      38    22        67            40      9     7
#>  4 Stade Renn… Renn…        4      38    21        69            39     12     5
#>  5 Lille       Lille        5      38    19        65            44      9    10
#>  6 AS Monaco   AS M…        6      38    19        70            58     11     8
#>  7 Olympique … Lyon         7      38    18        65            47     12     8
#>  8 Clermont F… Cler…        8      38    17        45            49     13     8
#>  9 Nice        Nice         9      38    15        48            37     10    13
#> 10 Lorient     Lori…       10      38    15        52            53     13    10
#> 11 Stade de R… Reims       11      38    12        45            45     11    15
#> 12 Montpellier Mont…       12      38    15        65            62     18     5
#> 13 Toulouse    Toul…       13      38    13        51            57     16     9
#> 14 Stade Bres… Brest       14      38    11        44            54     16    11
#> 15 Strasbourg  Stra…       15      38     9        51            59     16    13
#> 16 Nantes      Nant…       16      38     7        37            55     16    15
#> 17 Auxerre     Auxe…       17      38     8        35            63     19    11
#> 18 Ajaccio     Ajac…       18      38     7        23            74     26     5
#> 19 Troyes      Troy…       19      38     4        45            81     22    12
#> 20 Angers      Ange…       20      38     4        33            81     28     6
#> # ℹ 1 more variable: points <int>
```

<sup>Created on 2023-07-11 with reprex v2.0.2</sup>
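If standings for more than one season are needed, the URL construction from this answer could be wrapped in a small helper and the requests spaced out, given the warning about anti-scraping measures. This is only a sketch: `standings_url()` and `fetch_standings()` are hypothetical names, the two-second pause is an arbitrary assumption rather than a known rate limit, and the endpoint pattern is simply the one shown in the answer.

```r
# Build the standings endpoint URL following the pattern used in the answer.
standings_url <- function(tournament_id, season_id) {
  paste0("https://api.sofascore.com/api/v1/unique-tournament/", tournament_id,
         "/season/", season_id, "/standings/total")
}

# Hypothetical helper: fetch the raw standings JSON for several seasons,
# pausing between requests. Returns a list named by season id.
fetch_standings <- function(season_ids, tournament_id = 34, pause_seconds = 2) {
  out <- vector("list", length(season_ids))
  names(out) <- as.character(season_ids)
  for (i in seq_along(season_ids)) {
    if (i > 1) Sys.sleep(pause_seconds)  # assumed delay, not a documented limit
    out[[i]] <- jsonlite::fromJSON(
      standings_url(tournament_id, season_ids[[i]]),
      simplifyVector = FALSE
    )
  }
  out
}

# URL for the 22/23 season id listed in the seasons table
standings_url(34, "42273")

# Usage (needs network access and the jsonlite package, so left commented out):
# raw <- fetch_standings(c("52571", "42273", "37167"))
# rows_2223 <- purrr::pluck(raw[["42273"]], "standings", 1, "rows")
```

Each element of the returned list can then go through the same `pluck()` / `unnest_wider()` / `hoist()` steps used for the single season.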

