Trying to scrape data out of `<div>` elements with a specific class name
Question
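I am trying to scrape data from the following sports statistics page: https://www.sofascore.com/tournament/football/france/ligue-1/34#42273 (season 22/23)
The table I want to scrape is the "Standings" table, and I am using the `rvest` package. I already know that I can read out the contained HTML text based on certain class names. Now I would like to do exactly that on an example page about the French "Ligue 1" on sofascore.com. The class name of the `<div>` element I want to scrape is "sc-hLBbgP sc-eDvSVe gjJmZQ fRddxb sc-526d246a-0 evdGB" (screenshot below). I have tried to specify it in the `html_elements()` function, but it does not work.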
![enter image description here](https://i.stack.imgur.com/Satjg.png)
My current code:
```R
library(rvest)

URL <- "https://www.sofascore.com/tournament/football/france/ligue-1/34#42273"
HTML <- read_html(URL)

HTML %>%
  html_elements("div") %>%
  html_elements(".sc-hLBbgP.sc-eDvSVe.gjJmZQ.fRddxb.sc-526d246a-0.evdGB")
```
The result:
> {xml_nodeset (0)}

Up to the `<div>` step the HTML can still be read, but as soon as the class name is added the result is an empty node set. What am I doing wrong that `html_elements()` cannot get at the text in this `<div>` node?
# Answer 1
**Score**: 2
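Using your code, I ran a bit more to see what information I could find.
```r
rvest::html_text(HTML) %>% writeClipboard()
```
If you paste the clipboard contents into a text editor, you will find that your HTML elements do not exist in the page that was downloaded. There are two ways you can try to deal with this.

1) You can use a session (`?rvest::session`) with a user agent (`?user_agent`) to tell rvest to act like Chrome (search for how to use a user agent and you can find instructions).

2) You can skip all that and notice that the data is pulled from an API that you can connect to directly. Under the Network tab of the browser inspector you can see all of the network calls made when the page loads. Several of them are API calls with the data you need. The Headers tab shows that the URL for one of these calls is https://api.sofascore.com/api/v1/unique-tournament/34/season/42273/standings/total, which gives you the data for that table as JSON without loading any HTML at all. You can look through this tab until you find the JSON with all the data you need.

![enter image description here](https://i.stack.imgur.com/D8N8K.png)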
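The diagnosis above can be reproduced offline. The following is only a sketch: the two pages are stand-ins built with `rvest::minimal_html()`, not the real site, and the helper names `class_selector()` and `has_class()` are hypothetical. It turns a space-separated class attribute, as copied from the inspector, into a compound CSS selector and checks whether any node matches it.

```r
library(rvest)

# Hypothetical helper: "a b c" (as shown in the inspector) -> ".a.b.c"
class_selector <- function(class_attr) {
  classes <- strsplit(trimws(class_attr), "\\s+")[[1]]
  paste0(".", classes, collapse = "")
}

# Hypothetical helper: TRUE if at least one node carries all listed classes
has_class <- function(doc, class_attr) {
  length(html_elements(doc, class_selector(class_attr))) > 0
}

# Stand-in for the HTML as delivered by the server, before any JavaScript runs
static_page <- minimal_html('<div id="__next"></div>')

# Stand-in for the DOM that the browser inspector shows after rendering
rendered_page <- minimal_html(
  '<div id="__next"><div class="sc-hLBbgP evdGB">Standings</div></div>'
)

has_class(static_page, "sc-hLBbgP evdGB")   # element is absent from the static HTML
has_class(rendered_page, "sc-hLBbgP evdGB") # element is present once rendered
```

If the same check returns `FALSE` on the document returned by `read_html()`, the selector is not the problem; the element is simply not in the HTML that was downloaded, which is why going to the API directly is attractive.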
# Answer 2
**Score**: 2
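First, note that if your activity triggers any anti-scraping measures, they will just ban your IP.
Extending Adam's answer, this is what collecting and parsing the "Standings" table data might look like: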
``` r
library(rvest)
library(jsonlite)
library(dplyr)
library(purrr)
library(tidyr)

# we'll first extract season id(s) from the page source
# (also matches the hash in url, #42273)
url_ <- "https://www.sofascore.com/tournament/football/france/ligue-1/34#42273"
read_html(url_) %>%
  html_element("script#__NEXT_DATA__") %>%
  html_text() %>%
  fromJSON() %>%
  pluck("props", "pageProps", "seasons") %>%
  as_tibble() %>%
  print(n = 3)
#> # A tibble: 27 × 5
#>   name          year  editor    id seasonCoverageInfo
#>   <chr>         <chr> <lgl>  <int> <df[,0]>
#> 1 Ligue 1 23/24 23/24 FALSE  52571
#> 2 Ligue 1 22/23 22/23 FALSE  42273
#> 3 Ligue 1 21/22 21/22 FALSE  37167
#> # ℹ 24 more rows

# we can then use those values to construct the API call that was used to fill the Standings table:
season_id <- "42273"
api_call <- paste0("https://api.sofascore.com/api/v1/unique-tournament/34/season/", season_id, "/standings/total")

# fetch and parse the JSON, pull the table rows out of the nested list,
# re-shape them into columns and drop some columns:
fromJSON(api_call, simplifyVector = FALSE) %>%
  pluck("standings", 1, "rows") %>%
  tibble(rows = .) %>%
  unnest_wider(rows) %>%
  hoist(team, name = "name", sname = "shortName") %>%
  select(!where(is.list), -descriptions, -id)
#> # A tibble: 20 × 10
#>    name        sname position matches  wins scoresFor scoresAgainst losses draws
#>    <chr>       <chr>    <int>   <int> <int>     <int>         <int>  <int> <int>
#>  1 Paris Sain… PSG          1      38    27        89            40      7     4
#>  2 Lens        Lens         2      38    25        68            29      4     9
#>  3 Olympique … Mars…        3      38    22        67            40      9     7
#>  4 Stade Renn… Renn…        4      38    21        69            39     12     5
#>  5 Lille       Lille        5      38    19        65            44      9    10
#>  6 AS Monaco   AS M…        6      38    19        70            58     11     8
#>  7 Olympique … Lyon         7      38    18        65            47     12     8
#>  8 Clermont F… Cler…        8      38    17        45            49     13     8
#>  9 Nice        Nice         9      38    15        48            37     10    13
#> 10 Lorient     Lori…       10      38    15        52            53     13    10
#> 11 Stade de R… Reims       11      38    12        45            45     11    15
#> 12 Montpellier Mont…       12      38    15        65            62     18     5
#> 13 Toulouse    Toul…       13      38    13        51            57     16     9
#> 14 Stade Bres… Brest       14      38    11        44            54     16    11
#> 15 Strasbourg  Stra…       15      38     9        51            59     16    13
#> 16 Nantes      Nant…       16      38     7        37            55     16    15
#> 17 Auxerre     Auxe…       17      38     8        35            63     19    11
#> 18 Ajaccio     Ajac…       18      38     7        23            74     26     5
#> 19 Troyes      Troy…       19      38     4        45            81     22    12
#> 20 Angers      Ange…       20      38     4        33            81     28     6
#> # ℹ 1 more variable: points <int>
```
<sup>Created on 2023-07-11 with reprex v2.0.2</sup>
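To see what the reshaping steps do without calling the API, here is a small offline sketch. The JSON string is hand-written and only mimics the assumed shape of the response (`standings` → first element → `rows`, each row holding a nested `team` object); it contains just two teams and three fields copied from the table above, so it is an illustration rather than real API output.

```r
library(jsonlite)
library(dplyr)
library(purrr)
library(tidyr)

# Hand-written stand-in for the API response (assumed shape, reduced fields)
json_txt <- '{
  "standings": [
    {"rows": [
      {"team": {"name": "Lens",  "shortName": "Lens"},
       "position": 2, "matches": 38, "wins": 25},
      {"team": {"name": "Lille", "shortName": "Lille"},
       "position": 5, "matches": 38, "wins": 19}
    ]}
  ]
}'

standings <- fromJSON(json_txt, simplifyVector = FALSE) %>%
  pluck("standings", 1, "rows") %>%   # plain list with one element per team
  tibble(rows = .) %>%                # wrap it in a single list-column
  unnest_wider(rows) %>%              # one column per field; `team` is still a list
  hoist(team, name = "name", sname = "shortName") %>%  # pull two fields out of `team`
  select(!where(is.list))             # drop any column that is still a list

standings
```

Keeping `simplifyVector = FALSE` leaves the nesting as plain lists, which is what lets `unnest_wider()` and `hoist()` decide explicitly which nested fields become columns.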