Get <title> from <head> of a url in R
Question
Given any URL, I would like to get the text inside the <title> tag in the page's <head>. For example, for the Stack Overflow page used below, the text I want to extract is "javascript - Getting the title of a web page given the URL - Stack Overflow".
I have been trying to get the header with httr, but the response does not seem to contain the title:
library(httr)
url_head <- HEAD(url = "https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url")
url_head
gives
Response [https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url]
Date: 2023-06-22 10:01
Status: 200
Content-Type: text/html; charset=utf-8
<EMPTY BODY>
I also tried:
headers(url_head)
but the title is not there either.
Answer 1
Score: 2
I would use the {rvest} package for this.
We read in the page at the URL, select the element matching the CSS selector "head > title" (which reads "the title tag that is a direct child of the head tag"), and then use html_text() to extract its text.
library(rvest)
url <- "https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url"
so_page <- read_html(url)
so_page |>
html_element("head > title") |>
html_text()
#> [1] "javascript - Getting the title of a web page given the URL - Stack Overflow"
Created on 2023-06-22 with reprex v2.0.2
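As for why the httr attempt in the question comes back empty: an HTTP HEAD request returns only the response headers (date, status, content type), which have nothing to do with the HTML <head> element. The <title> lives in the response body, so a GET request is needed. A minimal sketch that stays with httr for the download, assuming both httr and rvest are installed; the helper names title_from_html and title_from_url are made up for illustration:

```r
library(httr)
library(rvest)

# Hypothetical helper: parse an HTML string and return its <title> text.
title_from_html <- function(html) {
  read_html(html) |>
    html_element("head > title") |>
    html_text(trim = TRUE)
}

# Hypothetical helper: GET (unlike HEAD) downloads the body, which is then parsed.
title_from_url <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # stop with an error on 4xx/5xx responses
  title_from_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (needs network access):
# title_from_url("https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url")
```

In practice read_html(url) already performs the download, so the httr step is only worth keeping if you need control over the request (headers, timeouts, status checks).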
Answer 2
Score: 1
You can use the rvest package, which provides powerful tools for web scraping:
library(rvest)
url <- "https://example.com" # Replace with your desired URL
html <- read_html(url)
title <- html %>% html_element("head title") %>% html_text()  # html_element() supersedes html_node() as of rvest 1.0.0
Make sure the rvest package is installed first, e.g. with install.packages("rvest").
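Since the question asks about "any URL", it may help to guard against pages that fail to load or have no <title> at all. The following is a rough sketch rather than a definitive implementation: safe_title is a hypothetical helper name, and inline HTML strings stand in for real URLs so that the example runs offline (read_html() accepts either).

```r
library(rvest)

# Hypothetical helper: return the <title> text, or NA if the input cannot be
# read or contains no <title>.
safe_title <- function(x) {
  tryCatch(
    read_html(x) |>
      html_element("head title") |>
      html_text(trim = TRUE),
    error = function(e) NA_character_
  )
}

# Inline HTML used in place of URLs (made-up content for illustration).
inputs <- c(
  "<html><head><title>First page</title></head><body></body></html>",
  "<html><head></head><body><p>No title here</p></body></html>"
)

# One title per input; a missing <title> gives NA instead of an error.
titles <- vapply(inputs, safe_title, character(1), USE.NAMES = FALSE)
```

With real URLs the same vapply() call works unchanged; unreachable pages then show up as NA in the result instead of aborting the whole loop.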