Get <title> from <head> of a URL in R

Question

From any URL, I would like to get the text inside the <title> tag in its <head>. For example, for the Stack Overflow question used in the code below, the text "javascript - Getting the title of a web page given the URL - Stack Overflow" is what I want to extract.

I have been trying to get the header with httr, but it does not seem to have the title:

library(httr)
url_head <- HEAD(url = "https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url")
url_head

gives

Response [https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url]
  Date: 2023-06-22 10:01
  Status: 200
  Content-Type: text/html; charset=utf-8
<EMPTY BODY>

I also tried

headers(url_head)

but the title is not there either.

Answer 1

Score: 2

I would use the {rvest} package for this.

We read in the URL, select the element with the CSS selector "head > title" (which reads "the title tag that is a direct child of the head tag"), and then use html_text() to extract the text.

library(rvest)

url <- "https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url"
so_page <- read_html(url)

so_page |>
  html_element("head > title") |>
  html_text()
#> [1] "javascript - Getting the title of a web page given the URL - Stack Overflow"

Created on 2023-06-22 with reprex v2.0.2
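
As a side note on why the httr attempt in the question came back empty: a HEAD request asks the server for the response headers only, while the <title> lives in the HTML body, so the body has to be fetched (with GET, or with read_html() as above). The following is a minimal sketch rather than a definitive implementation. get_title_from_html() is a hypothetical helper name, and the sample HTML string is made up so that the example runs without network access.

```r
library(rvest)

# Hypothetical helper: extract the <title> text from HTML supplied as a string.
get_title_from_html <- function(html_string) {
  page <- read_html(html_string)                     # accepts literal HTML as well as URLs
  title_node <- html_element(page, "head > title")   # <title> as a direct child of <head>
  html_text(title_node, trim = TRUE)                 # trim surrounding whitespace
}

# Made-up sample document, used only for illustration.
sample_html <- "<html><head><title> Example page </title></head><body><p>Hi</p></body></html>"
get_title_from_html(sample_html)
#> [1] "Example page"

# With httr, the body could be fetched with GET instead of HEAD (needs network access):
# resp <- httr::GET("https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url")
# get_title_from_html(httr::content(resp, as = "text", encoding = "UTF-8"))
```

Keeping the parsing step separate from the download makes it easy to check the selector on a small string before pointing it at a live page.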

Answer 2

Score: 1

You can use the rvest package, which provides powerful tools for web scraping:

library(rvest)

url <- "https://example.com"  # Replace with your desired URL

html <- read_html(url)

title <- html %>% html_node("head title") %>% html_text()

Make sure the rvest package is installed first, for example with install.packages("rvest").
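
Some pages have no <title>, and some URLs cannot be read at all, so it may help to wrap the lookup. The sketch below is one possible way to do that, not part of the original answer. safe_title() is a hypothetical name, and the HTML strings are made-up inputs that run offline. It uses html_element(), the name rvest has preferred for html_node() since version 1.0.0; the older name should still work.

```r
library(rvest)

# Hypothetical wrapper: return the page title, or NA if the page cannot be
# read or has no <title>.
safe_title <- function(x) {
  tryCatch(
    {
      page <- read_html(x)  # x may be a URL, a file path, or literal HTML
      # "head title" matches a <title> anywhere inside <head>
      html_text(html_element(page, "head title"), trim = TRUE)
    },
    error = function(e) NA_character_  # unreadable input: fall back to NA
  )
}

# Made-up inputs, used only for illustration.
safe_title("<html><head><title>Hello</title></head></html>")
safe_title("<html><head></head><body><p>No title here</p></body></html>")
```

The first call should give "Hello" and the second NA, because html_text() returns NA for a missing element. Returning NA rather than stopping is convenient when mapping over many URLs.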

Posted by huangapple on 2023-06-22 18:03:06
Source: https://go.coder-hub.com/76530749.html