Get <title> from <head> of a URL in R

Question

From any URL, I would like to get the text inside the <title> tag in its <head>. For example, for the Stack Overflow page used below, the text "javascript - Getting the title of a web page given the URL - Stack Overflow" is what I want to extract.

I have been trying to get the header with httr, but it does not seem to have the title:

    library(httr)
    # HEAD requests only the response headers, not the HTML body
    url_head <- HEAD(url = "https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url")
    url_head

gives

    Response [https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url]
      Date: 2023-06-22 10:01
      Status: 200
      Content-Type: text/html; charset=utf-8
    <EMPTY BODY>

I also tried

    headers(url_head)

but the title is not there either.

Answer 1

Score: 2

I would use the {rvest} package for this.

We read in the URL, select the element with the CSS selector "head > title" (a <title> tag that is a direct child of the <head> tag), and then use html_text() to extract the text.

    library(rvest)
    url <- "https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url"
    so_page <- read_html(url)
    so_page |>
      html_element("head > title") |>
      html_text()
    #> [1] "javascript - Getting the title of a web page given the URL - Stack Overflow"

Created on 2023-06-22 with reprex v2.0.2
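
For anyone who wants to stay with httr as in the question: a HEAD request returns only the response headers, while the <title> lives in the HTML body, so the body has to be downloaded with GET and then parsed. The following is a minimal sketch under that assumption, not the only way to do it. The helper names get_page_title() and title_from_html() are made up for illustration, and the native pipe |> needs R 4.1 or later.

```r
library(httr)
library(rvest)

# Hypothetical helper: extract the <title> text from an HTML string.
title_from_html <- function(html) {
  read_html(html) |>
    html_element("head > title") |>
    html_text(trim = TRUE)  # trim = TRUE drops surrounding whitespace
}

# Hypothetical helper: download a page with GET and return its title.
get_page_title <- function(url) {
  resp <- GET(url)       # GET downloads the body; HEAD does not
  stop_for_status(resp)  # turn 4xx/5xx responses into an R error
  title_from_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Offline check on a literal HTML string (no network needed)
title_from_html("<html><head><title> Demo page </title></head><body></body></html>")
```

Calling get_page_title() with the Stack Overflow URL from the question should give the same title as the rvest-only code above, provided the site answers the request normally.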

Answer 2

Score: 1

You can use the rvest package, which provides powerful tools for web scraping:

    library(rvest)
    url <- "https://example.com" # Replace with your desired URL
    html <- read_html(url)
    # "head title" matches a <title> anywhere inside <head>;
    # html_node() still works but is superseded by html_element() in rvest 1.0.0+
    title <- html %>% html_node("head title") %>% html_text()

Make sure the rvest package is installed first, for example with install.packages("rvest").
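
One thing worth knowing when running this over arbitrary URLs: some pages have no <title> at all. In that case html_element() returns a missing node and html_text() turns it into NA rather than raising an error. The sketch below shows this on two literal HTML strings so it runs without network access. The helper name page_title() is made up for illustration.

```r
library(rvest)

# Hypothetical helper: return the <title> text, or NA if the page has none.
page_title <- function(page) {
  page |>
    html_element("head > title") |>
    html_text(trim = TRUE)
}

# Two small documents parsed from literal HTML strings
with_title    <- read_html("<html><head><title>Example Domain</title></head><body></body></html>")
without_title <- read_html("<html><head></head><body><p>No title here</p></body></html>")

page_title(with_title)     # the title text
page_title(without_title)  # NA, because nothing matched the selector
```

Checking the result with is.na() is therefore enough to detect pages without a title before using the value further.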

huangapple
  • Posted on 2023-06-22 18:03:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76530749.html