2023-06-22 18:03:06

Get <title> from <head> of a url in R

Question

From any URL, I would like to get the text inside the <title> tag of its <head>. For example, for the Stack Overflow page used below, the text I want to extract is "javascript - Getting the title of a web page given the URL - Stack Overflow".

I have been trying to get the header with httr, but it does not seem to have the title:

library(httr)
url_head <- HEAD(url = "https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url")
url_head

gives

Response [https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url]
  Date: 2023-06-22 10:01
  Status: 200
  Content-Type: text/html; charset=utf-8
<EMPTY BODY>

I also tried

headers(url_head)

but the title is not there either.

Answer 1

Score: 2

I would use the {rvest} package for this.

We read in the URL, select the element with the CSS selector "head > title" (which reads as "the title tag directly inside the head tag"), and then use html_text() to extract its text.

library(rvest)
url <- "https://stackoverflow.com/questions/10940241/getting-the-title-of-a-web-page-given-the-url"
so_page <- read_html(url)
so_page |>
  html_element("head > title") |>
  html_text()
#> [1] "javascript - Getting the title of a web page given the URL - Stack Overflow"

Created on 2023-06-22 with reprex v2.0.2
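
The same selector can be tried offline on a literal HTML string, which also shows what comes back when a page has no title. This is only a minimal sketch: the two HTML snippets are made up for illustration, and it relies on read_html() accepting literal HTML as well as a URL.

```r
library(rvest)

# Made-up HTML snippets for illustration (not taken from a real page)
with_title <- "<html><head><title>Example page</title></head><body><p>Hi</p></body></html>"
without_title <- "<html><head></head><body><p>Hi</p></body></html>"

# Same steps as in the answer: parse, select head > title, extract the text
title_found <- read_html(with_title) |>
  html_element("head > title") |>
  html_text()

# html_element() yields a missing node when nothing matches,
# and html_text() turns that missing node into NA
title_missing <- read_html(without_title) |>
  html_element("head > title") |>
  html_text()

title_found
title_missing
```

Here title_found should hold the text "Example page" and title_missing should be NA, so checking the result with is.na() is a simple way to detect pages that have no title.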

Answer 2

Score: 1

You can use the rvest package, which provides powerful tools for web scraping:

library(rvest)
url <- "https://example.com"  # Replace with your desired URL
html <- read_html(url)
title <- html %>% html_node("head title") %>% html_text()

Make sure the rvest package is installed first, for example with install.packages("rvest").
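
As for why the httr attempt in the question came back empty: an HTTP HEAD request returns only the response headers (date, status, content type), never the body, while the <title> tag lives in the HTML document that makes up the body. If you prefer to stay with httr, a GET request followed by parsing should work. The sketch below is an assumption-laden illustration rather than part of either answer: the helper names get_page_title() and title_from_html() are hypothetical, and UTF-8 is assumed for the response text.

```r
library(httr)
library(rvest)

# Hypothetical helper: parse an HTML string and return the <title> text,
# or NA when the document has no title in its <head>
title_from_html <- function(html) {
  read_html(html) |>
    html_element("head > title") |>
    html_text(trim = TRUE)
}

# Hypothetical helper: download a page with GET and extract its title
get_page_title <- function(url) {
  resp <- GET(url)        # GET downloads the body; HEAD would not
  stop_for_status(resp)   # turn HTTP 4xx/5xx responses into R errors
  # Assumes the page is UTF-8 encoded
  html <- content(resp, as = "text", encoding = "UTF-8")
  title_from_html(html)
}
```

Calling get_page_title() with the Stack Overflow URL from the question would then be expected to return the same title as the rvest-only versions, network access permitting. Splitting out title_from_html() keeps the parsing step usable on HTML you already have in memory.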
