Question

I am struggling to scrape data from GDELT.

http://data.gdeltproject.org/events/index.html

I want to write code that automatically downloads, unzips, and merges the files for a specific time period, but despite numerous attempts I have not managed to get it working.

Although the "gdeltr2" package exists, it does not retrieve some variables correctly from the original data.

I need your help.

Answer 1

Score: 4

The rvest package has the appropriate tools for this. We extract the href attributes from all link <a href = ...>...</a> nodes, filter down to those that end with ".CSV.zip" and build the full URLs. Now we can download each file and readr::read_tsv() will unpack, read, and combine the files for us!

library(rvest)
library(tidyverse)

# Base URL of the GDELT event files; file names from the index are appended to it
gdelt_index_url <- 
  "http://data.gdeltproject.org/events"

gdelt_dom <- read_html(gdelt_index_url)

url_df <- 
  gdelt_dom |> 
  html_nodes("a") |> 
  html_attr("href") |> 
  tibble() |> 
  set_names("path") |> 
  filter(str_detect(path, "\\.CSV\\.zip$")) |> # escape the dots so they match literally
  mutate(url = file.path(gdelt_index_url, path)) |> 
  slice(1:3) # For the purpose of demonstration we use only the first three files
  
# Download each file into the working directory under its original name
walk2(url_df$url,
      url_df$path,
      download.file,
      mode = "wb") # binary mode keeps the zip archives intact on Windows

# Given a vector of paths, read_tsv() (readr >= 2.0) decompresses each zip and row-binds the results
gdelt_event_data <- 
  read_tsv(url_df$path, col_names = FALSE)
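
The question asks for a specific time period, while the demonstration above simply takes the first three files. A minimal sketch of a date filter follows. It assumes the daily files are named like "YYYYMMDD.export.CSV.zip"; that naming and the helper `filter_by_period()` are assumptions for illustration, not part of the original answer, so check them against the actual index listing.

```r
library(stringr)

# Hypothetical helper: keep only the file names whose leading date falls
# inside [from, to]. Assumes names of the form "YYYYMMDD.export.CSV.zip".
filter_by_period <- function(paths, from, to) {
  # Capture the eight leading digits; names that do not match yield NA
  stamp <- str_match(paths, "^(\\d{8})\\.export\\.CSV\\.zip$")[, 2]
  day <- as.Date(stamp, format = "%Y%m%d")
  keep <- !is.na(day) & day >= as.Date(from) & day <= as.Date(to)
  paths[keep]
}

# Made-up file names, used only to show the behaviour
example_paths <- c(
  "20230101.export.CSV.zip",
  "20230102.export.CSV.zip",
  "20230215.export.CSV.zip",
  "md5sums"
)

january_paths <- filter_by_period(example_paths, "2023-01-01", "2023-01-31")
print(january_paths)
```

In the pipeline above, a step such as filter(path %in% filter_by_period(path, "2023-01-01", "2023-01-31")) could take the place of slice(1:3), so that only the files for the chosen period are downloaded and merged.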
