How to scrape data from GDELT
Question
I am struggling to scrape data from GDELT.
http://data.gdeltproject.org/events/index.html
I aim to write code that automatically downloads, unzips, and merges files during specific periods, but despite numerous attempts, I have failed to do so.
Although the "gdeltr2" package exists, it does not retrieve some variables correctly from the original data.
I need your help.
Answer 1
Score: 4
The rvest package has the appropriate tools for this. We extract the href attributes from all link nodes (<a href = ...>...</a>), filter down to those that end with ".CSV.zip", and build the full URLs. Now we can download each file, and readr::read_tsv() will unpack, read, and combine the files for us!
library(rvest)
library(tidyverse)

gdelt_index_url <- "http://data.gdeltproject.org/events"

gdelt_dom <- read_html(gdelt_index_url)

url_df <-
  gdelt_dom |>
  html_nodes("a") |>
  html_attr("href") |>
  tibble() |>
  set_names("path") |>
  filter(str_detect(path, "\\.CSV\\.zip$")) |> # escape the dots so they match literally
  mutate(url = file.path(gdelt_index_url, path)) |>
  slice(1:3) # For the purpose of demonstration we use only the first three files

# download.file() is called for its side effect; each archive lands in the
# working directory under its original name
map2(url_df$url,
     url_df$path,
     download.file)

# readr unzips each archive, reads it, and row-binds the results into one tibble
gdelt_event_data <-
  read_tsv(url_df$path, col_names = FALSE)
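Since the question asks for files from specific periods, here is a minimal sketch of restricting the download to a date window. It assumes the daily event files are named YYYYMMDD.export.CSV.zip, as on the index page; start_date and end_date are hypothetical parameters, and url_df is built as above but without the demonstration slice(1:3).

# Hypothetical date window
start_date <- as.Date("2023-01-01")
end_date   <- as.Date("2023-01-07")

url_window <-
  url_df |>
  # parse the leading 8 digits of the file name as a date; monthly and
  # yearly archives have fewer digits, yield NA, and are dropped
  mutate(file_date = as.Date(str_extract(path, "^\\d{8}"), format = "%Y%m%d")) |>
  filter(!is.na(file_date), file_date >= start_date, file_date <= end_date)

# walk2() suits side effects like downloading better than map2()
walk2(url_window$url, url_window$path, download.file)
gdelt_window_data <- read_tsv(url_window$path, col_names = FALSE)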
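One follow-up, since the question mentions variables not being retrieved correctly: with col_names = FALSE, read_tsv() names the columns X1, X2, ... because the raw files ship without a header row. Mapping positions to the names in the GDELT event codebook is then a plain rename; a sketch for the first two columns only, which the codebook lists as the event ID and the date:

# Rename positional columns using names from the GDELT event codebook;
# extend the mapping with the remaining codebook columns as needed
gdelt_event_data <- gdelt_event_data |>
  rename(GLOBALEVENTID = X1, SQLDATE = X2)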