How to scrape data from the GDELT website

Question

I am struggling to scrape data from GDELT.

http://data.gdeltproject.org/events/index.html

I want to write code that automatically downloads, unzips, and merges the files for a specific time period, but despite numerous attempts I have not managed to get it working.

Although the "gdeltr2" package exists, it does not retrieve some variables correctly from the original data.

I need your help.

Answer 1

Score: 4

The rvest package has the right tools for this. We extract the href attribute from every link node (<a href = ...>...</a>), keep only the links that end with ".CSV.zip", and build the full URLs. We can then download each file, and readr::read_tsv() will unpack, read, and combine the files for us!

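library(rvest)
library(tidyverse)

gdelt_index_url <- 
  "http://data.gdeltproject.org/events"

# Parse the index page that lists the event files
gdelt_dom <- read_html(gdelt_index_url)

url_df <- 
  gdelt_dom |> 
  html_nodes("a") |> 
  html_attr("href") |> 
  tibble() |> 
  set_names("path") |> 
  # Escape the dots so they match literally rather than any character
  filter(str_detect(path, "\\.CSV\\.zip$")) |> 
  mutate(url = file.path(gdelt_index_url, path)) |> 
  slice(1:3) # For the purpose of demonstration we use only the first three files

# Download each archive into the working directory.
# walk2() is used for its side effect; mode = "wb" keeps the zip intact on Windows.
walk2(url_df$url,
      url_df$path,
      \(url, path) download.file(url, path, mode = "wb"))

# read_tsv() accepts a vector of paths, decompresses each zip, and row-binds the result
gdelt_event_data <- 
  read_tsv(url_df$path, col_names = FALSE)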
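Since the question asks for files from a specific period, here is a minimal sketch of how the demonstration's slice(1:3) might be swapped for a date filter. It assumes the daily file names begin with an 8-digit date (for example "20230705.export.CSV.zip"); the helper name filter_by_period and the example file names are hypothetical and only serve the illustration.

```r
# Hypothetical helper: keep only files whose leading date falls in [from, to].
# Assumes file names start with an 8-digit date such as "20230705.export.CSV.zip".
filter_by_period <- function(paths, from, to) {
  # Take the first 8 characters when they are all digits, otherwise NA
  date_str <- ifelse(grepl("^[0-9]{8}", paths), substr(paths, 1, 8), NA_character_)
  file_date <- as.Date(date_str, format = "%Y%m%d")

  # Drop names without a valid date and dates outside the requested period
  keep <- !is.na(file_date) &
    file_date >= as.Date(from) &
    file_date <= as.Date(to)
  paths[keep]
}

# Made-up file names standing in for the scraped `path` column
example_paths <- c(
  "20230704.export.CSV.zip",
  "20230705.export.CSV.zip",
  "20230706.export.CSV.zip",
  "not-a-date.zip"
)

selected <- filter_by_period(example_paths, "2023-07-05", "2023-07-06")
print(selected)
```

In the pipeline above, this could take the place of slice(1:3), for instance as filter(path %in% filter_by_period(path, "2023-07-01", "2023-07-06")).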

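To see the merge step in isolation, the following offline sketch writes two small tab-separated files and reads them in one call. It assumes readr 2.0 or later, where read_tsv() accepts a vector of paths and an id argument; the file names and contents are made up and are not real GDELT records.

```r
library(readr)

# Two small, made-up tab-separated files standing in for downloaded GDELT files
tmp_dir <- tempfile("gdelt_demo_")
dir.create(tmp_dir)
file_a <- file.path(tmp_dir, "20230705.export.CSV")
file_b <- file.path(tmp_dir, "20230706.export.CSV")
writeLines(c("1\tUSA\t20230705", "2\tFRA\t20230705"), file_a)
writeLines("3\tDEU\t20230706", file_b)

# A vector of paths is read and row-bound in one call;
# `id` adds a column recording which file each row came from
combined <- read_tsv(
  c(file_a, file_b),
  col_names = FALSE,
  id = "source_file",
  show_col_types = FALSE
)

print(combined)
```

Keeping the source file as a column can make it easier to trace a row back to the day it was published once many files are merged.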
huangapple
  • Published on July 7, 2023, 04:08:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76632236.html