Viewing the Full JSON on a Webpage
Question
I have this webpage over here: https://www.reddit.com/r/FunnyandSad/comments/112yfey/really_surprised_how_this_didnt_become_a_big_news/
I want to extract all comments from this website.
I learned how to do this in a previous question (https://stackoverflow.com/questions/75545026/converting-json-lists-into-data-frames):
library(jsonlite)
library(purrr)
library(dplyr)
library(tidyr)
URL <- "https://www.reddit.com/r/FunnyandSad/comments/112yfey/really_surprised_how_this_didnt_become_a_big_news/.json"
results = fromJSON(URL) |>
  pluck("data", "children") |>
  bind_rows() |>
  filter(row_number() > 1) |>
  unnest(data) |>
  select(id, author, body) |>
  mutate(comment_id = row_number(), .before = "id")
My Question: When I look at the results, I see that only 37 comments have been collected:
> dim(results)
[1] 37 4
However, the actual page has more than 1000 comments.
Is there any way to modify the above code so that more comments are extracted? Is there some way to view the full JSON?
Thanks!
Update:
As per the suggestions in the comments, I tried using the "read_json" function:
# results[[2]]$data$children[[i]]$data$body
results = read_json(URL)
body_list <- list()
for (i in seq_along(results[[2]]$data$children)) {
  body <- results[[2]]$data$children[[i]]$data$body
  body_list[[i]] <- body
}
But this returns only 36 comments instead of all of them.
Answer 1
Score: 2
The data has a nested structure. You can do some recursive expansion with the following function:
get_comments <- function(x) {
  # Reddit encodes "no replies" as an empty string
  if (is.null(x) || identical(x, "")) return(NULL)
  result <- list()
  if (is.null(names(x))) {
    # Unnamed list: recurse into each element
    for (p in x) {
      result <- c(result, get_comments(p))
    }
  } else if (x$kind == "Listing") {
    # A Listing keeps its items in data$children
    result <- c(result, get_comments(x$data$children))
  } else if (x$kind == "t1") {
    # A comment: keep its data, then descend into its replies
    result <- c(result, list(x$data), get_comments(x$data$replies))
  }
  # Any other kind (such as "more") contributes nothing
  if (length(result) > 0) result else NULL
}
URL <- "https://www.reddit.com/r/FunnyandSad/comments/112yfey/really_surprised_how_this_didnt_become_a_big_news/.json"
json <- jsonlite::read_json(URL)
comments <- get_comments(json)
sapply(comments, function(x) x$body)
But that still returns only 198 values. There are a bunch of "more" blocks with just an ID, for which you will need to make additional API calls to get more information. See the morechildren endpoint for more details. It looks like you'll have to authenticate to access that endpoint.
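To see which comments are still missing, it may help to gather the IDs held in those "more" blocks first. The following is only a minimal sketch: the helper name get_more_ids and the toy list are made up for illustration, and it assumes each "more" block stores its comment IDs in data$children, as the JSON for this thread appears to do.

```r
# Recursively collect the comment IDs listed inside "more" blocks.
# `x` is a nested list such as the one returned by jsonlite::read_json().
# NOTE: get_more_ids is a hypothetical helper, not part of any package.
get_more_ids <- function(x) {
  # Leaves (strings, numbers, NULL) cannot contain "more" blocks
  if (!is.list(x)) return(character(0))
  # A "more" block: assumed to keep its IDs in data$children
  if (identical(x[["kind"]], "more")) {
    return(as.character(unlist(x[["data"]][["children"]])))
  }
  # Anything else (Listings, comments, replies): descend into every element
  as.character(unlist(lapply(unname(x), get_more_ids), use.names = FALSE))
}

# Hand-built toy structure that mimics the shape of the thread JSON
toy <- list(
  list(kind = "Listing", data = list(children = list())),
  list(kind = "Listing", data = list(children = list(
    list(kind = "t1", data = list(
      id = "c1", body = "top-level comment",
      replies = list(kind = "Listing", data = list(children = list(
        list(kind = "more", data = list(children = list("c2", "c3")))
      )))
    )),
    list(kind = "more", data = list(children = list("c4")))
  )))
)

more_ids <- get_more_ids(toy)
# Joined into one comma-separated string of IDs
children_param <- paste(more_ids, collapse = ",")
print(more_ids)
print(children_param)
```

On the real data, get_more_ids(json) should list the IDs that get_comments() skipped. As far as I know, the morechildren endpoint takes such a comma-separated ID list together with the ID of the post, but check the API documentation for the exact parameters, and expect to need authentication as noted above.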