Viewing the Full JSON on a Webpage

Question

I have this webpage over here: https://www.reddit.com/r/FunnyandSad/comments/112yfey/really_surprised_how_this_didnt_become_a_big_news/

I want to extract all comments from this website.

I learned how to do this in a previous question (https://stackoverflow.com/questions/75545026/converting-json-lists-into-data-frames):

library(jsonlite)
library(purrr)
library(dplyr)
library(tidyr)


URL  <- "https://www.reddit.com/r/FunnyandSad/comments/112yfey/really_surprised_how_this_didnt_become_a_big_news/.json"

results = fromJSON(URL) |>
  pluck("data", "children") |>
  bind_rows() |>
  filter(row_number() > 1) |>
  unnest(data) |>
  select(id, author, body) |>
  mutate(comment_id = row_number(), .before = "id")

My Question: When I look at the results, I see that only 37 comments have been collected:

> dim(results)
[1] 37  4

However, on the actual page, there are more than 1000 comments.

Is there any way to modify the above code so that more comments are extracted? Is there some way to view the full JSON?

Thanks!

Update:

As per the suggestions in the comments, I tried using the "read_json" function:

# results[[2]]$data$children[[i]]$data$body

results = read_json(URL)

body_list <- list()
for (i in seq_along(results[[2]]$data$children)) {
    body <- results[[2]]$data$children[[i]]$data$body
    body_list[[i]] <- body
}

But this only returns 36 comments instead of all comments?

Answer 1

Score: 2

The data has a nested structure. You can expand it recursively with the following function:

# Recursively walk the parsed JSON and collect the data of every comment ("t1") node
get_comments <- function(x) {
  # Nothing to expand: a missing node or an empty "replies" string
  if (is.null(x) || (length(x) == 1 && x == "")) return(NULL)
  result = list()
  if (is.null(names(x))) {
    # Unnamed list: a collection of child nodes, so visit each one
    for (p in x) {
      result = c(result, get_comments(p))
    }
  } else {
    if (x$kind == "Listing") {
      # A Listing wraps its child nodes in data$children
      result = c(result, get_comments(x$data$children))
    } else if (x$kind == "t1") {
      # A comment: keep its data, then descend into its replies
      result = c(result, list(x$data), get_comments(x$data$replies))
    }
  }
  if (length(result) > 0) {
    result
  } else {
    NULL
  }
}

URL  <- "https://www.reddit.com/r/FunnyandSad/comments/112yfey/really_surprised_how_this_didnt_become_a_big_news/.json"
json <- jsonlite::read_json(URL)
comments <- get_comments(json)
sapply(comments, function(x) x$body)

But that still returns only 198 values. There are a bunch of "more" blocks with just an ID, for which you will need to make additional API calls to get more information. See the morechildren endpoint for more details. It looks like you'll have to authenticate to access those endpoints.
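
To see how many comments are hidden behind those "more" blocks, one option is to walk the same nested structure and collect the IDs they hold. The following is only a minimal sketch: collect_more_ids is a hypothetical helper, the toy list is a hand-built stand-in for the parsed JSON with made-up IDs and bodies, and it assumes each "more" node keeps its IDs in data$children. It makes no API calls.

```r
# Toy stand-in for the parsed Reddit JSON (made-up IDs and bodies).
toy <- list(
  kind = "Listing",
  data = list(children = list(
    list(kind = "t1", data = list(
      id = "aaa", body = "top-level comment",
      replies = list(kind = "Listing", data = list(children = list(
        list(kind = "t1", data = list(id = "bbb", body = "a reply", replies = "")),
        list(kind = "more", data = list(count = 2, children = list("ccc", "ddd")))
      )))
    )),
    list(kind = "more", data = list(count = 1, children = list("eee")))
  ))
)

# Hypothetical helper: recursively collect the IDs stored in "more" nodes.
collect_more_ids <- function(x) {
  # An empty "replies" string or any other non-list value has nothing to expand
  if (!is.list(x)) return(character(0))
  # Unnamed list: a collection of child nodes, so visit each one
  if (is.null(names(x))) {
    return(as.character(unlist(lapply(x, collect_more_ids), use.names = FALSE)))
  }
  kind <- x$kind
  if (is.null(kind)) return(character(0))
  if (kind == "Listing") return(collect_more_ids(x$data$children))
  if (kind == "t1") return(collect_more_ids(x$data$replies))
  # Assumption: a "more" node lists the unexpanded comment IDs in data$children
  if (kind == "more") return(as.character(unlist(x$data$children)))
  character(0)
}

ids <- collect_more_ids(toy)
print(ids)
```

On the real data, calling collect_more_ids(json) on the result of jsonlite::read_json(URL) should give the IDs that still need to be fetched through the morechildren endpoint, so length(ids) is a rough measure of how many comments the plain .json response leaves out.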

Posted by huangapple on 2023-02-24 03:41:28
When reposting, please keep the link to this article: https://go.coder-hub.com/75549590.html