Posted 2023-02-24 03:41:28

Viewing the Full JSON on a Webpage

Question

I have this webpage over here: https://www.reddit.com/r/FunnyandSad/comments/112yfey/really_surprised_how_this_didnt_become_a_big_news/

I want to extract all comments from this website.

I learned how to do this in a previous question (https://stackoverflow.com/questions/75545026/converting-json-lists-into-data-frames):

library(jsonlite)
library(purrr)
library(dplyr)
library(tidyr)


URL <- "https://www.reddit.com/r/FunnyandSad/comments/112yfey/really_surprised_how_this_didnt_become_a_big_news/.json"

results = fromJSON(URL) |>
  pluck("data", "children") |>
  bind_rows() |>
  filter(row_number() > 1) |>
  unnest(data) |>
  select(id, author, body) |>
  mutate(comment_id = row_number(), .before = "id")

My Question: When I look at the results, I see that only 37 comments have been collected:

> dim(results)
[1] 37  4

However, the actual page has more than 1,000 comments.

Is there any way to modify the above code so that more comments are extracted? Is there some way to view the full JSON?

Thanks!

Update:

As per the suggestions in the comments, I tried using the "read_json" function:

# results[[2]]$data$children[[i]]$data$body

results = read_json(URL)

body_list <- list()
for (i in seq_along(results[[2]]$data$children)) {
    body <- results[[2]]$data$children[[i]]$data$body
    body_list[[i]] <- body
}

But this returns only 36 comments, not all of them.

Answer 1

Score: 2

The data has a nested structure: each comment carries its own replies, which are themselves a listing of comments. You can expand it recursively with the following function:

# Recursively collect the data of every comment (kind "t1") in a parsed listing.
get_comments <- function(x) {
  # Empty reply slots come back as "" and missing fields as NULL.
  if (is.null(x) || identical(x, "")) return(NULL)
  result <- list()
  if (is.null(names(x))) {
    # Unnamed list: a collection of items, so recurse into each element.
    for (p in x) {
      result <- c(result, get_comments(p))
    }
  } else {
    if (x$kind == "Listing") {
      result <- c(result, get_comments(x$data$children))
    } else if (x$kind == "t1") {
      # Keep this comment, then descend into its replies.
      result <- c(result, list(x$data), get_comments(x$data$replies))
    }
  }
  if (length(result) > 0) {
    result
  } else {
    NULL
  }
}

URL <- "https://www.reddit.com/r/FunnyandSad/comments/112yfey/really_surprised_how_this_didnt_become_a_big_news/.json"
json <- jsonlite::read_json(URL)
comments <- get_comments(json)
sapply(comments, function(x) x$body)

But that still returns only 198 values. There are a bunch of "more" blocks that contain just IDs; to get the comments behind them you will need to make additional API calls. See the morechildren endpoint for more details. It looks like you'll have to authenticate to access those endpoints.
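As a rough sketch of the first step, the IDs held in those "more" blocks can be gathered with the same kind of recursion. The function name `get_more_ids` and the hand-built `mock` list are hypothetical and only stand in for what `jsonlite::read_json()` returns. The sketch assumes each "more" item keeps its IDs in `data$children`, and it does not call the API.

```r
# Hypothetical helper: collect the comment IDs stored in "more" placeholders.
get_more_ids <- function(x) {
  # Anything that is not a list ("" for empty replies, NULL, plain strings) has no IDs.
  if (!is.list(x)) return(character(0))
  if (is.null(names(x))) {
    # Unnamed list: a collection of items, so recurse into each element.
    return(as.character(unlist(lapply(x, get_more_ids))))
  }
  if (identical(x$kind, "Listing")) {
    get_more_ids(x$data$children)
  } else if (identical(x$kind, "t1")) {
    # A comment: "more" placeholders can sit among its replies.
    get_more_ids(x$data$replies)
  } else if (identical(x$kind, "more")) {
    # Assumption: the placeholder lists the missing comment IDs in data$children.
    as.character(unlist(x$data$children))
  } else {
    character(0)
  }
}

# Hand-built stand-in for the parsed JSON (post listing, then comment listing).
mock <- list(
  list(kind = "Listing", data = list(children = list(
    list(kind = "t3", data = list(title = "post"))
  ))),
  list(kind = "Listing", data = list(children = list(
    list(kind = "t1", data = list(
      body = "top-level comment",
      replies = list(kind = "Listing", data = list(children = list(
        list(kind = "t1", data = list(body = "reply", replies = "")),
        list(kind = "more", data = list(children = list("abc123", "def456")))
      )))
    )),
    list(kind = "more", data = list(children = list("ghi789")))
  )))
)

more_ids <- get_more_ids(mock)
print(more_ids)
```

On the real data you would call `get_more_ids(json)` and then pass the IDs to the morechildren endpoint in batches. How to batch and authenticate is left to Reddit's API documentation.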
