2023-07-10 20:16:50

"Error: parse error: trailing garbage" : How do I get content of <script type="application/ld+json"> using R

Question

I'm trying to extract some data for a small research project in real estate economics and would like to get the price, the lot area, the description, the location, etc., out of <script type="application/ld+json">. I know I can use CSS selectors on this site. However, Hemnet aggregates offers from several websites, and the CSS selectors differ depending on the site the offer comes from. I would thus like to know how to get data from <script type="application/ld+json">, because it might also be useful later.

I have already tried this:

library(rvest)
library(xml2)
library(jsonlite)
library(dplyr)
o_url <- "https://www.hemnet.se/bostad/tomt-lisselbo-falu-kommun-svartskar-1-17-14704536"
o_html <- read_html(o_url)
o_json <- html_nodes(o_html, "[type=\"application/ld+json\"]") %>% html_text()
ldjson <- jsonlite::fromJSON(o_json)

However, I'm getting this error message:

Error: parse error: trailing garbage
          https://www.hemnet.se"   }   {     "@context": "http://schem
                     (right here) ------^

I think it might be a format problem but I don't really know much about json. I've only started using APIs recently. Indeed, 'o_json' looks like this:

[1] ...
[2] "\n  {\n    \"@context\": \"http://schema.org/\",\n    \"@type\": \"Product\",\n    \"name\": \"Svartskär 1:17\",\n    \"image\": \"https://bilder.hemnet.se/images/itemgallery_L/40/dd/40dde2279382da793a435e1d71879ebb.jpg\",\n    \"description\":  \"Nu finns möjligheten att förvärva en tomt med sjöläge i Lisselbo! Markarbeten är utförda och kommunalt avlopp är betalt. Kommunalt vatten kostar ca 30000 kronor att koppla in och finns vid tomtgränsen. Varmt välkomna att besöka tomten. Vid frågor kontakta oss.\",\n      \"offers\": {\n        \"@type\": \"Offer\",\n        \"priceCurrency\": \"SEK\",\n        \"price\": 950000,\n        \"priceValidUntil\": \"2020-09-14T13:32:20+0200\",\n        \"availability\": \"http://schema.org/InStock\",\n        \"validFrom\": \"2018-09-14T13:32:20+0200\",\n        \"url\": \"https://www.hemnet.se/bostad/tomt-lisselbo-falu-kommun-svartskar-1-17-14704536\"\n      },\n    \"mpn\": \"14704536\",\n      \"brand\": \"SkandiaMäklarna Falun\"\n  }\n"
[3] ...

I thought there was a problem with "\n" and tried this:

library(stringr)
clean_json <- str_remove_all(o_json, "\n")
ldjson <- jsonlite::fromJSON(clean_json)

However, I still get the same error message...

Thank you in advance for your help! If you also have any advice on my code (not only about the problem I've been facing), I would be happy to hear it!

Answer 1

Score: 1

Ok, my bad. I think I've found a(nother) way to do it.
First, you need to:

install.packages("jsonld")

For instance, if you want to get the price, use:

library(jsonld)
expanded <- jsonld_expand(o_json[2])
expa <- jsonlite::fromJSON(expanded)
o_price <- ((expa$`http://schema.org/offers`[[1]]$`http://schema.org/price`)[[1]][1,1])

We now get 950000 as numeric! If you find a better way to do it, please tell me.

UPDATE:
I've had some more error messages due to HTML entities in the description, which made the text invalid JSON-LD. You can thus use this (it needs library(stringr)), which will make most (not all) of the error messages disappear:

o_json <- str_remove_all(o_json, "\\\\&quot;")
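
Since some strings may still fail after that clean-up, it can help to check which elements are valid JSON before parsing them. The following is only a sketch using jsonlite::validate(); the vector `candidates` is a made-up stand-in for `o_json`, not data from the page:

```r
library(jsonlite)

# Hypothetical stand-in for the scraped character vector `o_json`
candidates <- c(
  '{"@type": "Product", "name": "ok"}',
  '{"@type": "WebSite"} {"@type": "Organization"}',  # two objects in one string
  '{"@type": "Place", "name": "bad \\&quot;quote"}'  # backslash before an HTML entity
)

# validate() returns TRUE/FALSE per string (FALSE carries an "err" attribute);
# as.logical() drops that attribute so vapply() gets a plain logical
is_valid <- vapply(
  candidates,
  function(x) as.logical(validate(x)),
  logical(1),
  USE.NAMES = FALSE
)

# Indices of the elements that still need cleaning
which(!is_valid)
```

The elements flagged here are the ones worth inspecting with cat() before deciding how to clean them.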

Answer 2

Score: 0

As noted in the comments, the first element returned by the selector is not valid JSON;

cat(o_json[1])

returns:

  {
    "@context": "http://schema.org",
    "@type": "WebSite",
    "name": "Hemnet",
    "url": "https://www.hemnet.se"
  }
  {
    "@context": "http://schema.org",
    "@type": "Organization",
    "url": "https://www.hemnet.se",
    "logo": "https://assets.hemnet.se/assets/images/hemnet-logo.svg"
  }

Although it apparently passes structured data / JSON-LD validation (https://validator.schema.org/), the second block is simply ignored.
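
If the first element is needed as well, one possible workaround is to turn the back-to-back objects into a JSON array before parsing. This is only a sketch: `two_objects` is a shortened, hand-written stand-in for `o_json[1]`, and the regular expression assumes that no string value contains a "}" directly followed by "{":

```r
library(jsonlite)

# Hypothetical stand-in for o_json[1]: two JSON objects back to back
two_objects <- '
  {
    "@context": "http://schema.org",
    "@type": "WebSite",
    "name": "Hemnet"
  }
  {
    "@context": "http://schema.org",
    "@type": "Organization",
    "url": "https://www.hemnet.se"
  }
'

# Put a comma between "}" and "{", then wrap everything in an array.
# Assumption: the "} {" pattern only occurs between top-level objects.
as_array <- paste0(
  "[",
  gsub("\\}\\s*\\{", "},{", two_objects, perl = TRUE),
  "]"
)

# simplifyVector = FALSE keeps each object as its own list
blocks <- fromJSON(as_array, simplifyVector = FALSE)
block_types <- vapply(blocks, function(b) b[["@type"]], character(1))
block_types
```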

And you probably should not call jsonlite::fromJSON(o_json), that is, fromJSON() on a vector of multiple JSON strings. It's not vectorized and, somewhat surprisingly, it neither complains nor uses just the first value; instead it seems to collapse the argument vector, and parsing fails again.
Simplified example:

o_json <- c('{"a" : 1}', '{"b" : 2}')
jsonlite::fromJSON(o_json)
#> Error: parse error: trailing garbage
#>                              {"a" : 1} {"b" : 2}
#>                      (right here) ------^
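
For contrast, a minimal sketch of parsing the same two strings one at a time, so that each call to fromJSON() sees exactly one JSON document (the variable name `json_strings` is only illustrative):

```r
library(jsonlite)

json_strings <- c('{"a" : 1}', '{"b" : 2}')

# One fromJSON() call per string instead of one call on the whole vector
parsed <- lapply(json_strings, fromJSON)

parsed[[1]]$a
parsed[[2]]$b
```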

Extracting data from json-ld elements might look something like this:

library(rvest)
library(dplyr)
library(purrr)
library(tidyr)
library(jsonlite)
o_url <- "https://www.hemnet.se/bostad/tomt-lisselbo-falu-kommun-svartskar-1-17-14704536"
o_html <- read_html(o_url)
o_json <- html_elements(o_html, "[type=\"application/ld+json\"]") %>% html_text()
# parse all but 1st JSON
p_json <- map(o_json[-1], parse_json)
# extract "@type" values to use as names for the list:
p_json <- set_names(p_json, map(p_json, "@type"))
# extract a few random values from list:
p_json$Product$description
#> [1] "Nu finns möjligheten att förvärva en tomt med sjöläge i Lisselbo! Markarbeten är utförda och kommunalt avlopp är betalt. Kommunalt vatten kostar ca 30000 kronor att koppla in och finns vid tomtgränsen. Varmt välkomna att besöka tomten. Vid frågor kontakta oss."
p_json$Product$offers$price
#> [1] 950000
# turn into wide dataframe (single line, all fields as columns),
# pivot to longer for better overview
p_json %>%
  as.data.frame() %>%
  select(!matches(".type$|.context$")) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything())
#> # A tibble: 16 × 2
#>    name                           value
#>    <chr>                          <chr>
#>  1 Product.name                   Svartskär 1:17
#>  2 Product.image                  https://bilder.hemnet.se/images/itemgallery_L…
#>  3 Product.description            Nu finns möjligheten att förvärva en tomt med…
#>  4 Product.offers.priceCurrency   SEK
#>  5 Product.offers.price           950000
#>  6 Product.offers.priceValidUntil 2020-09-14T13:32:20+0200
#>  7 Product.offers.availability    http://schema.org/InStock
#>  8 Product.offers.validFrom       2018-09-14T13:32:20+0200
#>  9 Product.offers.url             https://www.hemnet.se/bostad/tomt-lisselbo-fa…
#> 10 Product.mpn                    14704536
#> 11 Product.brand                  SkandiaMäklarna Falun
#> 12 Place.address.streetAddress    Svartskär 1:17
#> 13 Place.address.addressLocality  Lisselbo, Falu kommun
#> 14 Place.address.addressRegion    Dalarnas län
#> 15 Place.address.addressCountry   SE
#> 16 Place.address.postalCode       79196

Created on 2023-07-10 with reprex v2.0.2

The number of JSON-LD elements does not seem to be limited to 3; some pages also include structured data entries for events, for example.
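
Because the number and validity of the blocks can vary from page to page, a more tolerant variant might skip anything that fails to parse and then look blocks up by their "@type". The sketch below uses made-up input (`raw_blocks`, including the "Event" entry) and a hypothetical helper `safe_parse()`; it also assumes every parsed block has a single "@type" string:

```r
library(jsonlite)

# Hypothetical stand-in for `o_json` on a page with an extra block
raw_blocks <- c(
  '{"@type": "WebSite"} {"@type": "Organization"}',  # invalid: two objects
  '{"@type": "Product", "offers": {"price": 950000}}',
  '{"@type": "Place", "address": {"postalCode": "79196"}}',
  '{"@type": "Event", "name": "Open house"}'
)

# Return NULL instead of stopping when a block cannot be parsed
safe_parse <- function(txt) {
  tryCatch(parse_json(txt), error = function(e) NULL)
}

parsed <- lapply(raw_blocks, safe_parse)
parsed <- Filter(Negate(is.null), parsed)   # drop the blocks that failed

# Name each remaining block after its "@type" value
names(parsed) <- vapply(parsed, function(b) b[["@type"]], character(1))

names(parsed)
parsed$Product$offers$price
```

Looking values up by name rather than by position keeps the extraction working when a page has more or fewer blocks than expected.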
