How to unnest a dictionary from XML in R?
Question
I am attempting to convert this XML to a data frame in R:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml
```r
library(xml2)
library(tidyverse)

fileurl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xmllist <- as_list(read_xml(fileurl))
xml_df = tibble::as_tibble(xmllist) %>%
  unnest_longer(response)
row_wider = xml_df %>%
  unnest_wider(response)
row_df = row_wider %>%
  unnest(cols = names(.)) %>%
  unnest(cols = names(.)) %>%
  readr::type_convert()
```
The issue is that the `location_1` column is a dictionary and shows up as NA when I unnest. How can I get each of the values of this dictionary into this column? Any help is much appreciated, thanks.
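To see why the column comes out as NA, it can help to inspect a single `location_1` element after `as_list()`. The sketch below uses a hypothetical one-restaurant XML string (not the real file, though assumed to have the same shape) and shows that the element has no child nodes, so the list is empty, while the address is carried in its attributes.

```r
library(xml2)

# Hypothetical miniature of the feed: one restaurant, built without
# whitespace between elements so no stray text nodes appear.
demo_xml <- read_xml(paste0(
  '<response><row><row>',
  '<name>410</name>',
  '<location_1 human_address="{&quot;address&quot;:&quot;4509 BELAIR ROAD&quot;,',
  '&quot;city&quot;:&quot;Baltimore&quot;}" needs_recoding="false"/>',
  '</row></row></response>'
))

demo_list <- as_list(demo_xml)
loc <- demo_list$response$row$row$location_1

length(loc)                 # no child nodes, so nothing to unnest
names(attributes(loc))      # the data is stored as R attributes instead
attr(loc, "human_address")  # a JSON string holding the address fields
```

Because `unnest_wider()` only looks at list contents and not at attributes, the attribute values have to be pulled out explicitly before unnesting.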
Answer 1
Score: 3
The requested address data is stored as JSON in an attribute of the XML node.
Below I extract the attribute, convert the JSON, and then combine the results. The resulting data frame can then be bound to the result of the work done above.
See the comments for details.
```r
library(xml2)
library(jsonlite)
library(tidyverse)

# same URL as in the question
fileurl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"

# read the file as XML
page <- read_xml(fileurl)
# extract the restaurant nodes into a node set
restaurants <- page %>% xml_find_all(".//row/row")
# get the address data, which is stored as an attribute
addresses <- restaurants %>% xml_find_first(".//location_1") %>% xml_attr("human_address")
# `addresses` is a character vector of JSON strings;
# convert each JSON string to a one-row data frame
dfs <- lapply(addresses, function(address) {
  address %>% fromJSON() %>% as.data.frame()
})
# combine all of the data frames
answer <- bind_rows(dfs)
# print the result (only the first rows are reproduced here)
answer
#>                      address      city state zip
#> 1           4509 BELAIR ROAD Baltimore    MD    
#> 2              1919 FLEET ST Baltimore    MD    
#> 3             2844 HUDSON ST Baltimore    MD    
#> 4            3998 ROLAND AVE Baltimore    MD    
#> 5         2481 frederick ave Baltimore    MD    
#> 6            2722 HARFORD RD Baltimore    MD    
```
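As a rough sketch of that last binding step, the address columns can be placed next to the regular child elements of each restaurant. The XML below is a hypothetical two-restaurant miniature assumed to have the same structure as the real file, and only `name` and `zipcode` are extracted for brevity.

```r
library(xml2)
library(jsonlite)
library(dplyr)

# Hypothetical two-restaurant document mirroring the shape of the real feed.
demo_xml <- read_xml(paste0(
  '<response><row>',
  '<row><name>410</name><zipcode>21206</zipcode>',
  '<location_1 human_address="{&quot;address&quot;:&quot;4509 BELAIR ROAD&quot;,',
  '&quot;city&quot;:&quot;Baltimore&quot;,&quot;state&quot;:&quot;MD&quot;,',
  '&quot;zip&quot;:&quot;&quot;}" needs_recoding="false"/></row>',
  '<row><name>1919</name><zipcode>21231</zipcode>',
  '<location_1 human_address="{&quot;address&quot;:&quot;1919 FLEET ST&quot;,',
  '&quot;city&quot;:&quot;Baltimore&quot;,&quot;state&quot;:&quot;MD&quot;,',
  '&quot;zip&quot;:&quot;&quot;}" needs_recoding="false"/></row>',
  '</row></response>'
))

restaurants <- xml_find_all(demo_xml, ".//row/row")

# Plain child elements: one value per restaurant, in document order.
fields <- tibble(
  name    = xml_text(xml_find_first(restaurants, "./name")),
  zipcode = xml_text(xml_find_first(restaurants, "./zipcode"))
)

# Address attribute: parse each JSON string into a one-row data frame.
addresses <- restaurants %>%
  xml_find_first("./location_1") %>%
  xml_attr("human_address") %>%
  lapply(function(a) as.data.frame(fromJSON(a))) %>%
  bind_rows()

# Both tables have one row per restaurant in the same order,
# so they can be combined column-wise.
combined <- bind_cols(fields, addresses)
combined
```

Column-wise binding relies on both tables being in the same row order, which holds here because each is derived from the same `restaurants` node set.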
Answer 2
Score: 2
The `location_1` column is an empty list (hence the NAs) with two attributes: `human_address`, which is a JSON string, and a logical flag, `needs_recoding`. One option to get your desired result is to first extract the contents of these attributes and store them in a list. Afterwards, you can call `unnest_wider` twice to unnest the resulting list column.
```r
library(xml2)
library(tidyverse)

parse_location_1 <- function(x) {
  x$location_1 <- list(
    human_address = jsonlite::fromJSON(attr(x$location_1, "human_address")),
    needs_recoding = attr(x$location_1, "needs_recoding")
  )
  x
}

fileurl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xmllist <- as_list(read_xml(fileurl))
xml_df <- tibble::as_tibble(xmllist) %>%
  unnest_longer(response) |>
  mutate(response = map(
    response, parse_location_1
  ))
row_wider <- xml_df %>%
  unnest_wider(response) |>
  unnest_wider(location_1) |>
  unnest_wider(human_address)
row_df <- row_wider %>%
  unnest(cols = where(is.list)) %>%
  unnest(cols = where(is.list)) %>%
  readr::type_convert()
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   name = col_character(),
#>   zipcode = col_double(),
#>   neighborhood = col_character(),
#>   councildistrict = col_double(),
#>   policedistrict = col_character(),
#>   address = col_character(),
#>   city = col_character(),
#>   state = col_character(),
#>   zip = col_logical(),
#>   needs_recoding = col_logical(),
#>   response_id = col_character()
#> )

head(row_df)
#> # A tibble: 6 × 11
#>   name   zipcode neighborhood councildistrict policedistrict address city  state
#>   <chr>    <dbl> <chr>                  <dbl> <chr>          <chr>   <chr> <chr>
#> 1 410      21206 Frankford                  2 NORTHEASTERN   4509 B… Balt… MD   
#> 2 1919     21231 Fells Point                1 SOUTHEASTERN   1919 F… Balt… MD   
#> 3 SAUTE    21224 Canton                     1 SOUTHEASTERN   2844 H… Balt… MD   
#> 4 #1 CH…   21211 Hampden                   14 NORTHERN       3998 R… Balt… MD   
#> 5 #1 ch…   21223 Millhill                   9 SOUTHWESTERN   2481 f… Balt… MD   
#> 6 19TH …   21218 Clifton Park              14 NORTHEASTERN   2722 H… Balt… MD   
#> # ℹ 3 more variables: zip <lgl>, needs_recoding <lgl>, response_id <chr>
```
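One detail in the column specification above: `zip` is guessed as logical, presumably because every value is an empty string, which `type_convert()` treats as missing by default. If a character column is preferred, a type can be supplied for just that column while the rest are still guessed. A minimal sketch, using a hypothetical stand-in tibble rather than the real data:

```r
library(readr)
library(tibble)

# Hypothetical stand-in for a few of the unnested character columns.
demo <- tibble(
  zipcode = c("21206", "21231"),
  zip = c("", ""),
  needs_recoding = c("false", "false")
)

# Pin `zip` to character; the remaining columns are guessed as before.
converted <- type_convert(
  demo,
  col_types = cols(zip = col_character())
)

sapply(converted, class)
```

The empty strings still become `NA`, but the column keeps a character type, which avoids surprises if later rows do contain a ZIP code.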