将复杂的长宽数据集转换为R中的长数据集

huangapple go评论76阅读模式
英文:

Complex Long-Wide Dataset to Long Dataset in R

问题

我有一个看起来像这样的复杂数据集:

df1 <- tibble::tribble(~"Canada > London", ~"", ~"Notes", ~"United Kingdom > London", ~"", ~"",
"Restaurant", "Price", "Range", "Restaurant", "Price", "Range",
"Fried beef", "27", "25-30", "Fried beef", "29", "25 - 35",
"Fried potato", "5", "3 - 8", "Fried potato", "8", "3 - 8",
"Bar", "Price", "Range", "Price", "Range", "",
"Beer Lager", "5", "4 - 8", "Beer Lager", "6", "4 - 8",
"Beer Dark", "4", "3 - 7", "Beer Dark", "5", "3 - 7")

或者,以可视化形式:

将复杂的长宽数据集转换为R中的长数据集

它在参数方面很长(如"Beer Lager"、"Beer Dark"等),并且在数据输入方面很宽(像"Canada > London"或"United Kingdom > London"这样的多个宽元素)。

期望的输出应该是两个数据集,看起来像这样:

  1. 第一个数据集(数值):

将复杂的长宽数据集转换为R中的长数据集

  1. 第二个数据集(范围):

将复杂的长宽数据集转换为R中的长数据集

任何建议将不胜感激 将复杂的长宽数据集转换为R中的长数据集

英文:

I have a complex dataset that looks like this:

df1 &lt;- tibble::tribble(~&quot;Canada &gt; London&quot;,	~&quot;&quot;,	~&quot;Notes&quot;,	~&quot;United Kingdom &gt; London&quot;,	~&quot;&quot;,	~&quot;&quot;,
&quot;Restaurant&quot;,	&quot;Price&quot;,	&quot;Range&quot;,	&quot;Restaurant&quot;,	&quot;Price&quot;,	&quot;Range&quot;,
&quot;Fried beef&quot;,	&quot;27&quot;,	&quot;25-30&quot;,	&quot;Fried beef&quot;,	&quot;29&quot;,	&quot;25 - 35&quot;,
&quot;Fried potato&quot;,	&quot;5&quot;,	&quot;3 - 8&quot;,	&quot;Fried potato&quot;,	&quot;8&quot;,	&quot;3 - 8&quot;,
&quot;Bar&quot;,	&quot;Price&quot;,	&quot;Range&quot;,	&quot;Price&quot;,	&quot;Range&quot;,	&quot;&quot;,
&quot;Beer Lager&quot;,	&quot;5&quot;,	&quot;4 - 8&quot;,	&quot;Beer Lager&quot;,	&quot;6&quot;,	&quot;4 - 8&quot;,
&quot;Beer Dark&quot;,	&quot;4&quot;,	&quot;3 - 7&quot;,	&quot;Beer Dark&quot;,	&quot;5&quot;,	&quot;3 - 7&quot;)

Or, for visual representation:

将复杂的长宽数据集转换为R中的长数据集

It is long in parameters (like Beer Lager, Beer Dark, ....) and wide by the data input (many wide elements like Canada > London, or United Kingdom > London).

The desired output would be two datasets that should look like this:

  1. The first dataset (the Values):

将复杂的长宽数据集转换为R中的长数据集

  1. The second dataset (the Ranges):

将复杂的长宽数据集转换为R中的长数据集

Any suggestions would be much appreciated 将复杂的长宽数据集转换为R中的长数据集

答案1

得分: 2

你的数据既不宽也不长,而是一个杂乱的数据表,需要进行一些清理才能将其转换为整洁的数据。之后,你可以使用 tidyr::pivot_wider 来获取你想要的表格:

library(dplyr)
library(tidyr)
library(purrr)

tidy_data <- function(.data, cols) {
  .data <- .data[cols]
  place <- names(.data)[[1]]
  
  .data %>%
    rename(product = 1, price = 2, range = 3) %>%
    filter(!price %in% c("Price", "Range")) %>%
    mutate(place = place)
}

df1_tidy <- purrr::map_dfr(list(1:3, 4:6), tidy_data, .data = df1)

df1_tidy %>%
  select(place, product, price) %>%
  pivot_wider(names_from = product, values_from = price)
#> # A tibble: 2 × 5
#>   place                   `Fried beef` `Fried potato` `Beer Lager` `Beer Dark`
#>   <chr>                   <chr>        <chr>          <chr>        <chr>      
#> 1 Canada > London         27           5              5            4          
#> 2 United Kingdom > London 29           8              6            5

df1_tidy %>%
  select(place, product, range) %>%
  pivot_wider(names_from = product, values_from = range, names_glue = "{product} Range")
#> # A tibble: 2 × 5
#>   place                   `Fried beef Range` Fried potato Range `Beer Lager Range` `Beer Dark Range`
#>   <chr>                   <chr>              <chr>               <chr>              <chr>            
#> 1 Canada > London         25-30              3 - 8               4 - 8              3 - 7            
#> 2 United Kingdom > London 25 - 35            3 - 8               4 - 8              3 - 7

请注意,代码部分未被翻译。

英文:

Your data is neither wide nor long but is a messy data table which needs some cleaning to convert it to tidy data. Afterwards you could get your desired tables using tidyr::pivot_wider:

library(dplyr)
library(tidyr)
library(purrr)

tidy_data &lt;- function(.data, cols) {
  .data &lt;- .data[cols]
  place &lt;- names(.data)[[1]]
  
  .data |&gt; 
    rename(product = 1, price = 2, range = 3) |&gt; 
    filter(!price %in% c(&quot;Price&quot;, &quot;Range&quot;)) |&gt;
    mutate(place = place)
}

df1_tidy &lt;- purrr::map_dfr(list(1:3, 4:6), tidy_data, .data = df1)

df1_tidy |&gt; 
  select(place, product, price) |&gt; 
  pivot_wider(names_from = product, values_from = price)
#&gt; # A tibble: 2 &#215; 5
#&gt;   place                   `Fried beef` `Fried potato` `Beer Lager` `Beer Dark`
#&gt;   &lt;chr&gt;                   &lt;chr&gt;        &lt;chr&gt;          &lt;chr&gt;        &lt;chr&gt;      
#&gt; 1 Canada &gt; London         27           5              5            4          
#&gt; 2 United Kingdom &gt; London 29           8              6            5

df1_tidy |&gt; 
  select(place, product, range) |&gt; 
  pivot_wider(names_from = product, values_from = range, names_glue = &quot;{product} Range&quot;)
#&gt; # A tibble: 2 &#215; 5
#&gt;   place                   `Fried beef Range` Fried potato Rang…&#185; Beer …&#178; Beer …&#179;
#&gt;   &lt;chr&gt;                   &lt;chr&gt;              &lt;chr&gt;               &lt;chr&gt;   &lt;chr&gt;  
#&gt; 1 Canada &gt; London         25-30              3 - 8               4 - 8   3 - 7  
#&gt; 2 United Kingdom &gt; London 25 - 35            3 - 8               4 - 8   3 - 7  
#&gt; # … with abbreviated variable names &#185;​`Fried potato Range`, &#178;​`Beer Lager Range`,
#&gt; #   &#179;​`Beer Dark Range`

答案2

得分: 1

我同意@stefan。实际上,你有4个表格,或者看待方式不同,也可以视为2个表格。下面是两个函数的实现,用于开始清洁和格式化过程。第一个函数按行拆分数据框,第二个函数按列拆分数据框。之后更容易进行格式化、清理和合并这些数据框为一个。

英文:

I agree with @stefan. You actually have 4 tables, or 2 depending on how you look at it. Here is an implementation of 2 functions that start the cleaning and formatting process. The first split the dfs by row and the second function splits them by column. After that it is easier to format, clean, and merge the dfs into 1.

library(tidyverse)

df0 = tibble::tribble(~&quot;Canada &gt; London&quot;,  ~&quot;&quot;,    ~&quot;Notes&quot;,   ~&quot;United Kingdom &gt; London&quot;, ~&quot;&quot;,    ~&quot;&quot;,
                       &quot;Restaurant&quot;,   &quot;Price&quot;,    &quot;Range&quot;,    &quot;Restaurant&quot;,   &quot;Price&quot;,    &quot;Range&quot;,
                       &quot;Fried beef&quot;,   &quot;27&quot;,   &quot;25-30&quot;,    &quot;Fried beef&quot;,   &quot;29&quot;,   &quot;25 - 35&quot;,
                       &quot;Fried potato&quot;, &quot;5&quot;,    &quot;3 - 8&quot;,    &quot;Fried potato&quot;, &quot;8&quot;,    &quot;3 - 8&quot;,
                       &quot;Bar&quot;,  &quot;Price&quot;,    &quot;Range&quot;,    &quot;Price&quot;,    &quot;Range&quot;,    &quot;&quot;,
                       &quot;Beer Lager&quot;,   &quot;5&quot;,    &quot;4 - 8&quot;,    &quot;Beer Lager&quot;,   &quot;6&quot;,    &quot;4 - 8&quot;,
                       &quot;Beer Dark&quot;,    &quot;4&quot;,    &quot;3 - 7&quot;,    &quot;Beer Dark&quot;,    &quot;5&quot;,    &quot;3 - 7&quot;)


split_rows = function(df){
  
  # breaks of sub-dfs within original df
  df_breaks = df[,2] == &quot;Price&quot;
  df_breaks = (1:length(df_breaks))[df_breaks]
  df_breaks
  
  
  
  # list to populate in loop with sub-dfs
  df_list = c()
  
  for(i in 1:length(df_breaks)){
    
    # get start of sub-df
    start = df_breaks[i]
    
    # get end of sub-df
    if(i == length(df_breaks)){
      end = nrow(df) # if its the last set it to the last row of the original df
    }
    else{
      end = df_breaks[i+1]-1 # else, set it to the next start - 1
    }
    
    # subset df
    df_temp = df[start:end,]
    
    # first row as header
    colnames(df_temp) = df_temp[1,]
    df_temp = df_temp[-1,]
    
    # append to df_list
    df_list = append(df_list,list(df_temp))
  }
  
  return(df_list)
}
split_cols = function(df_list,second_df_col_start = 4){
  df_list = lapply(df_list, function(df){
    df1 = df[,1:(second_df_col_start-1)]
    df2 = df[,second_df_col_start:ncol(df)]
    return(list(df1,df2))
    })
  
  return(df_list)
}

output = split_rows(df0) %&gt;% 
  split_cols()

output:

[[1]]
[[1]][[1]]
# A tibble: 2 &#215; 3
  Restaurant   Price Range
  &lt;chr&gt;        &lt;chr&gt; &lt;chr&gt;
1 Fried beef   27    25-30
2 Fried potato 5     3 - 8

[[1]][[2]]
# A tibble: 2 &#215; 3
  Restaurant   Price Range  
  &lt;chr&gt;        &lt;chr&gt; &lt;chr&gt;  
1 Fried beef   29    25 - 35
2 Fried potato 8     3 - 8  


[[2]]
[[2]][[1]]
# A tibble: 2 &#215; 3
  Bar        Price Range
  &lt;chr&gt;      &lt;chr&gt; &lt;chr&gt;
1 Beer Lager 5     4 - 8
2 Beer Dark  4     3 - 7

[[2]][[2]]
# A tibble: 2 &#215; 3
  Price      Range ``   
  &lt;chr&gt;      &lt;chr&gt; &lt;chr&gt;
1 Beer Lager 6     4 - 8
2 Beer Dark  5     3 - 7

huangapple
  • 本文由 发表于 2023年2月6日 19:37:09
  • 转载请务必保留本文链接:https://go.coder-hub.com/75360814.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定