2023年2月6日 19:37:09go评论96阅读模式

英文:

Complex Long-Wide Dataset to Long Dataset in R

问题

我有一个看起来像这样的复杂数据集：

df1 <- tibble::tribble(~"Canada > London", ~"", ~"Notes", ~"United Kingdom > London", ~"", ~"",
"Restaurant", "Price", "Range", "Restaurant", "Price", "Range",
"Fried beef", "27", "25-30", "Fried beef", "29", "25 - 35",
"Fried potato", "5", "3 - 8", "Fried potato", "8", "3 - 8",
"Bar", "Price", "Range", "Price", "Range", "",
"Beer Lager", "5", "4 - 8", "Beer Lager", "6", "4 - 8",
"Beer Dark", "4", "3 - 7", "Beer Dark", "5", "3 - 7")

或者，以可视化形式：

它在参数方面很长（如"Beer Lager"、"Beer Dark"等），并且在数据输入方面很宽（像"Canada > London"或"United Kingdom > London"这样的多个宽元素）。

期望的输出应该是两个数据集，看起来像这样：

第一个数据集（数值）：

第二个数据集（范围）：

任何建议将不胜感激

英文:

I have a complex dataset that looks like this:

df1 &lt;- tibble::tribble(~&quot;Canada &gt; London&quot;,	~&quot;&quot;,	~&quot;Notes&quot;,	~&quot;United Kingdom &gt; London&quot;,	~&quot;&quot;,	~&quot;&quot;,
&quot;Restaurant&quot;,	&quot;Price&quot;,	&quot;Range&quot;,	&quot;Restaurant&quot;,	&quot;Price&quot;,	&quot;Range&quot;,
&quot;Fried beef&quot;,	&quot;27&quot;,	&quot;25-30&quot;,	&quot;Fried beef&quot;,	&quot;29&quot;,	&quot;25 - 35&quot;,
&quot;Fried potato&quot;,	&quot;5&quot;,	&quot;3 - 8&quot;,	&quot;Fried potato&quot;,	&quot;8&quot;,	&quot;3 - 8&quot;,
&quot;Bar&quot;,	&quot;Price&quot;,	&quot;Range&quot;,	&quot;Price&quot;,	&quot;Range&quot;,	&quot;&quot;,
&quot;Beer Lager&quot;,	&quot;5&quot;,	&quot;4 - 8&quot;,	&quot;Beer Lager&quot;,	&quot;6&quot;,	&quot;4 - 8&quot;,
&quot;Beer Dark&quot;,	&quot;4&quot;,	&quot;3 - 7&quot;,	&quot;Beer Dark&quot;,	&quot;5&quot;,	&quot;3 - 7&quot;)

Or, for visual representation:

It is long in parameters (like Beer Lager, Beer Dark, ....) and wide by the data input (many wide elements like Canada > London, or United Kingdom > London).

The desired output would be two datasets that should look like this:

The first dataset (the Values):

The second dataset (the Ranges):

Any suggestions would be much appreciated

答案1

得分: 2

你的数据既不宽也不长，而是一个杂乱的数据表，需要进行一些清理才能将其转换为整洁的数据。之后，你可以使用 tidyr::pivot_wider 来获取你想要的表格：

library(dplyr)
library(tidyr)
library(purrr)
tidy_data <- function(.data, cols) {
  .data <- .data[cols]
  place <- names(.data)[[1]]
  
  .data %>%
    rename(product = 1, price = 2, range = 3) %>%
    filter(!price %in% c("Price", "Range")) %>%
    mutate(place = place)
}
df1_tidy <- purrr::map_dfr(list(1:3, 4:6), tidy_data, .data = df1)
df1_tidy %>%
  select(place, product, price) %>%
  pivot_wider(names_from = product, values_from = price)
#> # A tibble: 2 × 5
#>   place                   `Fried beef` `Fried potato` `Beer Lager` `Beer Dark`
#>   <chr>                   <chr>        <chr>          <chr>        <chr>      
#> 1 Canada > London         27           5              5            4          
#> 2 United Kingdom > London 29           8              6            5
df1_tidy %>%
  select(place, product, range) %>%
  pivot_wider(names_from = product, values_from = range, names_glue = "{product} Range")
#> # A tibble: 2 × 5
#>   place                   `Fried beef Range` Fried potato Range `Beer Lager Range` `Beer Dark Range`
#>   <chr>                   <chr>              <chr>               <chr>              <chr>            
#> 1 Canada > London         25-30              3 - 8               4 - 8              3 - 7            
#> 2 United Kingdom > London 25 - 35            3 - 8               4 - 8              3 - 7

请注意，代码部分未被翻译。

英文:

Your data is neither wide nor long but is a messy data table which needs some cleaning to convert it to tidy data. Afterwards you could get your desired tables using tidyr::pivot_wider:

library(dplyr)
library(tidyr)
library(purrr)
tidy_data &lt;- function(.data, cols) {
  .data &lt;- .data[cols]
  place &lt;- names(.data)[[1]]
  
  .data |&gt; 
    rename(product = 1, price = 2, range = 3) |&gt; 
    filter(!price %in% c(&quot;Price&quot;, &quot;Range&quot;)) |&gt;
    mutate(place = place)
}
df1_tidy &lt;- purrr::map_dfr(list(1:3, 4:6), tidy_data, .data = df1)
df1_tidy |&gt; 
  select(place, product, price) |&gt; 
  pivot_wider(names_from = product, values_from = price)
#&gt; # A tibble: 2 &#215; 5
#&gt;   place                   `Fried beef` `Fried potato` `Beer Lager` `Beer Dark`
#&gt;   &lt;chr&gt;                   &lt;chr&gt;        &lt;chr&gt;          &lt;chr&gt;        &lt;chr&gt;      
#&gt; 1 Canada &gt; London         27           5              5            4          
#&gt; 2 United Kingdom &gt; London 29           8              6            5
df1_tidy |&gt; 
  select(place, product, range) |&gt; 
  pivot_wider(names_from = product, values_from = range, names_glue = &quot;{product} Range&quot;)
#&gt; # A tibble: 2 &#215; 5
#&gt;   place                   `Fried beef Range` Fried potato Rang…&#185; Beer …&#178; Beer …&#179;
#&gt;   &lt;chr&gt;                   &lt;chr&gt;              &lt;chr&gt;               &lt;chr&gt;   &lt;chr&gt;  
#&gt; 1 Canada &gt; London         25-30              3 - 8               4 - 8   3 - 7  
#&gt; 2 United Kingdom &gt; London 25 - 35            3 - 8               4 - 8   3 - 7  
#&gt; # … with abbreviated variable names &#185;`Fried potato Range`, &#178;`Beer Lager Range`,
#&gt; #   &#179;`Beer Dark Range`

答案2

得分: 1

我同意@stefan。实际上，你有4个表格，或者看待方式不同，也可以视为2个表格。下面是两个函数的实现，用于开始清洁和格式化过程。第一个函数按行拆分数据框，第二个函数按列拆分数据框。之后更容易进行格式化、清理和合并这些数据框为一个。

英文:

I agree with @stefan. You actually have 4 tables, or 2 depending on how you look at it. Here is an implementation of 2 functions that start the cleaning and formatting process. The first split the dfs by row and the second function splits them by column. After that it is easier to format, clean, and merge the dfs into 1.

library(tidyverse)
df0 = tibble::tribble(~&quot;Canada &gt; London&quot;,  ~&quot;&quot;,    ~&quot;Notes&quot;,   ~&quot;United Kingdom &gt; London&quot;, ~&quot;&quot;,    ~&quot;&quot;,
                       &quot;Restaurant&quot;,   &quot;Price&quot;,    &quot;Range&quot;,    &quot;Restaurant&quot;,   &quot;Price&quot;,    &quot;Range&quot;,
                       &quot;Fried beef&quot;,   &quot;27&quot;,   &quot;25-30&quot;,    &quot;Fried beef&quot;,   &quot;29&quot;,   &quot;25 - 35&quot;,
                       &quot;Fried potato&quot;, &quot;5&quot;,    &quot;3 - 8&quot;,    &quot;Fried potato&quot;, &quot;8&quot;,    &quot;3 - 8&quot;,
                       &quot;Bar&quot;,  &quot;Price&quot;,    &quot;Range&quot;,    &quot;Price&quot;,    &quot;Range&quot;,    &quot;&quot;,
                       &quot;Beer Lager&quot;,   &quot;5&quot;,    &quot;4 - 8&quot;,    &quot;Beer Lager&quot;,   &quot;6&quot;,    &quot;4 - 8&quot;,
                       &quot;Beer Dark&quot;,    &quot;4&quot;,    &quot;3 - 7&quot;,    &quot;Beer Dark&quot;,    &quot;5&quot;,    &quot;3 - 7&quot;)
split_rows = function(df){
  
  # breaks of sub-dfs within original df
  df_breaks = df[,2] == &quot;Price&quot;
  df_breaks = (1:length(df_breaks))[df_breaks]
  df_breaks
  
  
  
  # list to populate in loop with sub-dfs
  df_list = c()
  
  for(i in 1:length(df_breaks)){
    
    # get start of sub-df
    start = df_breaks[i]
    
    # get end of sub-df
    if(i == length(df_breaks)){
      end = nrow(df) # if its the last set it to the last row of the original df
    }
    else{
      end = df_breaks[i+1]-1 # else, set it to the next start - 1
    }
    
    # subset df
    df_temp = df[start:end,]
    
    # first row as header
    colnames(df_temp) = df_temp[1,]
    df_temp = df_temp[-1,]
    
    # append to df_list
    df_list = append(df_list,list(df_temp))
  }
  
  return(df_list)
}
split_cols = function(df_list,second_df_col_start = 4){
  df_list = lapply(df_list, function(df){
    df1 = df[,1:(second_df_col_start-1)]
    df2 = df[,second_df_col_start:ncol(df)]
    return(list(df1,df2))
    })
  
  return(df_list)
}
output = split_rows(df0) %&gt;% 
  split_cols()

output:

[[1]]
[[1]][[1]]
# A tibble: 2 &#215; 3
  Restaurant   Price Range
  &lt;chr&gt;        &lt;chr&gt; &lt;chr&gt;
1 Fried beef   27    25-30
2 Fried potato 5     3 - 8
[[1]][[2]]
# A tibble: 2 &#215; 3
  Restaurant   Price Range  
  &lt;chr&gt;        &lt;chr&gt; &lt;chr&gt;  
1 Fried beef   29    25 - 35
2 Fried potato 8     3 - 8  
[[2]]
[[2]][[1]]
# A tibble: 2 &#215; 3
  Bar        Price Range
  &lt;chr&gt;      &lt;chr&gt; &lt;chr&gt;
1 Beer Lager 5     4 - 8
2 Beer Dark  4     3 - 7
[[2]][[2]]
# A tibble: 2 &#215; 3
  Price      Range ``   
  &lt;chr&gt;      &lt;chr&gt; &lt;chr&gt;
1 Beer Lager 6     4 - 8
2 Beer Dark  5     3 - 7

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

将复杂的长宽数据集转换为R中的长数据集

问题

答案1

答案2

从字符串中删除ASCII控制字符

R – 多重嵌套的for循环未完全填充数据框。

如何在华夫饼图符号上显示国旗？（使用，例如，`ggwaffle`）

如何知道一列是否是另一列的倍数？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。