2020年1月3日 21:36:28go评论108阅读模式

英文:

Aggregating data with n+m combinations of different attributes in R

问题

I have a data table containing information on the volumes for a varying product mix sold to different customers over several years.

The code for this Data structure is -

Data &lt;- data.table(Region=&quot;USA&quot;, Customer=c(&quot;ABC&quot;,&quot;ABC&quot;,&quot;ABC&quot;,&quot;DEF&quot;,&quot;DEF&quot;,&quot;DEF&quot;,&quot;ABC&quot;,&quot;ABC&quot;,&quot;ABC&quot;,&quot;XYZ&quot;,&quot;XYZ&quot;), ProductName=c(&quot;Radio123&quot;,&quot;Radio234&quot;,&quot;Radio345&quot;,&quot;Radio123&quot;,&quot;Radio234&quot;,&quot;Radio345&quot;,&quot;BB123&quot;,&quot;BB234&quot;,&quot;BB345&quot;,&quot;Radio123&quot;,&quot;BB123&quot;),CustomerID=c(&quot;123&quot;,&quot;123&quot;,&quot;123&quot;,&quot;125&quot;,&quot;125&quot;,&quot;125&quot;,&quot;123&quot;,&quot;123&quot;,&quot;123&quot;,&quot;124&quot;,&quot;124&quot;),&#39;2015&#39;=c(0,123,0,0,0,0,4,0,0,0,0),&#39;2016&#39;=c(0,0,0,0,0,0,10093,0,0,234,0),Class=c(&quot;R1&quot;,&quot;R1&quot;,&quot;R1&quot;,&quot;R1&quot;,&quot;R1&quot;,&quot;R1&quot;,&quot;R2&quot;,&quot;R2&quot;,&quot;R2&quot;,&quot;R1&quot;,&quot;R1&quot;))

For the sake of simplicity, I have only kept 2015 and 2016.
As you can see, there are several different types (6) of products listed under the Product Name and the corresponding volumes sold are listed under the Year Number. There is a key to differentiate the customers and that is the Customer ID. You can also see under the Class that the different products essentially fall into two categories, i.e., R1 and R2.

What I would like to do is to aggregate the data on the Customer ID so I can get a total of the volume sold each year to every customer, i.e., have one row only for each customer.

This i can do with the following -

Data[, lapply(.SD, sum, na.rm = TRUE), .SDcols = c(&quot;2015&quot;, &quot;2016&quot;), by = c(&quot;Region&quot;, &quot;Customer ID&quot;, &quot;Customer&quot;)]

However, instead of a simple aggregate, I would like to pick a combination of 3R1+1R2 type and consider that as one unit of volume sold. Eg: If I aggregate for Customer ID = 123, I will get 127 (123 Radio234 + 4 BB123) under column 2015 and 10093 (BB123) under column 2016 etc. Instead I would like to show 1 unit as 3 Radio234 + 1 BB123 of complete pairs (4 under 2015 in this example) and in another column show the remaining incomplete pairs (111 Radio234 under 2015 in this example).

Any idea how this can be achieved in R? Should I use a simple divide function to get a count of complete groups in each Class and then a min to get the value of complete pairs?

I am unsure of how to program it but this is the logical I am thinking of -

Aggregate Class based on Customer ID which gives exactly two rows per Customer, i.e., 1 for volumes sold of type R1 and one for R2.
Divide R1 by 3 to get a count of complete sets that can be formed using R1 and then do a min(R1,R2) to see how many sets we can make. Eg: For 2015, we have 123/3 giving us a dividend of 41 in case of R1 and we see we have 4 R2. Then we do a min(41,4) which gives us the value of complete sets we can make using the formula (3R1+R2), thus giving us 4. We are left with 123-(3x4) = 111 R1 Class products or 37 incomplete pairs (missing R2).

Possible Output -

Data_Agg &lt;- data.table(Region=&quot;USA&quot;, 
                       Customer=c(&quot;ABC&quot;,&quot;DEF&quot;,&quot;XYZ&quot;),
                       CustomerID=c(&quot;123&quot;,&quot;125&quot;,&quot;124&quot;),
                       &#39;2015_complete&#39;= c(4,0,0),
                       &#39;2016_complete&#39;=c(0,0,0),
                       &#39;2015_incomplete&#39;=c(&quot;37_R1&quot;,0,0),
                       &#39;2016_incomplete&#39;=c(&quot;10093_R2&quot;,0,&quot;78_R1&quot;))
&gt; Data_Agg
#   Region Customer CustomerID 2015_complete 2016_complete 2015_incomplete 2016_incomplete
#1:    USA      ABC        123             4             0           37_R1        10093_R2
#2:    USA      DEF        125             0             0               0               0
#3:    USA      XYZ        124             0             0               0           78_R1

Or is there a more efficient way to do this? As an extension, I would also like to create an array or key value pair where I can store the Product Name composition of each row aggregate (how many Radio123 or Radio234 or BB123 make up a particular aggregated row).

Thanks for the help!

英文:

I have a data table containing information on the volumes for a varying product mix sold to different customers over several years.

The code for this Data structure is -

Data &lt;- data.table(Region=&quot;USA&quot;, Customer=c(&quot;ABC&quot;,&quot;ABC&quot;,&quot;ABC&quot;,&quot;DEF&quot;,&quot;DEF&quot;,&quot;DEF&quot;,&quot;ABC&quot;,&quot;ABC&quot;,&quot;ABC&quot;,&quot;XYZ&quot;,&quot;XYZ&quot;), ProductName=c(&quot;Radio123&quot;,&quot;Radio234&quot;,&quot;Radio345&quot;,&quot;Radio123&quot;,&quot;Radio234&quot;,&quot;Radio345&quot;,&quot;BB123&quot;,&quot;BB234&quot;,&quot;BB345&quot;,&quot;Radio123&quot;,&quot;BB123&quot;),CustomerID=c(&quot;123&quot;,&quot;123&quot;,&quot;123&quot;,&quot;125&quot;,&quot;125&quot;,&quot;125&quot;,&quot;123&quot;,&quot;123&quot;,&quot;123&quot;,&quot;124&quot;,&quot;124&quot;),&#39;2015&#39;=c(0,123,0,0,0,0,4,0,0,0,0),&#39;2016&#39;=c(0,0,0,0,0,0,10093,0,0,234,0),Class=c(&quot;R1&quot;,&quot;R1&quot;,&quot;R1&quot;,&quot;R1&quot;,&quot;R1&quot;,&quot;R1&quot;,&quot;R2&quot;,&quot;R2&quot;,&quot;R2&quot;,&quot;R1&quot;,&quot;R1&quot;))

What I would like to do is to aggregate the data on the Customer ID so I can get a total of the volume sold each year to every customer, i.e., have one row only for each customer.

This i can do with the following -

Data[, lapply(.SD, sum, na.rm = TRUE), .SDcols = c(&quot;2015&quot;, &quot;2016&quot;), by = c(&quot;Region&quot;, &quot;Customer ID&quot;, &quot;Customer&quot;)]

Any idea how this can be achieved in R? Should I use a simple divide function to get a count of complete groups in each Class and then a min to get the value of complete pairs?

I am unsure of how to program it but this is the logical I am thinking of -

Aggregate Class based on Customer ID which gives exactly two rows per Customer, i.e., 1 for volumes sold of type R1 and one for R2.
Divide R1 by 3 to get a count of complete sets that can be formed using R1 and then do a min(R1,R2) to see how many sets we can make. Eg: For 2015, we have 123/3 giving us a dividend of 41 in case of R1 and we see we have 4 R2. Then we do a min(41,4) which gives us the value of complete sets we can make using the formula (3R1+R2), thus giving us 4. We are left with 123-(3x4) = 111 R1 Class products or 37 incomplete pairs (missing R2).

Possible Output -

Data_Agg &lt;- data.table(Region=&quot;USA&quot;, 
                       Customer=c(&quot;ABC&quot;,&quot;DEF&quot;,&quot;XYZ&quot;),
                       CustomerID=c(&quot;123&quot;,&quot;125&quot;,&quot;124&quot;),
                       &#39;2015_complete&#39;= c(4,0,0),
                       &#39;2016_complete&#39;=c(0,0,0),
                       &#39;2015_incomplete&#39;=c(&quot;37_R1&quot;,0,0),
                       &#39;2016_incomplete&#39;=c(&quot;10093_R2&quot;,0,&quot;78_R1&quot;))
&gt; Data_Agg
#   Region Customer CustomerID 2015_complete 2016_complete 2015_incomplete 2016_incomplete
#1:    USA      ABC        123             4             0           37_R1        10093_R2
#2:    USA      DEF        125             0             0               0               0
#3:    USA      XYZ        124             0             0               0           78_R1

Thanks for the help!

答案1

得分: 1

以下是您要翻译的内容：

"Your specific question can be solved in several ways.
However, the first thing I believe you should address is the data shape.
You have column names that contain values ("2015", "2016",...) which should be avoided if possible. I recommend you reading this article by Hadley Wickham about tidy data which is a great guidance in most of the cases. Furthermore, I think the Product Name could be divided into two variables (in case you want to check at Radios or BB-s at some point).

I address the first issue of column names that contain values using data.table's melt function. You can see the end result in the code below."

library(data.table)
Data <-
    data.table(
        Region = "USA",
        Customer = c(
            "ABC",
            "ABC",
            "ABC",
            "DEF",
            "DEF",
            "DEF",
            "ABC",
            "ABC",
            "ABC",
            "XYZ",
            "XYZ"
        ),
        ProductName = c(
            "Radio123",
            "Radio234",
            "Radio345",
            "Radio123",
            "Radio234",
            "Radio345",
            "BB123",
            "BB234",
            "BB345",
            "Radio123",
            "BB123"
        ),
        CustomerID = c(
            "123",
            "123",
            "123",
            "125",
            "125",
            "125",
            "123",
            "123",
            "123",
            "124",
            "124"
        ),
        "2015" = c(0, 123, 0, 0, 0, 0, 4, 0, 0, 0, 0),
        "2016" = c(0, 0, 0, 0, 0, 0, 10093, 0, 0, 234, 0),
        Class = c("R1", "R1", "R1", "R1", "R1", "R1", "R2", "R2", "R2", "R1", "R1")
    )
# initial dataset
head(Data)
#>    Region Customer ProductName CustomerID 2015 2016 Class
#> 1:    USA      ABC    Radio123        123    0    0    R1
#> 2:    USA      ABC    Radio234        123  123    0    R1
#> 3:    USA      ABC    Radio345        123    0    0    R1
#> 4:    USA      DEF    Radio123        125    0    0    R1
#> 5:    USA      DEF    Radio234        125    0    0    R1
#> 6:    USA      DEF    Radio345        125    0    0    R1
# function to find number only elements, obtained from:
# https://stackoverflow.com/questions/43195519/check-if-string-contains-only-numbers-or-only-characters-r/43195554
numbers_only <- function(x)
    ! grepl("\\D", x)
# Get the variables that are not years
id_vars <- names(Data)[!numbers_only(names(Data))]
tidy_data <-
    melt(
        Data,
        id.vars = id_vars,
        variable.name = "year",
        value.name = "volume"
    )
#new format
head(tidy_data)
#>    Region Customer ProductName CustomerID Class year volume
#> 1:    USA      ABC    Radio123        123    R1 2015      0
#> 2:    USA      ABC    Radio234        123    R1 2015    123
#> 3:    USA      ABC    Radio345        123    R1 2015      0
#> 4:    USA      DEF    Radio123        125    R1 2015      0
#> 5:    USA      DEF    Radio234        125    R1 2015      0
#> 6:    USA      DEF    Radio345        125    R1 2015      0
# end result
result <-
    tidy_data[, .(agg_volume = sum(ifelse(Class == "R1", volume, 3 * volume))) ,
              by = c("Region", "Customer", "ProductName", "CustomerID", "year")]
#final result
head(result)
#>    Region Customer ProductName CustomerID year agg_volume
#> 1:    USA      ABC    Radio123        123 2015          0
#> 2:    USA      ABC    Radio234        123 2015        123
#> 3:    USA      ABC    Radio345        123 2015          0
#> 4:    USA      DEF    Radio123        125 2015          0
#> 5:    USA      DEF    Radio234        125 2015          0
#> 6:    USA      DEF    Radio345        125 2015          0

^{Created on 2020-01-03 by the reprex package (v0.3.0)}

P.S. I recommend you using the reprex package to create your examples as it will make your life much easier.

英文:

Your specific question can be solved in several ways.
However, the first thing I believe you should address is the data shape.
You have column names that contain values ("2015", "2016",...) which should be avoided if possible. I recommend you reading this article by Hadley Wickham about tidy data which is a great guidance in most of the cases. Furthermore, I think the Product Name could be divided into two variables (in case you want to check at Radios or BB-s at some point).

I address the first issue of column names that contain values using data.table's melt function. You can see the end result in the code below.

library(data.table)
Data &lt;-
   data.table(
       Region = &quot;USA&quot;,
       Customer = c(
           &quot;ABC&quot;,
           &quot;ABC&quot;,
           &quot;ABC&quot;,
           &quot;DEF&quot;,
           &quot;DEF&quot;,
           &quot;DEF&quot;,
           &quot;ABC&quot;,
           &quot;ABC&quot;,
           &quot;ABC&quot;,
           &quot;XYZ&quot;,
           &quot;XYZ&quot;
       ),
       ProductName = c(
           &quot;Radio123&quot;,
           &quot;Radio234&quot;,
           &quot;Radio345&quot;,
           &quot;Radio123&quot;,
           &quot;Radio234&quot;,
           &quot;Radio345&quot;,
           &quot;BB123&quot;,
           &quot;BB234&quot;,
           &quot;BB345&quot;,
           &quot;Radio123&quot;,
           &quot;BB123&quot;
       ),
       CustomerID = c(
           &quot;123&quot;,
           &quot;123&quot;,
           &quot;123&quot;,
           &quot;125&quot;,
           &quot;125&quot;,
           &quot;125&quot;,
           &quot;123&quot;,
           &quot;123&quot;,
           &quot;123&quot;,
           &quot;124&quot;,
           &quot;124&quot;
       ),
       &quot;2015&quot; = c(0, 123, 0, 0, 0, 0, 4, 0, 0, 0, 0),
       &quot;2016&quot; = c(0, 0, 0, 0, 0, 0, 10093, 0, 0, 234, 0),
       Class = c(&quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R2&quot;, &quot;R2&quot;, &quot;R2&quot;, &quot;R1&quot;, &quot;R1&quot;)
   )
# initial dataset
head(Data)
#&gt;    Region Customer ProductName CustomerID 2015 2016 Class
#&gt; 1:    USA      ABC    Radio123        123    0    0    R1
#&gt; 2:    USA      ABC    Radio234        123  123    0    R1
#&gt; 3:    USA      ABC    Radio345        123    0    0    R1
#&gt; 4:    USA      DEF    Radio123        125    0    0    R1
#&gt; 5:    USA      DEF    Radio234        125    0    0    R1
#&gt; 6:    USA      DEF    Radio345        125    0    0    R1
# function to find number only elements, obtained from:
# https://stackoverflow.com/questions/43195519/check-if-string-contains-only-numbers-or-only-characters-r/43195554
numbers_only &lt;- function(x)
   ! grepl(&quot;\\D&quot;, x)
# Get the variables that are not years
id_vars &lt;- names(Data)[!numbers_only(names(Data))]
tidy_data &lt;-
   melt(
       Data,
       id.vars = id_vars,
       variable.name = &quot;year&quot;,
       value.name = &quot;volume&quot;
   )
#new format
head(tidy_data)
#&gt;    Region Customer ProductName CustomerID Class year volume
#&gt; 1:    USA      ABC    Radio123        123    R1 2015      0
#&gt; 2:    USA      ABC    Radio234        123    R1 2015    123
#&gt; 3:    USA      ABC    Radio345        123    R1 2015      0
#&gt; 4:    USA      DEF    Radio123        125    R1 2015      0
#&gt; 5:    USA      DEF    Radio234        125    R1 2015      0
#&gt; 6:    USA      DEF    Radio345        125    R1 2015      0
# end result
result &lt;-
   tidy_data[, .(agg_volume = sum(ifelse(Class == &quot;R1&quot;, volume, 3 * volume))) ,
             by = c(&quot;Region&quot;, &quot;Customer&quot;, &quot;ProductName&quot;, &quot;CustomerID&quot;, &quot;year&quot;)]
#final result
head(result)
#&gt;    Region Customer ProductName CustomerID year agg_volume
#&gt; 1:    USA      ABC    Radio123        123 2015          0
#&gt; 2:    USA      ABC    Radio234        123 2015        123
#&gt; 3:    USA      ABC    Radio345        123 2015          0
#&gt; 4:    USA      DEF    Radio123        125 2015          0
#&gt; 5:    USA      DEF    Radio234        125 2015          0
#&gt; 6:    USA      DEF    Radio345        125 2015          0

<sup>Created on 2020-01-03 by the reprex package (v0.3.0)</sup>

P.S. I recommend you using the reprex package to create your examples as it will make your life much easier.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在R中聚合不同属性的n+m组合数据。

问题

答案1

合并具有相同结构的列表元素。

空间信息聚合

如何在R中有多个“变化”变量时重新塑造为长格式？

如何在R中复制生存分析并获得与Stata中获得的完全相同的标准误差？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。