在R中聚合不同属性的n+m组合数据。

huangapple go评论77阅读模式
英文:

Aggregating data with n+m combinations of different attributes in R

问题

I have a data table containing information on the volumes for a varying product mix sold to different customers over several years.

在R中聚合不同属性的n+m组合数据。

The code for this Data structure is -

Data <- data.table(Region="USA", Customer=c("ABC","ABC","ABC","DEF","DEF","DEF","ABC","ABC","ABC","XYZ","XYZ"), ProductName=c("Radio123","Radio234","Radio345","Radio123","Radio234","Radio345","BB123","BB234","BB345","Radio123","BB123"),CustomerID=c("123","123","123","125","125","125","123","123","123","124","124"),'2015'=c(0,123,0,0,0,0,4,0,0,0,0),'2016'=c(0,0,0,0,0,0,10093,0,0,234,0),Class=c("R1","R1","R1","R1","R1","R1","R2","R2","R2","R1","R1"))

For the sake of simplicity, I have only kept 2015 and 2016.
As you can see, there are several different types (6) of products listed under the Product Name and the corresponding volumes sold are listed under the Year Number. There is a key to differentiate the customers and that is the Customer ID. You can also see under the Class that the different products essentially fall into two categories, i.e., R1 and R2.

What I would like to do is to aggregate the data on the Customer ID so I can get a total of the volume sold each year to every customer, i.e., have one row only for each customer.

This i can do with the following -

Data[, lapply(.SD, sum, na.rm = TRUE), .SDcols = c("2015", "2016"), by = c("Region", "Customer ID", "Customer")]

However, instead of a simple aggregate, I would like to pick a combination of 3R1+1R2 type and consider that as one unit of volume sold. Eg: If I aggregate for Customer ID = 123, I will get 127 (123 Radio234 + 4 BB123) under column 2015 and 10093 (BB123) under column 2016 etc. Instead I would like to show 1 unit as 3 Radio234 + 1 BB123 of complete pairs (4 under 2015 in this example) and in another column show the remaining incomplete pairs (111 Radio234 under 2015 in this example).

Any idea how this can be achieved in R? Should I use a simple divide function to get a count of complete groups in each Class and then a min to get the value of complete pairs?

I am unsure of how to program it but this is the logical I am thinking of -

  1. Aggregate Class based on Customer ID which gives exactly two rows per Customer, i.e., 1 for volumes sold of type R1 and one for R2.

  2. Divide R1 by 3 to get a count of complete sets that can be formed using R1 and then do a min(R1,R2) to see how many sets we can make. Eg: For 2015, we have 123/3 giving us a dividend of 41 in case of R1 and we see we have 4 R2. Then we do a min(41,4) which gives us the value of complete sets we can make using the formula (3R1+R2), thus giving us 4. We are left with 123-(3x4) = 111 R1 Class products or 37 incomplete pairs (missing R2).

Possible Output -

Data_Agg <- data.table(Region="USA", 
                       Customer=c("ABC","DEF","XYZ"),
                       CustomerID=c("123","125","124"),
                       '2015_complete'= c(4,0,0),
                       '2016_complete'=c(0,0,0),
                       '2015_incomplete'=c("37_R1",0,0),
                       '2016_incomplete'=c("10093_R2",0,"78_R1"))
> Data_Agg
#   Region Customer CustomerID 2015_complete 2016_complete 2015_incomplete 2016_incomplete
#1:    USA      ABC        123             4             0           37_R1        10093_R2
#2:    USA      DEF        125             0             0               0               0
#3:    USA      XYZ        124             0             0               0           78_R1

Or is there a more efficient way to do this? As an extension, I would also like to create an array or key value pair where I can store the Product Name composition of each row aggregate (how many Radio123 or Radio234 or BB123 make up a particular aggregated row).

Thanks for the help!

英文:

I have a data table containing information on the volumes for a varying product mix sold to different customers over several years.

在R中聚合不同属性的n+m组合数据。

The code for this Data structure is -

Data <- data.table(Region="USA", Customer=c("ABC","ABC","ABC","DEF","DEF","DEF","ABC","ABC","ABC","XYZ","XYZ"), ProductName=c("Radio123","Radio234","Radio345","Radio123","Radio234","Radio345","BB123","BB234","BB345","Radio123","BB123"),CustomerID=c("123","123","123","125","125","125","123","123","123","124","124"),'2015'=c(0,123,0,0,0,0,4,0,0,0,0),'2016'=c(0,0,0,0,0,0,10093,0,0,234,0),Class=c("R1","R1","R1","R1","R1","R1","R2","R2","R2","R1","R1"))

For the sake of simplicity, I have only kept 2015 and 2016.
As you can see, there are several different types (6) of products listed under the Product Name and the corresponding volumes sold are listed under the Year Number. There is a key to differentiate the customers and that is the Customer ID. You can also see under the Class that the different products essentially fall into two categories, i.e., R1 and R2.

What I would like to do is to aggregate the data on the Customer ID so I can get a total of the volume sold each year to every customer, i.e., have one row only for each customer.

This i can do with the following -

Data[, lapply(.SD, sum, na.rm = TRUE), .SDcols = c("2015", "2016"), by = c("Region", "Customer ID", "Customer")]

However, instead of a simple aggregate, I would like to pick a combination of 3R1+1R2 type and consider that as one unit of volume sold. Eg: If I aggregate for Customer ID = 123, I will get 127 (123 Radio234 + 4 BB123) under column 2015 and 10093 (BB123) under column 2016 etc. Instead I would like to show 1 unit as 3 Radio234 + 1 BB123 of complete pairs (4 under 2015 in this example) and in another column show the remaining incomplete pairs (111 Radio234 under 2015 in this example).

Any idea how this can be achieved in R? Should I use a simple divide function to get a count of complete groups in each Class and then a min to get the value of complete pairs?

I am unsure of how to program it but this is the logical I am thinking of -

  1. Aggregate Class based on Customer ID which gives exactly two rows per Customer, i.e., 1 for volumes sold of type R1 and one for R2.

  2. Divide R1 by 3 to get a count of complete sets that can be formed using R1 and then do a min(R1,R2) to see how many sets we can make. Eg: For 2015, we have 123/3 giving us a dividend of 41 in case of R1 and we see we have 4 R2. Then we do a min(41,4) which gives us the value of complete sets we can make using the formula (3R1+R2), thus giving us 4. We are left with 123-(3x4) = 111 R1 Class products or 37 incomplete pairs (missing R2).

Possible Output -

Data_Agg <- data.table(Region="USA", 
                       Customer=c("ABC","DEF","XYZ"),
                       CustomerID=c("123","125","124"),
                       '2015_complete'= c(4,0,0),
                       '2016_complete'=c(0,0,0),
                       '2015_incomplete'=c("37_R1",0,0),
                       '2016_incomplete'=c("10093_R2",0,"78_R1"))
> Data_Agg
#   Region Customer CustomerID 2015_complete 2016_complete 2015_incomplete 2016_incomplete
#1:    USA      ABC        123             4             0           37_R1        10093_R2
#2:    USA      DEF        125             0             0               0               0
#3:    USA      XYZ        124             0             0               0           78_R1

Or is there a more efficient way to do this? As an extension, I would also like to create an array or key value pair where I can store the Product Name composition of each row aggregate (how many Radio123 or Radio234 or BB123 make up a particular aggregated row).

Thanks for the help!

答案1

得分: 1

以下是您要翻译的内容:

"Your specific question can be solved in several ways.
However, the first thing I believe you should address is the data shape.
You have column names that contain values ("2015", "2016",...) which should be avoided if possible. I recommend you reading this article by Hadley Wickham about tidy data which is a great guidance in most of the cases. Furthermore, I think the Product Name could be divided into two variables (in case you want to check at Radios or BB-s at some point).

I address the first issue of column names that contain values using data.table's melt function. You can see the end result in the code below."

library(data.table)

Data <-
    data.table(
        Region = "USA",
        Customer = c(
            "ABC",
            "ABC",
            "ABC",
            "DEF",
            "DEF",
            "DEF",
            "ABC",
            "ABC",
            "ABC",
            "XYZ",
            "XYZ"
        ),
        ProductName = c(
            "Radio123",
            "Radio234",
            "Radio345",
            "Radio123",
            "Radio234",
            "Radio345",
            "BB123",
            "BB234",
            "BB345",
            "Radio123",
            "BB123"
        ),
        CustomerID = c(
            "123",
            "123",
            "123",
            "125",
            "125",
            "125",
            "123",
            "123",
            "123",
            "124",
            "124"
        ),
        "2015" = c(0, 123, 0, 0, 0, 0, 4, 0, 0, 0, 0),
        "2016" = c(0, 0, 0, 0, 0, 0, 10093, 0, 0, 234, 0),
        Class = c("R1", "R1", "R1", "R1", "R1", "R1", "R2", "R2", "R2", "R1", "R1")
    )
# initial dataset
head(Data)
#>    Region Customer ProductName CustomerID 2015 2016 Class
#> 1:    USA      ABC    Radio123        123    0    0    R1
#> 2:    USA      ABC    Radio234        123  123    0    R1
#> 3:    USA      ABC    Radio345        123    0    0    R1
#> 4:    USA      DEF    Radio123        125    0    0    R1
#> 5:    USA      DEF    Radio234        125    0    0    R1
#> 6:    USA      DEF    Radio345        125    0    0    R1

# function to find number only elements, obtained from:
# https://stackoverflow.com/questions/43195519/check-if-string-contains-only-numbers-or-only-characters-r/43195554
numbers_only <- function(x)
    ! grepl("\\D", x)
# Get the variables that are not years
id_vars <- names(Data)[!numbers_only(names(Data))]

tidy_data <-
    melt(
        Data,
        id.vars = id_vars,
        variable.name = "year",
        value.name = "volume"
    )
#new format
head(tidy_data)
#>    Region Customer ProductName CustomerID Class year volume
#> 1:    USA      ABC    Radio123        123    R1 2015      0
#> 2:    USA      ABC    Radio234        123    R1 2015    123
#> 3:    USA      ABC    Radio345        123    R1 2015      0
#> 4:    USA      DEF    Radio123        125    R1 2015      0
#> 5:    USA      DEF    Radio234        125    R1 2015      0
#> 6:    USA      DEF    Radio345        125    R1 2015      0

# end result
result <-
    tidy_data[, .(agg_volume = sum(ifelse(Class == "R1", volume, 3 * volume))) ,
              by = c("Region", "Customer", "ProductName", "CustomerID", "year")]
#final result
head(result)
#>    Region Customer ProductName CustomerID year agg_volume
#> 1:    USA      ABC    Radio123        123 2015          0
#> 2:    USA      ABC    Radio234        123 2015        123
#> 3:    USA      ABC    Radio345        123 2015          0
#> 4:    USA      DEF    Radio123        125 2015          0
#> 5:    USA      DEF    Radio234        125 2015          0
#> 6:    USA      DEF    Radio345        125 2015          0

Created on 2020-01-03 by the reprex package (v0.3.0)

P.S. I recommend you using the reprex package to create your examples as it will make your life much easier.

英文:

Your specific question can be solved in several ways.
However, the first thing I believe you should address is the data shape.
You have column names that contain values ("2015", "2016",...) which should be avoided if possible. I recommend you reading this article by Hadley Wickham about tidy data which is a great guidance in most of the cases. Furthermore, I think the Product Name could be divided into two variables (in case you want to check at Radios or BB-s at some point).

I address the first issue of column names that contain values using data.table's melt function. You can see the end result in the code below.

library(data.table)

Data &lt;-
   data.table(
       Region = &quot;USA&quot;,
       Customer = c(
           &quot;ABC&quot;,
           &quot;ABC&quot;,
           &quot;ABC&quot;,
           &quot;DEF&quot;,
           &quot;DEF&quot;,
           &quot;DEF&quot;,
           &quot;ABC&quot;,
           &quot;ABC&quot;,
           &quot;ABC&quot;,
           &quot;XYZ&quot;,
           &quot;XYZ&quot;
       ),
       ProductName = c(
           &quot;Radio123&quot;,
           &quot;Radio234&quot;,
           &quot;Radio345&quot;,
           &quot;Radio123&quot;,
           &quot;Radio234&quot;,
           &quot;Radio345&quot;,
           &quot;BB123&quot;,
           &quot;BB234&quot;,
           &quot;BB345&quot;,
           &quot;Radio123&quot;,
           &quot;BB123&quot;
       ),
       CustomerID = c(
           &quot;123&quot;,
           &quot;123&quot;,
           &quot;123&quot;,
           &quot;125&quot;,
           &quot;125&quot;,
           &quot;125&quot;,
           &quot;123&quot;,
           &quot;123&quot;,
           &quot;123&quot;,
           &quot;124&quot;,
           &quot;124&quot;
       ),
       &quot;2015&quot; = c(0, 123, 0, 0, 0, 0, 4, 0, 0, 0, 0),
       &quot;2016&quot; = c(0, 0, 0, 0, 0, 0, 10093, 0, 0, 234, 0),
       Class = c(&quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R2&quot;, &quot;R2&quot;, &quot;R2&quot;, &quot;R1&quot;, &quot;R1&quot;)
   )
# initial dataset
head(Data)
#&gt;    Region Customer ProductName CustomerID 2015 2016 Class
#&gt; 1:    USA      ABC    Radio123        123    0    0    R1
#&gt; 2:    USA      ABC    Radio234        123  123    0    R1
#&gt; 3:    USA      ABC    Radio345        123    0    0    R1
#&gt; 4:    USA      DEF    Radio123        125    0    0    R1
#&gt; 5:    USA      DEF    Radio234        125    0    0    R1
#&gt; 6:    USA      DEF    Radio345        125    0    0    R1

# function to find number only elements, obtained from:
# https://stackoverflow.com/questions/43195519/check-if-string-contains-only-numbers-or-only-characters-r/43195554
numbers_only &lt;- function(x)
   ! grepl(&quot;\\D&quot;, x)
# Get the variables that are not years
id_vars &lt;- names(Data)[!numbers_only(names(Data))]


tidy_data &lt;-
   melt(
       Data,
       id.vars = id_vars,
       variable.name = &quot;year&quot;,
       value.name = &quot;volume&quot;
   )
#new format
head(tidy_data)
#&gt;    Region Customer ProductName CustomerID Class year volume
#&gt; 1:    USA      ABC    Radio123        123    R1 2015      0
#&gt; 2:    USA      ABC    Radio234        123    R1 2015    123
#&gt; 3:    USA      ABC    Radio345        123    R1 2015      0
#&gt; 4:    USA      DEF    Radio123        125    R1 2015      0
#&gt; 5:    USA      DEF    Radio234        125    R1 2015      0
#&gt; 6:    USA      DEF    Radio345        125    R1 2015      0

# end result
result &lt;-
   tidy_data[, .(agg_volume = sum(ifelse(Class == &quot;R1&quot;, volume, 3 * volume))) ,
             by = c(&quot;Region&quot;, &quot;Customer&quot;, &quot;ProductName&quot;, &quot;CustomerID&quot;, &quot;year&quot;)]
#final result
head(result)
#&gt;    Region Customer ProductName CustomerID year agg_volume
#&gt; 1:    USA      ABC    Radio123        123 2015          0
#&gt; 2:    USA      ABC    Radio234        123 2015        123
#&gt; 3:    USA      ABC    Radio345        123 2015          0
#&gt; 4:    USA      DEF    Radio123        125 2015          0
#&gt; 5:    USA      DEF    Radio234        125 2015          0
#&gt; 6:    USA      DEF    Radio345        125 2015          0

<sup>Created on 2020-01-03 by the reprex package (v0.3.0)</sup>

P.S. I recommend you using the reprex package to create your examples as it will make your life much easier.

huangapple
  • 本文由 发表于 2020年1月3日 21:36:28
  • 转载请务必保留本文链接:https://go.coder-hub.com/59579558.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定