在R中聚合不同属性的n+m组合数据。

huangapple go评论108阅读模式
英文:

Aggregating data with n+m combinations of different attributes in R

问题

I have a data table containing information on the volumes for a varying product mix sold to different customers over several years.

在R中聚合不同属性的n+m组合数据。

The code for this Data structure is -

  1. Data <- data.table(Region="USA", Customer=c("ABC","ABC","ABC","DEF","DEF","DEF","ABC","ABC","ABC","XYZ","XYZ"), ProductName=c("Radio123","Radio234","Radio345","Radio123","Radio234","Radio345","BB123","BB234","BB345","Radio123","BB123"),CustomerID=c("123","123","123","125","125","125","123","123","123","124","124"),'2015'=c(0,123,0,0,0,0,4,0,0,0,0),'2016'=c(0,0,0,0,0,0,10093,0,0,234,0),Class=c("R1","R1","R1","R1","R1","R1","R2","R2","R2","R1","R1"))

For the sake of simplicity, I have only kept 2015 and 2016.
As you can see, there are several different types (6) of products listed under the Product Name and the corresponding volumes sold are listed under the Year Number. There is a key to differentiate the customers and that is the Customer ID. You can also see under the Class that the different products essentially fall into two categories, i.e., R1 and R2.

What I would like to do is to aggregate the data on the Customer ID so I can get a total of the volume sold each year to every customer, i.e., have one row only for each customer.

This i can do with the following -

  1. Data[, lapply(.SD, sum, na.rm = TRUE), .SDcols = c("2015", "2016"), by = c("Region", "Customer ID", "Customer")]

However, instead of a simple aggregate, I would like to pick a combination of 3R1+1R2 type and consider that as one unit of volume sold. Eg: If I aggregate for Customer ID = 123, I will get 127 (123 Radio234 + 4 BB123) under column 2015 and 10093 (BB123) under column 2016 etc. Instead I would like to show 1 unit as 3 Radio234 + 1 BB123 of complete pairs (4 under 2015 in this example) and in another column show the remaining incomplete pairs (111 Radio234 under 2015 in this example).

Any idea how this can be achieved in R? Should I use a simple divide function to get a count of complete groups in each Class and then a min to get the value of complete pairs?

I am unsure of how to program it but this is the logical I am thinking of -

  1. Aggregate Class based on Customer ID which gives exactly two rows per Customer, i.e., 1 for volumes sold of type R1 and one for R2.

  2. Divide R1 by 3 to get a count of complete sets that can be formed using R1 and then do a min(R1,R2) to see how many sets we can make. Eg: For 2015, we have 123/3 giving us a dividend of 41 in case of R1 and we see we have 4 R2. Then we do a min(41,4) which gives us the value of complete sets we can make using the formula (3R1+R2), thus giving us 4. We are left with 123-(3x4) = 111 R1 Class products or 37 incomplete pairs (missing R2).

Possible Output -

  1. Data_Agg <- data.table(Region="USA",
  2. Customer=c("ABC","DEF","XYZ"),
  3. CustomerID=c("123","125","124"),
  4. '2015_complete'= c(4,0,0),
  5. '2016_complete'=c(0,0,0),
  6. '2015_incomplete'=c("37_R1",0,0),
  7. '2016_incomplete'=c("10093_R2",0,"78_R1"))
  8. > Data_Agg
  9. # Region Customer CustomerID 2015_complete 2016_complete 2015_incomplete 2016_incomplete
  10. #1: USA ABC 123 4 0 37_R1 10093_R2
  11. #2: USA DEF 125 0 0 0 0
  12. #3: USA XYZ 124 0 0 0 78_R1

Or is there a more efficient way to do this? As an extension, I would also like to create an array or key value pair where I can store the Product Name composition of each row aggregate (how many Radio123 or Radio234 or BB123 make up a particular aggregated row).

Thanks for the help!

英文:

I have a data table containing information on the volumes for a varying product mix sold to different customers over several years.

在R中聚合不同属性的n+m组合数据。

The code for this Data structure is -

  1. Data <- data.table(Region="USA", Customer=c("ABC","ABC","ABC","DEF","DEF","DEF","ABC","ABC","ABC","XYZ","XYZ"), ProductName=c("Radio123","Radio234","Radio345","Radio123","Radio234","Radio345","BB123","BB234","BB345","Radio123","BB123"),CustomerID=c("123","123","123","125","125","125","123","123","123","124","124"),'2015'=c(0,123,0,0,0,0,4,0,0,0,0),'2016'=c(0,0,0,0,0,0,10093,0,0,234,0),Class=c("R1","R1","R1","R1","R1","R1","R2","R2","R2","R1","R1"))

For the sake of simplicity, I have only kept 2015 and 2016.
As you can see, there are several different types (6) of products listed under the Product Name and the corresponding volumes sold are listed under the Year Number. There is a key to differentiate the customers and that is the Customer ID. You can also see under the Class that the different products essentially fall into two categories, i.e., R1 and R2.

What I would like to do is to aggregate the data on the Customer ID so I can get a total of the volume sold each year to every customer, i.e., have one row only for each customer.

This i can do with the following -

  1. Data[, lapply(.SD, sum, na.rm = TRUE), .SDcols = c("2015", "2016"), by = c("Region", "Customer ID", "Customer")]

However, instead of a simple aggregate, I would like to pick a combination of 3R1+1R2 type and consider that as one unit of volume sold. Eg: If I aggregate for Customer ID = 123, I will get 127 (123 Radio234 + 4 BB123) under column 2015 and 10093 (BB123) under column 2016 etc. Instead I would like to show 1 unit as 3 Radio234 + 1 BB123 of complete pairs (4 under 2015 in this example) and in another column show the remaining incomplete pairs (111 Radio234 under 2015 in this example).

Any idea how this can be achieved in R? Should I use a simple divide function to get a count of complete groups in each Class and then a min to get the value of complete pairs?

I am unsure of how to program it but this is the logical I am thinking of -

  1. Aggregate Class based on Customer ID which gives exactly two rows per Customer, i.e., 1 for volumes sold of type R1 and one for R2.

  2. Divide R1 by 3 to get a count of complete sets that can be formed using R1 and then do a min(R1,R2) to see how many sets we can make. Eg: For 2015, we have 123/3 giving us a dividend of 41 in case of R1 and we see we have 4 R2. Then we do a min(41,4) which gives us the value of complete sets we can make using the formula (3R1+R2), thus giving us 4. We are left with 123-(3x4) = 111 R1 Class products or 37 incomplete pairs (missing R2).

Possible Output -

  1. Data_Agg <- data.table(Region="USA",
  2. Customer=c("ABC","DEF","XYZ"),
  3. CustomerID=c("123","125","124"),
  4. '2015_complete'= c(4,0,0),
  5. '2016_complete'=c(0,0,0),
  6. '2015_incomplete'=c("37_R1",0,0),
  7. '2016_incomplete'=c("10093_R2",0,"78_R1"))
  8. > Data_Agg
  9. # Region Customer CustomerID 2015_complete 2016_complete 2015_incomplete 2016_incomplete
  10. #1: USA ABC 123 4 0 37_R1 10093_R2
  11. #2: USA DEF 125 0 0 0 0
  12. #3: USA XYZ 124 0 0 0 78_R1

Or is there a more efficient way to do this? As an extension, I would also like to create an array or key value pair where I can store the Product Name composition of each row aggregate (how many Radio123 or Radio234 or BB123 make up a particular aggregated row).

Thanks for the help!

答案1

得分: 1

以下是您要翻译的内容:

"Your specific question can be solved in several ways.
However, the first thing I believe you should address is the data shape.
You have column names that contain values ("2015", "2016",...) which should be avoided if possible. I recommend you reading this article by Hadley Wickham about tidy data which is a great guidance in most of the cases. Furthermore, I think the Product Name could be divided into two variables (in case you want to check at Radios or BB-s at some point).

I address the first issue of column names that contain values using data.table's melt function. You can see the end result in the code below."

  1. library(data.table)
  2. Data <-
  3. data.table(
  4. Region = "USA",
  5. Customer = c(
  6. "ABC",
  7. "ABC",
  8. "ABC",
  9. "DEF",
  10. "DEF",
  11. "DEF",
  12. "ABC",
  13. "ABC",
  14. "ABC",
  15. "XYZ",
  16. "XYZ"
  17. ),
  18. ProductName = c(
  19. "Radio123",
  20. "Radio234",
  21. "Radio345",
  22. "Radio123",
  23. "Radio234",
  24. "Radio345",
  25. "BB123",
  26. "BB234",
  27. "BB345",
  28. "Radio123",
  29. "BB123"
  30. ),
  31. CustomerID = c(
  32. "123",
  33. "123",
  34. "123",
  35. "125",
  36. "125",
  37. "125",
  38. "123",
  39. "123",
  40. "123",
  41. "124",
  42. "124"
  43. ),
  44. "2015" = c(0, 123, 0, 0, 0, 0, 4, 0, 0, 0, 0),
  45. "2016" = c(0, 0, 0, 0, 0, 0, 10093, 0, 0, 234, 0),
  46. Class = c("R1", "R1", "R1", "R1", "R1", "R1", "R2", "R2", "R2", "R1", "R1")
  47. )
  48. # initial dataset
  49. head(Data)
  50. #> Region Customer ProductName CustomerID 2015 2016 Class
  51. #> 1: USA ABC Radio123 123 0 0 R1
  52. #> 2: USA ABC Radio234 123 123 0 R1
  53. #> 3: USA ABC Radio345 123 0 0 R1
  54. #> 4: USA DEF Radio123 125 0 0 R1
  55. #> 5: USA DEF Radio234 125 0 0 R1
  56. #> 6: USA DEF Radio345 125 0 0 R1
  57. # function to find number only elements, obtained from:
  58. # https://stackoverflow.com/questions/43195519/check-if-string-contains-only-numbers-or-only-characters-r/43195554
  59. numbers_only <- function(x)
  60. ! grepl("\\D", x)
  61. # Get the variables that are not years
  62. id_vars <- names(Data)[!numbers_only(names(Data))]
  63. tidy_data <-
  64. melt(
  65. Data,
  66. id.vars = id_vars,
  67. variable.name = "year",
  68. value.name = "volume"
  69. )
  70. #new format
  71. head(tidy_data)
  72. #> Region Customer ProductName CustomerID Class year volume
  73. #> 1: USA ABC Radio123 123 R1 2015 0
  74. #> 2: USA ABC Radio234 123 R1 2015 123
  75. #> 3: USA ABC Radio345 123 R1 2015 0
  76. #> 4: USA DEF Radio123 125 R1 2015 0
  77. #> 5: USA DEF Radio234 125 R1 2015 0
  78. #> 6: USA DEF Radio345 125 R1 2015 0
  79. # end result
  80. result <-
  81. tidy_data[, .(agg_volume = sum(ifelse(Class == "R1", volume, 3 * volume))) ,
  82. by = c("Region", "Customer", "ProductName", "CustomerID", "year")]
  83. #final result
  84. head(result)
  85. #> Region Customer ProductName CustomerID year agg_volume
  86. #> 1: USA ABC Radio123 123 2015 0
  87. #> 2: USA ABC Radio234 123 2015 123
  88. #> 3: USA ABC Radio345 123 2015 0
  89. #> 4: USA DEF Radio123 125 2015 0
  90. #> 5: USA DEF Radio234 125 2015 0
  91. #> 6: USA DEF Radio345 125 2015 0

Created on 2020-01-03 by the reprex package (v0.3.0)

P.S. I recommend you using the reprex package to create your examples as it will make your life much easier.

英文:

Your specific question can be solved in several ways.
However, the first thing I believe you should address is the data shape.
You have column names that contain values ("2015", "2016",...) which should be avoided if possible. I recommend you reading this article by Hadley Wickham about tidy data which is a great guidance in most of the cases. Furthermore, I think the Product Name could be divided into two variables (in case you want to check at Radios or BB-s at some point).

I address the first issue of column names that contain values using data.table's melt function. You can see the end result in the code below.

  1. library(data.table)
  2. Data &lt;-
  3. data.table(
  4. Region = &quot;USA&quot;,
  5. Customer = c(
  6. &quot;ABC&quot;,
  7. &quot;ABC&quot;,
  8. &quot;ABC&quot;,
  9. &quot;DEF&quot;,
  10. &quot;DEF&quot;,
  11. &quot;DEF&quot;,
  12. &quot;ABC&quot;,
  13. &quot;ABC&quot;,
  14. &quot;ABC&quot;,
  15. &quot;XYZ&quot;,
  16. &quot;XYZ&quot;
  17. ),
  18. ProductName = c(
  19. &quot;Radio123&quot;,
  20. &quot;Radio234&quot;,
  21. &quot;Radio345&quot;,
  22. &quot;Radio123&quot;,
  23. &quot;Radio234&quot;,
  24. &quot;Radio345&quot;,
  25. &quot;BB123&quot;,
  26. &quot;BB234&quot;,
  27. &quot;BB345&quot;,
  28. &quot;Radio123&quot;,
  29. &quot;BB123&quot;
  30. ),
  31. CustomerID = c(
  32. &quot;123&quot;,
  33. &quot;123&quot;,
  34. &quot;123&quot;,
  35. &quot;125&quot;,
  36. &quot;125&quot;,
  37. &quot;125&quot;,
  38. &quot;123&quot;,
  39. &quot;123&quot;,
  40. &quot;123&quot;,
  41. &quot;124&quot;,
  42. &quot;124&quot;
  43. ),
  44. &quot;2015&quot; = c(0, 123, 0, 0, 0, 0, 4, 0, 0, 0, 0),
  45. &quot;2016&quot; = c(0, 0, 0, 0, 0, 0, 10093, 0, 0, 234, 0),
  46. Class = c(&quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R1&quot;, &quot;R2&quot;, &quot;R2&quot;, &quot;R2&quot;, &quot;R1&quot;, &quot;R1&quot;)
  47. )
  48. # initial dataset
  49. head(Data)
  50. #&gt; Region Customer ProductName CustomerID 2015 2016 Class
  51. #&gt; 1: USA ABC Radio123 123 0 0 R1
  52. #&gt; 2: USA ABC Radio234 123 123 0 R1
  53. #&gt; 3: USA ABC Radio345 123 0 0 R1
  54. #&gt; 4: USA DEF Radio123 125 0 0 R1
  55. #&gt; 5: USA DEF Radio234 125 0 0 R1
  56. #&gt; 6: USA DEF Radio345 125 0 0 R1
  57. # function to find number only elements, obtained from:
  58. # https://stackoverflow.com/questions/43195519/check-if-string-contains-only-numbers-or-only-characters-r/43195554
  59. numbers_only &lt;- function(x)
  60. ! grepl(&quot;\\D&quot;, x)
  61. # Get the variables that are not years
  62. id_vars &lt;- names(Data)[!numbers_only(names(Data))]
  63. tidy_data &lt;-
  64. melt(
  65. Data,
  66. id.vars = id_vars,
  67. variable.name = &quot;year&quot;,
  68. value.name = &quot;volume&quot;
  69. )
  70. #new format
  71. head(tidy_data)
  72. #&gt; Region Customer ProductName CustomerID Class year volume
  73. #&gt; 1: USA ABC Radio123 123 R1 2015 0
  74. #&gt; 2: USA ABC Radio234 123 R1 2015 123
  75. #&gt; 3: USA ABC Radio345 123 R1 2015 0
  76. #&gt; 4: USA DEF Radio123 125 R1 2015 0
  77. #&gt; 5: USA DEF Radio234 125 R1 2015 0
  78. #&gt; 6: USA DEF Radio345 125 R1 2015 0
  79. # end result
  80. result &lt;-
  81. tidy_data[, .(agg_volume = sum(ifelse(Class == &quot;R1&quot;, volume, 3 * volume))) ,
  82. by = c(&quot;Region&quot;, &quot;Customer&quot;, &quot;ProductName&quot;, &quot;CustomerID&quot;, &quot;year&quot;)]
  83. #final result
  84. head(result)
  85. #&gt; Region Customer ProductName CustomerID year agg_volume
  86. #&gt; 1: USA ABC Radio123 123 2015 0
  87. #&gt; 2: USA ABC Radio234 123 2015 123
  88. #&gt; 3: USA ABC Radio345 123 2015 0
  89. #&gt; 4: USA DEF Radio123 125 2015 0
  90. #&gt; 5: USA DEF Radio234 125 2015 0
  91. #&gt; 6: USA DEF Radio345 125 2015 0

<sup>Created on 2020-01-03 by the reprex package (v0.3.0)</sup>

P.S. I recommend you using the reprex package to create your examples as it will make your life much easier.

huangapple
  • 本文由 发表于 2020年1月3日 21:36:28
  • 转载请务必保留本文链接:https://go.coder-hub.com/59579558.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定