英文:
Aggregating data with n+m combinations of different attributes in R
问题
I have a data table containing information on the volumes for a varying product mix sold to different customers over several years.
The code for this Data structure is -
Data <- data.table(Region="USA", Customer=c("ABC","ABC","ABC","DEF","DEF","DEF","ABC","ABC","ABC","XYZ","XYZ"), ProductName=c("Radio123","Radio234","Radio345","Radio123","Radio234","Radio345","BB123","BB234","BB345","Radio123","BB123"),CustomerID=c("123","123","123","125","125","125","123","123","123","124","124"),'2015'=c(0,123,0,0,0,0,4,0,0,0,0),'2016'=c(0,0,0,0,0,0,10093,0,0,234,0),Class=c("R1","R1","R1","R1","R1","R1","R2","R2","R2","R1","R1"))
For the sake of simplicity, I have only kept 2015 and 2016.
As you can see, there are several different types (6) of products listed under the Product Name and the corresponding volumes sold are listed under the Year Number. There is a key to differentiate the customers and that is the Customer ID. You can also see under the Class that the different products essentially fall into two categories, i.e., R1 and R2.
What I would like to do is to aggregate the data on the Customer ID so I can get a total of the volume sold each year to every customer, i.e., have one row only for each customer.
This i can do with the following -
Data[, lapply(.SD, sum, na.rm = TRUE), .SDcols = c("2015", "2016"), by = c("Region", "Customer ID", "Customer")]
However, instead of a simple aggregate, I would like to pick a combination of 3R1+1R2 type and consider that as one unit of volume sold. Eg: If I aggregate for Customer ID = 123, I will get 127 (123 Radio234 + 4 BB123) under column 2015 and 10093 (BB123) under column 2016 etc. Instead I would like to show 1 unit as 3 Radio234 + 1 BB123 of complete pairs (4 under 2015 in this example) and in another column show the remaining incomplete pairs (111 Radio234 under 2015 in this example).
Any idea how this can be achieved in R? Should I use a simple divide function to get a count of complete groups in each Class and then a min to get the value of complete pairs?
I am unsure of how to program it but this is the logical I am thinking of -
-
Aggregate Class based on Customer ID which gives exactly two rows per Customer, i.e., 1 for volumes sold of type R1 and one for R2.
-
Divide R1 by 3 to get a count of complete sets that can be formed using R1 and then do a min(R1,R2) to see how many sets we can make. Eg: For 2015, we have 123/3 giving us a dividend of 41 in case of R1 and we see we have 4 R2. Then we do a min(41,4) which gives us the value of complete sets we can make using the formula (3R1+R2), thus giving us 4. We are left with 123-(3x4) = 111 R1 Class products or 37 incomplete pairs (missing R2).
Possible Output -
Data_Agg <- data.table(Region="USA",
Customer=c("ABC","DEF","XYZ"),
CustomerID=c("123","125","124"),
'2015_complete'= c(4,0,0),
'2016_complete'=c(0,0,0),
'2015_incomplete'=c("37_R1",0,0),
'2016_incomplete'=c("10093_R2",0,"78_R1"))
> Data_Agg
# Region Customer CustomerID 2015_complete 2016_complete 2015_incomplete 2016_incomplete
#1: USA ABC 123 4 0 37_R1 10093_R2
#2: USA DEF 125 0 0 0 0
#3: USA XYZ 124 0 0 0 78_R1
Or is there a more efficient way to do this? As an extension, I would also like to create an array or key value pair where I can store the Product Name composition of each row aggregate (how many Radio123 or Radio234 or BB123 make up a particular aggregated row).
Thanks for the help!
英文:
I have a data table containing information on the volumes for a varying product mix sold to different customers over several years.
The code for this Data structure is -
Data <- data.table(Region="USA", Customer=c("ABC","ABC","ABC","DEF","DEF","DEF","ABC","ABC","ABC","XYZ","XYZ"), ProductName=c("Radio123","Radio234","Radio345","Radio123","Radio234","Radio345","BB123","BB234","BB345","Radio123","BB123"),CustomerID=c("123","123","123","125","125","125","123","123","123","124","124"),'2015'=c(0,123,0,0,0,0,4,0,0,0,0),'2016'=c(0,0,0,0,0,0,10093,0,0,234,0),Class=c("R1","R1","R1","R1","R1","R1","R2","R2","R2","R1","R1"))
For the sake of simplicity, I have only kept 2015 and 2016.
As you can see, there are several different types (6) of products listed under the Product Name and the corresponding volumes sold are listed under the Year Number. There is a key to differentiate the customers and that is the Customer ID. You can also see under the Class that the different products essentially fall into two categories, i.e., R1 and R2.
What I would like to do is to aggregate the data on the Customer ID so I can get a total of the volume sold each year to every customer, i.e., have one row only for each customer.
This i can do with the following -
Data[, lapply(.SD, sum, na.rm = TRUE), .SDcols = c("2015", "2016"), by = c("Region", "Customer ID", "Customer")]
However, instead of a simple aggregate, I would like to pick a combination of 3R1+1R2 type and consider that as one unit of volume sold. Eg: If I aggregate for Customer ID = 123, I will get 127 (123 Radio234 + 4 BB123) under column 2015 and 10093 (BB123) under column 2016 etc. Instead I would like to show 1 unit as 3 Radio234 + 1 BB123 of complete pairs (4 under 2015 in this example) and in another column show the remaining incomplete pairs (111 Radio234 under 2015 in this example).
Any idea how this can be achieved in R? Should I use a simple divide function to get a count of complete groups in each Class and then a min to get the value of complete pairs?
I am unsure of how to program it but this is the logical I am thinking of -
-
Aggregate Class based on Customer ID which gives exactly two rows per Customer, i.e., 1 for volumes sold of type R1 and one for R2.
-
Divide R1 by 3 to get a count of complete sets that can be formed using R1 and then do a min(R1,R2) to see how many sets we can make. Eg: For 2015, we have 123/3 giving us a dividend of 41 in case of R1 and we see we have 4 R2. Then we do a min(41,4) which gives us the value of complete sets we can make using the formula (3R1+R2), thus giving us 4. We are left with 123-(3x4) = 111 R1 Class products or 37 incomplete pairs (missing R2).
Possible Output -
Data_Agg <- data.table(Region="USA",
Customer=c("ABC","DEF","XYZ"),
CustomerID=c("123","125","124"),
'2015_complete'= c(4,0,0),
'2016_complete'=c(0,0,0),
'2015_incomplete'=c("37_R1",0,0),
'2016_incomplete'=c("10093_R2",0,"78_R1"))
> Data_Agg
# Region Customer CustomerID 2015_complete 2016_complete 2015_incomplete 2016_incomplete
#1: USA ABC 123 4 0 37_R1 10093_R2
#2: USA DEF 125 0 0 0 0
#3: USA XYZ 124 0 0 0 78_R1
Or is there a more efficient way to do this? As an extension, I would also like to create an array or key value pair where I can store the Product Name composition of each row aggregate (how many Radio123 or Radio234 or BB123 make up a particular aggregated row).
Thanks for the help!
答案1
得分: 1
以下是您要翻译的内容:
"Your specific question can be solved in several ways.
However, the first thing I believe you should address is the data shape.
You have column names that contain values ("2015", "2016",...) which should be avoided if possible. I recommend you reading this article by Hadley Wickham about tidy data which is a great guidance in most of the cases. Furthermore, I think the Product Name
could be divided into two variables (in case you want to check at Radios or BB-s at some point).
I address the first issue of column names that contain values using data.table
's melt
function. You can see the end result in the code below."
library(data.table)
Data <-
data.table(
Region = "USA",
Customer = c(
"ABC",
"ABC",
"ABC",
"DEF",
"DEF",
"DEF",
"ABC",
"ABC",
"ABC",
"XYZ",
"XYZ"
),
ProductName = c(
"Radio123",
"Radio234",
"Radio345",
"Radio123",
"Radio234",
"Radio345",
"BB123",
"BB234",
"BB345",
"Radio123",
"BB123"
),
CustomerID = c(
"123",
"123",
"123",
"125",
"125",
"125",
"123",
"123",
"123",
"124",
"124"
),
"2015" = c(0, 123, 0, 0, 0, 0, 4, 0, 0, 0, 0),
"2016" = c(0, 0, 0, 0, 0, 0, 10093, 0, 0, 234, 0),
Class = c("R1", "R1", "R1", "R1", "R1", "R1", "R2", "R2", "R2", "R1", "R1")
)
# initial dataset
head(Data)
#> Region Customer ProductName CustomerID 2015 2016 Class
#> 1: USA ABC Radio123 123 0 0 R1
#> 2: USA ABC Radio234 123 123 0 R1
#> 3: USA ABC Radio345 123 0 0 R1
#> 4: USA DEF Radio123 125 0 0 R1
#> 5: USA DEF Radio234 125 0 0 R1
#> 6: USA DEF Radio345 125 0 0 R1
# function to find number only elements, obtained from:
# https://stackoverflow.com/questions/43195519/check-if-string-contains-only-numbers-or-only-characters-r/43195554
numbers_only <- function(x)
! grepl("\\D", x)
# Get the variables that are not years
id_vars <- names(Data)[!numbers_only(names(Data))]
tidy_data <-
melt(
Data,
id.vars = id_vars,
variable.name = "year",
value.name = "volume"
)
#new format
head(tidy_data)
#> Region Customer ProductName CustomerID Class year volume
#> 1: USA ABC Radio123 123 R1 2015 0
#> 2: USA ABC Radio234 123 R1 2015 123
#> 3: USA ABC Radio345 123 R1 2015 0
#> 4: USA DEF Radio123 125 R1 2015 0
#> 5: USA DEF Radio234 125 R1 2015 0
#> 6: USA DEF Radio345 125 R1 2015 0
# end result
result <-
tidy_data[, .(agg_volume = sum(ifelse(Class == "R1", volume, 3 * volume))) ,
by = c("Region", "Customer", "ProductName", "CustomerID", "year")]
#final result
head(result)
#> Region Customer ProductName CustomerID year agg_volume
#> 1: USA ABC Radio123 123 2015 0
#> 2: USA ABC Radio234 123 2015 123
#> 3: USA ABC Radio345 123 2015 0
#> 4: USA DEF Radio123 125 2015 0
#> 5: USA DEF Radio234 125 2015 0
#> 6: USA DEF Radio345 125 2015 0
Created on 2020-01-03 by the reprex package (v0.3.0)
P.S. I recommend you using the reprex
package to create your examples as it will make your life much easier.
英文:
Your specific question can be solved in several ways.
However, the first thing I believe you should address is the data shape.
You have column names that contain values ("2015", "2016",...) which should be avoided if possible. I recommend you reading this article by Hadley Wickham about tidy data which is a great guidance in most of the cases. Furthermore, I think the Product Name
could be divided into two variables (in case you want to check at Radios or BB-s at some point).
I address the first issue of column names that contain values using data.table
's melt
function. You can see the end result in the code below.
library(data.table)
Data <-
data.table(
Region = "USA",
Customer = c(
"ABC",
"ABC",
"ABC",
"DEF",
"DEF",
"DEF",
"ABC",
"ABC",
"ABC",
"XYZ",
"XYZ"
),
ProductName = c(
"Radio123",
"Radio234",
"Radio345",
"Radio123",
"Radio234",
"Radio345",
"BB123",
"BB234",
"BB345",
"Radio123",
"BB123"
),
CustomerID = c(
"123",
"123",
"123",
"125",
"125",
"125",
"123",
"123",
"123",
"124",
"124"
),
"2015" = c(0, 123, 0, 0, 0, 0, 4, 0, 0, 0, 0),
"2016" = c(0, 0, 0, 0, 0, 0, 10093, 0, 0, 234, 0),
Class = c("R1", "R1", "R1", "R1", "R1", "R1", "R2", "R2", "R2", "R1", "R1")
)
# initial dataset
head(Data)
#> Region Customer ProductName CustomerID 2015 2016 Class
#> 1: USA ABC Radio123 123 0 0 R1
#> 2: USA ABC Radio234 123 123 0 R1
#> 3: USA ABC Radio345 123 0 0 R1
#> 4: USA DEF Radio123 125 0 0 R1
#> 5: USA DEF Radio234 125 0 0 R1
#> 6: USA DEF Radio345 125 0 0 R1
# function to find number only elements, obtained from:
# https://stackoverflow.com/questions/43195519/check-if-string-contains-only-numbers-or-only-characters-r/43195554
numbers_only <- function(x)
! grepl("\\D", x)
# Get the variables that are not years
id_vars <- names(Data)[!numbers_only(names(Data))]
tidy_data <-
melt(
Data,
id.vars = id_vars,
variable.name = "year",
value.name = "volume"
)
#new format
head(tidy_data)
#> Region Customer ProductName CustomerID Class year volume
#> 1: USA ABC Radio123 123 R1 2015 0
#> 2: USA ABC Radio234 123 R1 2015 123
#> 3: USA ABC Radio345 123 R1 2015 0
#> 4: USA DEF Radio123 125 R1 2015 0
#> 5: USA DEF Radio234 125 R1 2015 0
#> 6: USA DEF Radio345 125 R1 2015 0
# end result
result <-
tidy_data[, .(agg_volume = sum(ifelse(Class == "R1", volume, 3 * volume))) ,
by = c("Region", "Customer", "ProductName", "CustomerID", "year")]
#final result
head(result)
#> Region Customer ProductName CustomerID year agg_volume
#> 1: USA ABC Radio123 123 2015 0
#> 2: USA ABC Radio234 123 2015 123
#> 3: USA ABC Radio345 123 2015 0
#> 4: USA DEF Radio123 125 2015 0
#> 5: USA DEF Radio234 125 2015 0
#> 6: USA DEF Radio345 125 2015 0
<sup>Created on 2020-01-03 by the reprex package (v0.3.0)</sup>
P.S. I recommend you using the reprex
package to create your examples as it will make your life much easier.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论