Subtract the sum of multiple values from a value in a different dataset across multiple groups
Question
<details>
<summary>English:</summary>
I have two related datasets. One contains the counts and grand totals of different types of fruit, including citrus fruits; the other contains counts of citrus fruits only. Both datasets cover the same regions, and I need to subtract the citrus numbers in df2 from df1, per region.

The data:

```R
set.seed(123)
region1 <- as.factor(rep(c('north', 'north-east', 'east', 'south-east', 'south',
                           'south-west', 'west', 'north-west', 'centre', 'islands'),
                         each = 6))
fruit <- as.factor(rep(c('2. citrus', '3. pear and orange', '5. bananas',
                         '1. kiwi and lemon', '6. plums', '4. apple and lime'), 10))
count1 <- as.integer(signif(rnorm(60, mean = 2000, sd = 500)), 1)
gtotal1 <- as.numeric(round(rnorm(60, mean = 20000, sd = 5000)), 0)
df1 <- data.frame(region1, fruit, count1, gtotal1)

region2 <- as.factor(rep(c('north', 'north-east', 'east', 'south-east', 'south',
                           'south-west', 'west', 'north-west', 'centre',
                           'islands'),
                         each = 7))
citrus <- as.factor(rep(c('6. lisbon (lemon)', '24. easy p. (orange)', '25. navel (orange)',
                          '37. blood (orange)', '37. tang. (orange)', '43. mand. (orange)',
                          '46. key (lime)'), 10))
count2 <- as.integer(signif(rnorm(70, mean = 2000, sd = 500)), 1)
gtotal2 <- as.numeric(round(rnorm(70, mean = 20000, sd = 5000)), 0)
df2 <- data.frame(region2, citrus, count2, gtotal2)
```

In df1, the counts and gtotals of the individual citrus fruits are included in paired categories with other kinds of fruit (e.g. kiwis and lemons). The "citrus" category was created by other means to give citrus fruits a category of their own, but their values are still included in the paired categories. This issue is present in all 10 regions.

df2 contains the counts of these citrus fruits, per region. I need to subtract the total number of lemons, oranges, and limes in df2 from the matching paired categories in df1, for each region.

This situation and the data are synthetic, so ignore any errors in the values given. I need to do this for both the count and gtotal columns.

Here are the two datasets:

```
   df1                                            | df2
   region1    fruit              count1 gtotal1 | region2    citrus               count2 gtotal2
 1 north      2. citrus            1719   21898 | north      6. lisbon (lemon)      2058   21072
 2 north      3. pear and orange   1884   17488 | north      24. easy p. (orange)   1526   18377
 3 north      5. bananas           2779   18334 | north      25. navel (orange)     1754   20473
 4 north      1. kiwi and lemon    2035   14907 | north      37. blood (orange)     1871   15523
 5 north      6. plums             2064   14641 | north      38. tang. (orange)     2921   13446
 6 north      4. apple and lime    2857   21518 | north      43. mand. (orange)     1674   29986
 7 north-east 2. citrus            2230   22241 | north      46. key (lime)         2117   23004
 8 north-east 3. pear and orange   1367   20265 | north-east 6. lisbon (lemon)      2038   13744
 9 north-east 5. bananas           1656   24611 | north-east 24. easy p. (orange)   1519   16944
10 north-east 1. kiwi and lemon    1777   30250 | north-east 25. navel (orange)     1964   14073
11 north-east 6. plums             2612   17545 | north-east 37. blood (orange)     2722   30994
12 north-east 4. apple and lime    2179    8454 | north-east 38. tang. (orange)     2225   26562
13 east       2. citrus            2200   25029 | north-east 43. mand. (orange)     2020   18674
14 east       3. pear and orange   2055   16454 | north-east 46. key (lime)         1788   22716
15 east       5. bananas           1722   16560 | east       6. lisbon (lemon)       973   17928
...
```

And here is what I would like to obtain:

```
   df3
   region     fruit               count  gtotal
 1 north      2. citrus            1719   21898
 2 north      3. pear and orange  -7862  -80317
 3 north      5. bananas           2779   18334
 4 north      1. kiwi and lemon     -23   -6165
 5 north      6. plums             2064   14641
 6 north      4. apple and lime     740   -1486
 7 north-east 2. citrus            2230   22241
 8 north-east 3. pear and orange  -9083  -86982
 9 north-east 5. bananas           1656   24611
10 north-east 1. kiwi and lemon    -261   16506
11 north-east 6. plums             2612   17545
12 north-east 4. apple and lime     391  -14262
13 east       2. citrus            2200   25029
14 east       3. pear and orange  -8380  -82234
15 east       5. bananas           1722   16560
...
```

I know how to do this manually, by splitting the data by region and doing the subtractions with base R commands, but that would be time consuming. I am sure there is a better way to do this using `dplyr` or `ifelse()` statements, and I would like to learn it for future problems.

Thank you in advance!

</details>
# Answer 1
**Score**: 1
### New Answer

From what I gather, the columns you want to compare in your two datasets (`df1$fruit`, `df2$citrus`) don't share any identifying traits that could be used to match them. I would therefore recommend creating a mapping table manually so that everything links up properly. I created a `fruit_map` data frame for this.

From there, the process is actually a bit more straightforward than before: you use the `fruit_map` table to aggregate your citrus types by `fruit_grp`.

I don't see value in manually aggregating every type as you suggest. That is prone to errors, is poor coding practice, and is difficult to read.

Hopefully this is more applicable to your use case.

```R
library(dplyr)

# Create the relationship between fruit groups and citrus types
fruit_map = data.frame(
  fruit_grp = as.factor(c("1. kiwi and lemon", rep("3. pear and orange", 5), "4. apple and lime")),
  citrus = as.factor(c("6. lisbon (lemon)", "24. easy p. (orange)", "25. navel (orange)",
                       "37. blood (orange)", "37. tang. (orange)", "43. mand. (orange)",
                       "46. key (lime)"))
)

# Sum the citrus counts and totals per region and fruit group
df2_by_grp = df2 %>%
  left_join(fruit_map, by = "citrus") %>%
  group_by(region2, fruit_grp) %>%
  summarise(across(c(count2, gtotal2), ~ sum(., na.rm = TRUE)), .groups = "drop") %>%
  rename("fruit" = "fruit_grp")

# Subtract the citrus sums from the matching df1 rows only
agg_dt = df1 %>%
  inner_join(df2_by_grp, by = c("region1" = "region2", "fruit")) %>%
  mutate(count1 = count1 - count2, gtotal1 = gtotal1 - gtotal2) %>%
  select(region1:gtotal1)

# Write the adjusted rows back into df1; unmatched rows stay unchanged
df3 = df1 %>%
  rows_update(agg_dt, by = c("region1", "fruit")) %>%
  rename_with(~ gsub("1$", "", .x)) # update col names
```
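
As a quick sanity check on the mapping-table idea, here is a minimal base R sketch that uses only the `north` rows printed in the question. The object names (`north1`, `north2`, `grp_of`, `sums`, `north3`) are made up for this illustration, and the citrus labels follow the printed table, so treat it as a hand check under those assumptions and not as a replacement for the `dplyr` pipeline.

```r
# North-region rows copied from the tables printed in the question
north1 <- data.frame(
  fruit   = c("2. citrus", "3. pear and orange", "5. bananas",
              "1. kiwi and lemon", "6. plums", "4. apple and lime"),
  count1  = c(1719, 1884, 2779, 2035, 2064, 2857),
  gtotal1 = c(21898, 17488, 18334, 14907, 14641, 21518),
  stringsAsFactors = FALSE
)
north2 <- data.frame(
  citrus  = c("6. lisbon (lemon)", "24. easy p. (orange)", "25. navel (orange)",
              "37. blood (orange)", "38. tang. (orange)", "43. mand. (orange)",
              "46. key (lime)"),
  count2  = c(2058, 1526, 1754, 1871, 2921, 1674, 2117),
  gtotal2 = c(21072, 18377, 20473, 15523, 13446, 29986, 23004),
  stringsAsFactors = FALSE
)

# Hand-written lookup (hypothetical name): citrus label -> paired group in df1
grp_of <- c("1. kiwi and lemon", rep("3. pear and orange", 5), "4. apple and lime")
names(grp_of) <- north2$citrus
north2$fruit <- unname(grp_of[north2$citrus])

# Total citrus count and gtotal per fruit group
sums <- aggregate(cbind(count2, gtotal2) ~ fruit, data = north2, FUN = sum)

# Line the sums up with north1; groups without citrus get 0 subtracted
idx <- match(north1$fruit, sums$fruit)
sub_count  <- ifelse(is.na(idx), 0, sums$count2[idx])
sub_gtotal <- ifelse(is.na(idx), 0, sums$gtotal2[idx])

north3 <- data.frame(
  fruit  = north1$fruit,
  count  = north1$count1 - sub_count,
  gtotal = north1$gtotal1 - sub_gtotal
)
print(north3)
```

The `count` and `gtotal` values this produces for `north` can be compared line by line with the first six rows of the desired `df3` in the question.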

### Old Answer

Here's a possible solution.

We first create `citrus_cat`, which keeps only the citrus name and not its type, e.g. "lemon (lisbon)" -> "lemon".

Then we group by `region2` and `citrus_cat` to get the sums of `count2` and `gtotal2` for those group pairings.

Next, we left join the new `df2` to the original `df1` and filter where `citrus_cat` occurs in `fruit`. This is where we subtract the totals found in `df2` from `df1`.

Lastly, since some items in `df1` have no match in the new aggregated table, we use `rows_update` to update only the rows where there was a match.

Let me know if you have any questions!

```R
library(dplyr)
library(stringr)

df2_by_grp = df2 %>%
  # extract the second word (assumes the fruit name is the second word of each label)
  mutate(citrus_cat = word(citrus, 2)) %>%
  group_by(region2, citrus_cat) %>%
  summarise(across(c(count2, gtotal2), ~ sum(., na.rm = TRUE)))

agg_dt = df1 %>%
  left_join(df2_by_grp, by = c("region1" = "region2")) %>%
  mutate(fruit = as.character(fruit)) %>%
  rowwise() %>%
  filter(str_detect(fruit, citrus_cat)) %>% # keep rows whose group name contains the citrus name
  mutate(count1 = count1 - count2, gtotal1 = gtotal1 - gtotal2) %>%
  select(region1:gtotal1)

df3 = df1 %>%
  rows_update(agg_dt, by = c("region1", "fruit")) %>%
  rename_with(~ gsub("1$", "", .x)) # update col names
```
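
If maintaining a lookup by hand is undesirable, the category could in principle be derived from the labels as posted in the question, where the fruit name sits in parentheses. The following is a minimal base R sketch under two assumptions: every label ends in `(name)`, and that name appears as a whole word in exactly one `fruit` group. The helper `match_group` and the vectors here are illustrative only.

```r
citrus <- c("6. lisbon (lemon)", "24. easy p. (orange)", "25. navel (orange)",
            "37. blood (orange)", "43. mand. (orange)", "46. key (lime)")
fruit  <- c("2. citrus", "3. pear and orange", "5. bananas",
            "1. kiwi and lemon", "6. plums", "4. apple and lime")

# Keep only the text between the final pair of parentheses
citrus_cat <- sub("^.*\\(([^)]*)\\)[[:space:]]*$", "\\1", citrus)

# Hypothetical helper: for one category, return the fruit group that
# contains it as a whole word, or NA if there are zero or several candidates
match_group <- function(cat, groups) {
  hit <- grepl(paste0("\\b", cat, "\\b"), groups, perl = TRUE)
  if (sum(hit) != 1) return(NA_character_)
  groups[hit]
}

fruit_grp <- vapply(citrus_cat, match_group, character(1),
                    groups = fruit, USE.NAMES = FALSE)
print(data.frame(citrus, citrus_cat, fruit_grp))
```

A table built this way could stand in for a hand-written mapping, but it silently depends on the naming pattern, so checking for `NA` in `fruit_grp` before joining is advisable.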

```
> df3
   region     fruit               count  gtotal
 1 north      2. citrus            1719   21898
 2 north      3. pear and orange  -7862  -80317
 3 north      5. bananas           2779   18334
 4 north      1. kiwi and lemon     -23   -6165
 5 north      6. plums             2064   14641
 6 north      4. apple and lime     740   -1486
 7 north-east 2. citrus            2230   22241
 8 north-east 3. pear and orange  -9083  -86982
 9 north-east 5. bananas           1656   24611
10 north-east 1. kiwi and lemon    -261   16506
11 north-east 6. plums             2612   17545
12 north-east 4. apple and lime     391  -14262
13 east       2. citrus            2200   25029
14 east       3. pear and orange  -8380  -82234
15 east       5. bananas           1722   16560
16 east       1. kiwi and lemon    1920    7200
17 east       6. plums             2248   18576
18 east       4. apple and lime   -1334   -6700
...
```