从不同数据集中的多个组中减去多个值的总和

huangapple go评论54阅读模式
英文:

Subtract the sum of multiple values from a value in a different dataset across multiple groups

问题

以下是翻译好的部分:

我有两个相关的数据集,一个包含不同类型的水果,包括柑橘类水果的计数和总数,另一个包含只有柑橘类水果的计数。两个数据集都包含来自相同地区的数据,我需要按地区从df1中减去df2中的柑橘类水果的数量。

数据:

```R
set.seed(123)

region1 <- as.factor(rep(c('north', 'north-east', 'east', 'south-east', 'south', 
                      'south-west', 'west', 'north-west', 'centre', 'islands'),
                      each = 6))
fruit <- as.factor(rep(c('2. citrus', '3. pear and orange', '5. bananas', 
                         '1. kiwi and lemon', '6. plums', '4. apple and lime'), 10))
count1 <- as.integer(signif(rnorm(60, mean = 2000, sd = 500)), 1)
gtotal1 <- as.numeric(round(rnorm(60, mean = 20000, sd = 5000)), 0)
df1 <- data.frame(region1, fruit, count1, gtotal1)

region2 <- as.factor(rep(c('north', 'north-east', 'east', 'south-east', 'south', 
                          'south-west', 'west', 'north-west', 'centre', 
                          'islands'), each = 7))
citrus <- as.factor(rep(c('6. lisbon (lemon)', '24. easy p. (orange)', '25. navel (orange)', 
                          '37. blood (orange)', '37. tang. (orange)', '43. mand. (orange)',
                          '46. key (lime)'), 10))
count2 <- as.integer(signif(rnorm(70, mean = 2000, sd = 500)), 1)
gtotal2 <- as.numeric(round(rnorm(70, mean = 20000, sd = 5000)), 0)
df2 <- data.frame(region2, citrus, count2, gtotal2)

在df1中,不同柑橘类水果的计数和总数与其他种类的水果(例如猕猴桃和柠檬)一起包括在内,“柑橘类”类别是通过其他方式创建的,以给予它们自己的类别,但不同柑橘类水果的值仍包括在其他成对的类别中。这个问题存在于所有10个地区中。

df2包含每个地区中与df1类别中成对的柑橘类水果的计数。我需要从df1中的每个地区的类别中减去df2中柠檬、橙子和酸橙的总数。

这种情况和数据是合成的,所以忽略所给值中的任何错误。我需要对计数和gtotal列执行此操作。

这是两个数据集:

df1                                              df2
      region1             fruit count1 gtotal1 |       region2             citrus count2 gtotal2
1       north         2. citrus   1719   21898 |       north     6. lisbon (lemon)   2058   21072
2       north 3. pear and orange  1884   17488 |       north  24. easy p. (orange)   1526   18377
3       north        5. bananas   2779   18334 |       north    25. navel (orange)   1754   20473
4       north 1. kiwi and lemon   2035   14907 |       north    37. blood (orange)   1871   15523
5       north          6. plums   2064   14641 |       north    38. tang. (orange)   2921   13446
6       north  4. apple and lime  2857   21518 |       north    43. mand. (orange)   1674   29986
7  north-east         2. citrus   2230   22241 |       north        46. key (lime)   2117   23004
8  north-east 3. pear and orange  1367   20265 |  north-east     6. lisbon (lemon)   2038   13744
9  north-east        5. bananas   1656   24611 |  north-east  24. easy p. (orange)   1519   16944
10 north-east 1. kiwi and lemon   1777   30250 |  north-east    25. navel (orange)   1964   14073
11 north-east          6. plums   2612   17545 |  north-east    37. blood (orange)   2722   30994
12 north-east  4. apple and lime  2179    8454 |  north-east    38. tang. (orange)   2225   26562
13       east         2. citrus   2200   25029 |  north-east    43. mand. (orange)   2020   18674
14       east 3. pear and orange  2055   16454 |  north-east        46. key (lime)   1788   22716
15       east        5. bananas   1722   16560 |        east     6. lisbon (lemon)    973   17928
...

以下是我想要获得的内容:

df3
        region             fruit count gtotal 
1       north         2. citrus   1719   21898 
2       north 3. pear and orange -7862  -80317 
3       north        5. bananas   2779   18334 
4       north 1. kiwi and lemon    -23   -6165
5       north          6. plums   2064   14641 
6       north 4. apple and lime    740   -1486
7  north-east         2. citrus   2230   22241 
8  north-east 3. pear and orange -9083  -86982  
9  north-east        5. bananas   1656   24611 
10 north-east  1. kiwi and lemon 

<details>
<summary>英文:</summary>

I have two related datasets, one contains the counts and grand totals of different types of fruits, including citrus fruits, and the other contains counts of just citrus fruits. Both datasets contain data from the same regions and I need to subtract the numbers of citrus fruits in df2 from df1, per region.

The data:

set.seed(123)

region1 <- as.factor(rep(c('north', 'north-east', 'east', 'south-east', 'south',
'south-west', 'west', 'north-west', 'centre', 'islands'),
each = 6))
fruit <- as.factor(rep(c('2. citrus', '3. pear and orange', '5. bananas',
'1. kiwi and lemon', '6. plums', '4. apple and lime'), 10))
count1 <- as.integer(signif(rnorm(60, mean = 2000, sd = 500)), 1)
gtotal1 <- as.numeric(round(rnorm(60, mean = 20000, sd = 5000)), 0)
df1 <- data.frame(region1, fruit, count1, gtotal1)

region2 <- as.factor(rep(c('north', 'north-east', 'east', 'south-east', 'south',
'south-west', 'west', 'north-west', 'centre',
'islands'),
each = 7))
citrus <- as.factor(rep(c('6. lisbon (lemon)', '24. easy p. (orange)', '25. navel (orange)',
'37. blood (orange)', '37. tang. (orange)', '43. mand. (orange)',
'46. key (lime)'), 10))
count2 <- as.integer(signif(rnorm(70, mean = 2000, sd = 500)), 1)
gtotal2 <- as.numeric(round(rnorm(70, mean = 20000, sd = 5000)), 0)
df2 <- data.frame(region2, citrus, count2, gtotal2)


In df1, the counts and gtotals of different citrus fruits were included with other kinds of fruits (e.g. kiwis and lemons), the &quot;citrus&quot; category was created via other means to give them their own category, but the values of the different citrus fruits are still included in the other paired categories. This issue is present for all 10 regions.

df2 contains the counts of these citrus fruits within the paired df1 categories, per region. I need to subtract the total number of lemons, oranges, and limes in df2 from their categories in df1 for each region.

This situation and data is synthetic, so ignore any errors in the values given. I need to do this for the count and gtotal columns.

Here are the two datasets:

df1 df2
region1 fruit count1 gtotal1 | region2 citrus count2 gtotal2
1 north 2. citrus 1719 21898 | north 6. lisbon (lemon) 2058 21072
2 north 3. pear and orange 1884 17488 | north 24. easy p. (orange) 1526 18377
3 north 5. bananas 2779 18334 | north 25. navel (orange) 1754 20473
4 north 1. kiwi and lemon 2035 14907 | north 37. blood (orange) 1871 15523
5 north 6. plums 2064 14641 | north 38. tang. (orange) 2921 13446
6 north 4. apple and lime 2857 21518 | north 43. mand. (orange) 1674 29986
7 north-east 2. citrus 2230 22241 | north 46. key (lime) 2117 23004
8 north-east 3. pear and orange 1367 20265 | north-east 6. lisbon (lemon) 2038 13744
9 north-east 5. bananas 1656 24611 | north-east 24. easy p. (orange) 1519 16944
10 north-east 1. kiwi and lemon 1777 30250 | north-east 25. navel (orange) 1964 14073
11 north-east 6. plums 2612 17545 | north-east 37. blood (orange) 2722 30994
12 north-east 4. apple and lime 2179 8454 | north-east 38. tang. (orange) 2225 26562
13 east 2. citrus 2200 25029 | north-east 43. mand. (orange) 2020 18674
14 east 3. pear and orange 2055 16454 | north-east 46. key (lime) 1788 22716
15 east 5. bananas 1722 16560 | east 6. lisbon (lemon) 973 17928
...

And here is what I would like to obtain:

df3
region fruit count gtotal
1 north 2. citrus 1719 21898
2 north 3. pear and orange -7862 -80317
3 north 5. bananas 2779 18334
4 north 1. kiwi and lemon -23 -6165
5 north 6. plums 2064 14641
6 north 4. apple and lime 740 -1486
7 north-east 2. citrus 2230 22241
8 north-east 3. pear and orange -9083 -86982
9 north-east 5. bananas 1656 24611
10 north-east 1. kiwi and lemon -261 16506
11 north-east 6. plums 2612 17545
12 north-east 4. apple and lime 391 -14262
13 east 2. citrus 2200 25029
14 east 3. pear and orange -8380 -82234
15 east 5. bananas 1722 16560
...



I know how to do this manually, by splitting the data by region and doing subtractions using base R commands, but this would be time consuming I am sure there is a better way to do this using `dplyr` or `ifelse()` statements, and I would like to learn this for future issues.

Thank you in advance!

</details>


# 答案1
**得分**: 1

### 新回答

根据我所了解的情况,你想要在两个数据集中比较的列(df1$fruit, df2$citrus)没有任何标识特征来进行匹配。因此,我建议手动创建一个映射表,以便你可以正确地将它们连接起来。我在这里创建了一个名为 `fruit_map` 的数据框。

接下来的步骤实际上比以前要简单一些,你可以使用 `fruit_map` 表来按 `fruit_grp` 聚合你的柑橘类别。

我认为手动聚合每种类型并不值得。这样容易出错,遵循不良的编程技巧,难以阅读等等。

希望这对你的用例更加适用。

```R
# 创建水果和柑橘之间的关系
fruit_map = data.frame(
  fruit_grp = as.factor(c("1. kiwi and lemon",rep("3. pear and orange",5),"4. apple and lime")),
  citrus = as.factor(c("6. lisbon (lemon)","24. easy p. (orange)","25. navel (orange)","37. blood (orange)","37. tang. (orange)","43. mand. (orange)","46. key (lime)"))
)

df2_by_grp = df2 %>%
  left_join(fruit_map, by = "citrus") %>%
  group_by(region2, fruit_grp) %>%
  summarise(across(c(count2, gtotal2), ~ sum(., na.rm = T))) %>%
  rename("fruit" = "fruit_grp")

agg_dt = df1 %>%
  inner_join(df2_by_grp, by = c("region1" = "region2", "fruit")) %>%
  mutate(count1 = count1 - count2, gtotal1 = gtotal1 - gtotal2) %>%
  select(region1:gtotal1)

df3 = df1 %>%
  rows_update(agg_dt, by = c("region1", "fruit")) %>%
  rename_with(~ gsub("1$", "", .x)) # 更新列名

旧回答

以下是可能的解决方案。

首先,我们创建了 citrus_cat,其中只保留柑橘名称,而不包括其类型。例如,“lemon (lisbon)” -> “lemon”。

然后,按 region2citrus_cat 分组,以获取这些组合的 count2gtotal2 的总和。

接下来,我们将新的 df2 左连接到原始的 df1,并筛选其中 citrus_cat 存在于 fruit 中的部分。在这里,我们从 df1 中减去了在 df2 中找到的总数。

最后,由于在 df1 中存在未在新的聚合表中找到匹配项的项目。我们使用 row_update 仅更新了有匹配项的行。

如果你有任何问题,请告诉我!

library(dplyr)
library(stringr)

df2_by_grp = df2 %>%
  mutate(citrus_cat = word(citrus, 2)) %>% # 提取第二个单词
  group_by(region2, citrus_cat) %>%
  summarise(across(c(count2, gtotal2), ~ sum(., na.rm = T)))

agg_dt = df1 %>% 
  left_join(df2_by_grp, by = c("region1" = "region2")) %>%
  mutate(fruit = as.character(fruit)) %>%
  rowwise() %>%
  filter(str_detect(fruit, citrus_cat)) %>%
  mutate(count1 = count1 - count2, gtotal1 = gtotal1 - gtotal2) %>%
  select(region1:gtotal1)

df3 = df1 %>%
  rows_update(agg_dt, by = c("region1", "fruit")) %>%
  rename_with(~ gsub("1$", "", .x)) # 更新列名
> df3

       region              fruit count gtotal
1       north          2. citrus  1719  21898
2       north 3. pear and orange -7862 -80317
3       north         5. bananas  2779  18334
4       north  1. kiwi and lemon   -23  -6165
5       north           6. plums  2064  14641
6       north  4. apple and lime   740  -1486
7  north-east          2. citrus  2230  22241
8  north-east 3. pear and orange -9083 -86982
9  north-east         5. bananas  1656  24611
10 north-east  1. kiwi and lemon  -261  16506
11 north-east           6. plums  2612  17545
12 north-east  4. apple and lime   391 -14262
13       east          2. citrus  2200  25029
14       east 3. pear and orange -8380 -82234
15       east         5. bananas  1722  16560
16       east  1. kiwi and lemon  1920   7200
17       east           6. plums  2248  18576
18       east  4. apple and lime -1334  -6700
...
英文:

New Answer

From what I'm gathering, the cols that you want to compare in your two datasets (df1$fruit, df2$citrus), don't have any identifying traits to match them. Therefore, I would recommend manually creating a mapping table so you can link everything up properly. I created a fruit_map df here.

Continuing on - the process is actually a bit more straight forward than before where you use the fruit_map table to aggregate your citrus types by fruit_grp.

I don't see value in manually aggregating every type like you're suggesting. That can be prone to errors, follows poor coding techniques, difficult to read etc.

Hopefully, this is more applicable for your use case.

# Create relationship between fruit and citrus
fruit_map = data.frame(
  fruit_grp = as.factor(c(&quot;1. kiwi and lemon&quot;,rep(&quot;3. pear and orange&quot;,5),&quot;4. apple and lime&quot;)),
  citrus = as.factor(c(&quot;6. lisbon (lemon)&quot;,&quot;24. easy p. (orange)&quot;,&quot;25. navel (orange)&quot;,&quot;37. blood (orange)&quot;,&quot;37. tang. (orange)&quot;,&quot;43. mand. (orange)&quot;,&quot;46. key (lime)&quot;))
)

df2_by_grp = df2 %&gt;%
  left_join(fruit_map, by = &quot;citrus&quot;) %&gt;%
  group_by(region2, fruit_grp) %&gt;%
  summarise(across(c(count2, gtotal2), ~ sum(., na.rm = T))) %&gt;%
  rename(&quot;fruit&quot; = &quot;fruit_grp&quot;)

agg_dt = df1 %&gt;%
  inner_join(df2_by_grp, by = c(&quot;region1&quot; = &quot;region2&quot;, &quot;fruit&quot;)) %&gt;%
  mutate(count1 = count1 - count2, gtotal1 = gtotal1 - gtotal2) %&gt;%
  select(region1:gtotal1)

df3 = df1 %&gt;%
  rows_update(agg_dt, by = c(&quot;region1&quot;, &quot;fruit&quot;)) %&gt;%
  rename_with(~ gsub(&quot;1$&quot;, &quot;&quot;, .x)) # update col names

Old Answer

Here's a possible solution.

We first create citrus_cat which only keeps the citrus name and not its' type. e.g. "lemon (lisbon)" -> "lemon".

Then group by region2 and citrus_cat to get the sum of count2 and gtotal2 for those group pairings.

Next, we left join the new df2 to the original df1 and filter where citrus_cat exists in fruit. Here is where we subtract our total amounts found in df2 from df1.

Lastly, since there are items present in df1 that did not have a match in the new aggregated table. We use row_update to only update the rows were there was a match.

Let me know if you have any questions!

library(dplyr)
library(stringr)

df2_by_grp = df2 %&gt;%
  mutate(citrus_cat = word(citrus, 2)) %&gt;% #extract second word
  group_by(region2, citrus_cat) %&gt;%
  summarise(across(c(count2, gtotal2), ~ sum(., na.rm = T)))

agg_dt = df1 %&gt;% 
  left_join(df2_by_grp, by = c(&quot;region1&quot; = &quot;region2&quot;)) %&gt;%
  mutate(fruit = as.character(fruit)) %&gt;%
  rowwise() %&gt;%
  filter(str_detect(fruit, citrus_cat)) %&gt;%
  mutate(count1 = count1 - count2, gtotal1 = gtotal1 - gtotal2) %&gt;%
  select(region1:gtotal1)

df3 = df1 %&gt;%
  rows_update(agg_dt, by = c(&quot;region1&quot;, &quot;fruit&quot;)) %&gt;%
  rename_with(~ gsub(&quot;1$&quot;, &quot;&quot;, .x)) # update col names
&gt; df3

       region              fruit count gtotal
1       north          2. citrus  1719  21898
2       north 3. pear and orange -7862 -80317
3       north         5. bananas  2779  18334
4       north  1. kiwi and lemon   -23  -6165
5       north           6. plums  2064  14641
6       north  4. apple and lime   740  -1486
7  north-east          2. citrus  2230  22241
8  north-east 3. pear and orange -9083 -86982
9  north-east         5. bananas  1656  24611
10 north-east  1. kiwi and lemon  -261  16506
11 north-east           6. plums  2612  17545
12 north-east  4. apple and lime   391 -14262
13       east          2. citrus  2200  25029
14       east 3. pear and orange -8380 -82234
15       east         5. bananas  1722  16560
16       east  1. kiwi and lemon  1920   7200
17       east           6. plums  2248  18576
18       east  4. apple and lime -1334  -6700
...

huangapple
  • 本文由 发表于 2023年7月14日 00:23:42
  • 转载请务必保留本文链接:https://go.coder-hub.com/76681517.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定