R语言从跨多个组的不同数据集中的值中减去多个值的总和

klsxnrf1 于 2023-07-31 发布在其他

关注(0)|答案(1)|浏览(79)

我有两个相关的数据集，一个包含不同类型水果的计数和总计，包括柑橘类水果，另一个包含柑橘类水果的计数。两个数据集都包含来自相同地区的数据，我需要从每个地区的df 1中减去df 2中的柑橘类水果的数量。
数据：

set.seed(123)

region1 <- as.factor(rep(c('north', 'north-east', 'east', 'south-east', 'south', 
                      'south-west', 'west', 'north-west', 'centre', 'islands'),
                      each = 6))
fruit <- as.factor(rep(c('2. citrus', '3. pear and orange', '5. bananas', 
                         '1. kiwi and lemon', '6. plums', '4. apple and lime'), 10))
count1 <- as.integer(signif(rnorm(60, mean = 2000, sd = 500)), 1)
gtotal1 <- as.numeric(round(rnorm(60, mean = 20000, sd = 5000)), 0)
df1 <- data.frame(region1, fruit, count1, gtotal1)

region2 <- as.factor(rep(c('north', 'north-east', 'east', 'south-east', 'south', 
                          'south-west', 'west', 'north-west', 'centre', 
                          'islands'),
                        each = 7))
citrus <- as.factor(rep(c('6. lisbon (lemon)', '24. easy p. (orange)', '25. navel (orange)', 
                          '37. blood (orange)', '37. tang. (orange)', '43. mand. (orange)',
                          '46. key (lime)'), 10))
count2 <- as.integer(signif(rnorm(70, mean = 2000, sd = 500)), 1)
gtotal2 <- as.numeric(round(rnorm(70, mean = 20000, sd = 5000)), 0)
df2 <- data.frame(region2, citrus, count2, gtotal2)

字符串
在df 1中，不同柑橘类水果的计数和gtotal与其他种类的水果（例如，苹果）一起被包括。猕猴桃和柠檬），则通过其他方式创建“柑橘类”类别以给予它们自己的类别，但是不同柑橘类水果的值仍然包括在其他配对类别中。所有10个区域均存在此问题。
DF 2包含每个区域的配对DF 1类别内的这些柑橘类水果的计数。我需要从每个区域的df 1中的类别中减去df 2中的柠檬、橙子和酸橙的总数。
这种情况和数据是合成的，因此忽略给定值中的任何错误。我需要对count和gtotal列执行此操作。
以下是两个数据集：

df1                                              df2
      region1             fruit count1 gtotal1 |       region2             citrus count2 gtotal2
1       north         2. citrus   1719   21898 |       north     6. lisbon (lemon)   2058   21072
2       north 3. pear and orange  1884   17488 |       north  24. easy p. (orange)   1526   18377
3       north        5. bananas   2779   18334 |       north    25. navel (orange)   1754   20473
4       north 1. kiwi and lemon   2035   14907 |       north    37. blood (orange)   1871   15523
5       north          6. plums   2064   14641 |       north    38. tang. (orange)   2921   13446
6       north  4. apple and lime  2857   21518 |       north    43. mand. (orange)   1674   29986
7  north-east         2. citrus   2230   22241 |       north        46. key (lime)   2117   23004
8  north-east 3. pear and orange  1367   20265 |  north-east     6. lisbon (lemon)   2038   13744
9  north-east        5. bananas   1656   24611 |  north-east  24. easy p. (orange)   1519   16944
10 north-east 1. kiwi and lemon   1777   30250 |  north-east    25. navel (orange)   1964   14073
11 north-east          6. plums   2612   17545 |  north-east    37. blood (orange)   2722   30994
12 north-east  4. apple and lime  2179    8454 |  north-east    38. tang. (orange)   2225   26562
13       east         2. citrus   2200   25029 |  north-east    43. mand. (orange)   2020   18674
14       east 3. pear and orange  2055   16454 |  north-east        46. key (lime)   1788   22716
15       east        5. bananas   1722   16560 |        east     6. lisbon (lemon)    973   17928
...

型
以下是我想要的：

df3
        region             fruit count gtotal 
1       north         2. citrus   1719   21898 
2       north 3. pear and orange -7862  -80317 
3       north        5. bananas   2779   18334 
4       north 1. kiwi and lemon    -23   -6165
5       north          6. plums   2064   14641 
6       north 4. apple and lime    740   -1486
7  north-east         2. citrus   2230   22241 
8  north-east 3. pear and orange -9083  -86982  
9  north-east        5. bananas   1656   24611 
10 north-east  1. kiwi and lemon  -261   16506
11 north-east          6. plums   2612   17545 
12 north-east 4. apple and lime    391  -14262 
13       east         2. citrus   2200   25029 
14       east 3. pear and orange -8380  -82234
15       east        5. bananas   1722   16560 
...

型
我知道如何手动做到这一点，通过按区域分割数据并使用基本R命令进行减法，但这将是耗时的，我相信有一个更好的方法来做到这一点，使用dplyr或ifelse()语句，我想学习这一点，为未来的问题。
提前感谢！

r

来源：https://stackoverflow.com/questions/76681517/subtract-the-sum-of-multiple-values-from-a-value-in-a-different-dataset-across-m

1条答案

按热度按时间

zzlelutf1#

新增答案

根据我收集的信息，你想在两个数据集（df1$fruit，df2$citrus）中比较的cols没有任何识别特征来匹配它们。因此，我建议手动创建一个Map表，以便您可以正确地链接所有内容。我在这里创建了fruit_map df。
继续-这个过程实际上比以前更直接一些，你使用fruit_map表通过fruit_grp聚合你的柑橘类型。
我看不出像你建议的那样手动聚合所有类型的价值。这可能容易出错，遵循较差的编码技术，难以阅读等。
希望这更适用于您的用例。

# Create relationship between fruit and citrus
fruit_map = data.frame(
  fruit_grp = as.factor(c("1. kiwi and lemon",rep("3. pear and orange",5),"4. apple and lime")),
  citrus = as.factor(c("6. lisbon (lemon)","24. easy p. (orange)","25. navel (orange)","37. blood (orange)","37. tang. (orange)","43. mand. (orange)","46. key (lime)"))
)

df2_by_grp = df2 %>%
  left_join(fruit_map, by = "citrus") %>%
  group_by(region2, fruit_grp) %>%
  summarise(across(c(count2, gtotal2), ~ sum(., na.rm = T))) %>%
  rename("fruit" = "fruit_grp")

agg_dt = df1 %>%
  inner_join(df2_by_grp, by = c("region1" = "region2", "fruit")) %>%
  mutate(count1 = count1 - count2, gtotal1 = gtotal1 - gtotal2) %>%
  select(region1:gtotal1)

df3 = df1 %>%
  rows_update(agg_dt, by = c("region1", "fruit")) %>%
  rename_with(~ gsub("1$", "", .x)) # update col names

字符串

旧答案

这里有一个可能的解决方案。
我们首先创建citrus_cat，它只保留柑橘的名称，而不是它的类型。例如“lemon（里斯本）”->“lemon”。
然后按region2和citrus_cat分组，得到count2和gtotal2的和。
接下来，我们将新的df2连接到原始的df1并过滤fruit中存在的citrus_cat。这里是我们从df1减去df2中找到的总量。
最后，由于df 1中存在的项目在新的聚合表中没有匹配。我们使用row_update只更新匹配的行。
如果你有任何问题，请告诉我！

library(dplyr)
library(stringr)

df2_by_grp = df2 %>%
  mutate(citrus_cat = word(citrus, 2)) %>% #extract second word
  group_by(region2, citrus_cat) %>%
  summarise(across(c(count2, gtotal2), ~ sum(., na.rm = T)))

agg_dt = df1 %>% 
  left_join(df2_by_grp, by = c("region1" = "region2")) %>%
  mutate(fruit = as.character(fruit)) %>%
  rowwise() %>%
  filter(str_detect(fruit, citrus_cat)) %>%
  mutate(count1 = count1 - count2, gtotal1 = gtotal1 - gtotal2) %>%
  select(region1:gtotal1)

df3 = df1 %>%
  rows_update(agg_dt, by = c("region1", "fruit")) %>%
  rename_with(~ gsub("1$", "", .x)) # update col names

> df3

       region              fruit count gtotal
1       north          2. citrus  1719  21898
2       north 3. pear and orange -7862 -80317
3       north         5. bananas  2779  18334
4       north  1. kiwi and lemon   -23  -6165
5       north           6. plums  2064  14641
6       north  4. apple and lime   740  -1486
7  north-east          2. citrus  2230  22241
8  north-east 3. pear and orange -9083 -86982
9  north-east         5. bananas  1656  24611
10 north-east  1. kiwi and lemon  -261  16506
11 north-east           6. plums  2612  17545
12 north-east  4. apple and lime   391 -14262
13       east          2. citrus  2200  25029
14       east 3. pear and orange -8380 -82234
15       east         5. bananas  1722  16560
16       east  1. kiwi and lemon  1920   7200
17       east           6. plums  2248  18576
18       east  4. apple and lime -1334  -6700
...

赞(0）回复(0）举报 2023-07-31

我来回答

R语言从跨多个组的不同数据集中的值中减去多个值的总和

1条答案

新增答案

旧答案

相关问题

热门标签

最新问答

R语言 从跨多个组的不同数据集中的值中减去多个值的总和

1条答案

新增答案

旧答案

相关问题

热门标签

最新问答

R语言从跨多个组的不同数据集中的值中减去多个值的总和