R: merge two data frames and distribute rows evenly from the non-join column

qf9go6mv  posted on 2023-04-03 in Other

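I have two data frames of different lengths that share a common column. I need to combine them, but in a way that distributes the values of the non-common column evenly. So if we have Users: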

User    Category
John    A
John    D
Will    A
Will    E
Bea     P
Bea     E
Sarah   A
Sarah   B

and Claims:

Category    Claim
A             1
A             2
B             3
B             4
D             5
D             6
D             7
D             8
D             9
D             10
D             11
D             12
A             13
A             14
A             15
A             16
A             17
A             18
E             19
E             20
E             21
E             22
E             23
E             24
E             25
E             26
E             27
E             28
P             29
P             30
P             31
P             32
P             33
P             34

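I want to give each user an equal number of claims within each category, i.e. the claims in a category such as A would be split evenly among its 3 users.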


gupuwyp2  1#

Here you go, with the explanation in the comments:

library("dplyr")

# `users` and `claims` below are the two data frames shown in the question.

# Creating a "user number" which is their ID among 
# other users having this category... When allocating claims, we'll know
# "this is user 2 out of 3 for category A, I need to assign the second third of the A claims."
users <- 
  users %>%
  group_by(Category) %>%
  arrange(Category) %>%
  mutate(user_number = 1:n(), 
         total_users = n())

# Same thing for claims: this will allow us to identify the "second third of A claims"
claims <- 
  claims %>%
  group_by(Category) %>%
  mutate(claim_number = 1:n(),
         total_claims = n())

user_claims <- 
  users %>%
  # full join gives all the XXX claims to everyone in category XXX
  # (the join column is named explicitly to avoid relying on the default)
  full_join(claims, by = "Category") %>%
  # We only keep the fraction of the claims that "belongs" to the user
  filter(claim_number > total_claims * (user_number - 1) / total_users, 
         claim_number <= total_claims * (user_number) / total_users)
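
The filter in the answer above keeps claim `c` for user `u` when `(u - 1) * total_claims / total_users < c <= u * total_claims / total_users`, which is the same as `u = ceiling(c * total_users / total_claims)`. A minimal base R sketch of that arithmetic for category A (8 claims, 3 users), where the helper name `claim_owner` is hypothetical and not part of the answer:

```r
# Hypothetical helper: which user number (1..total_users) owns each claim number
# under the "consecutive block" rule used by the filter above.
claim_owner <- function(claim_number, total_claims, total_users) {
  ceiling(claim_number * total_users / total_claims)
}

# Category A in the question has 8 claims shared by 3 users.
owner <- claim_owner(seq_len(8), total_claims = 8, total_users = 3)
print(owner)
# 1 1 2 2 2 3 3 3

# Number of claims each user number receives.
print(table(owner))
```

With counts that do not divide evenly, the blocks differ in size by at most one claim; here user 1 gets 2 claims and users 2 and 3 get 3 each.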

bd1hkmkf  2#

library(dplyr)

claim %>% 
  count(Category, name="Claims") %>% 
  left_join(user, ., by=c("Category")) %>% 
  add_count(Category) %>% 
  mutate(Claims = Claims / n) %>% 
  select(-n)

#> # A tibble: 8 x 3
#>   User  Category Claims
#>   <fct> <fct>     <dbl>
#> 1 John  A          2.67
#> 2 John  D          8   
#> 3 Will  A          2.67
#> 4 Will  E          5   
#> 5 Bea   P          6   
#> 6 Bea   E          5   
#> 7 Sarah A          2.67
#> 8 Sarah B          2

Data:

claim <- structure(list(Category = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
                                               3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 
                                               4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L), 
                                            .Label = c("A", "B", "D", "E", "P"), 
                                             class = "factor"), 
                        Claim = 1:34), 
                   class = "data.frame", row.names = c(NA, -34L))

user <- structure(list(User = structure(c(2L, 2L, 4L, 4L, 1L, 1L, 3L, 3L), 
                                       .Label = c("Bea", "John", "Sarah", "Will"), 
                                        class = "factor"), 
                       Category = structure(c(1L, 3L, 1L, 4L, 5L, 4L, 1L, 2L), 
                                           .Label = c("A", "B", "D", "E", "P"), 
                                            class = "factor")), 
                  class = "data.frame", row.names = c(NA, -8L))

0yg35tkg  3#

Here is a data.table approach:

library(data.table)

Users <- data.table(User = rep(c("John","Will","Bea","Sarah"),each = 2), Category = c("A","D","A","E","P","E","A","B"))

set.seed(1)
Claims <- data.table(Category = sample(c("A","D","E","P"), replace = TRUE, 34), Claim = 1:34)

claims_joined <- merge(Users, Claims, by = "Category", allow.cartesian = TRUE)

claims_joined[, mod_base := uniqueN(User), by = .(Category)]
claims_joined <- claims_joined[, .(User = User[1L + (.GRP %% mod_base)][1]), by = .(Category, Claim)]

dcast(claims_joined, Category ~ User, fun.aggregate = length)
   Category Bea John Sarah Will
1:        A   0    2     3    3
2:        D   0   11     0    0
3:        E   3    0     0    4
4:        P   8    0     0    0

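Essentially, you perform a cartesian join on Category (every user in a category is paired with every claim in it) and then set up an index that increments for each claim. You take that index modulo the number of users in the category and use the result to pick a user for each claim in that category, in round-robin fashion.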
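
The round-robin idea can be sketched in base R for a single category. This is an illustration only, not the answer's code: it uses a per-category counter instead of data.table's `.GRP`, and the variable names are made up for the example.

```r
# Users and claim IDs of category A, taken from the question's data.
users_in_a  <- c("John", "Will", "Sarah")
claims_in_a <- c(1, 2, 13, 14, 15, 16, 17, 18)

# Zero-based claim counter modulo the number of users, shifted back to
# R's one-based indexing: 1, 2, 3, 1, 2, 3, 1, 2
idx <- (seq_along(claims_in_a) - 1) %% length(users_in_a) + 1

assigned <- data.frame(Claim = claims_in_a,
                       User  = users_in_a[idx],
                       stringsAsFactors = FALSE)
print(assigned)

# Claims per user: John 3, Sarah 2, Will 3
print(table(assigned$User))
```

Unlike the block-wise split, round-robin interleaves the claims, but the per-user counts still differ by at most one.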
