R: Merge 2 data frames and allocate rows from a non-join column equally

Posted by qf9go6mv on 2023-04-03 in Other

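I have two data frames of different lengths that share a common column. I need to combine them, but in a way that distributes the values of the non-common column evenly. So, if we have Users: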

User    Category
John    A
John    D
Will    A
Will    E
Bea     P
Bea     E
Sarah   A
Sarah   B

and Claims:

Category    Claim
A             1
A             2
B             3
B             4
D             5
D             6
D             7
D             8
D             9
D             10
D             11
D             12
A             13
A             14
A             15
A             16
A             17
A             18
E             19
E             20
E             21
E             22
E             23
E             24
E             25
E             26
E             27
E             28
P             29
P             30
P             31
P             32
P             33
P             34

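I would like to give each user an equal number of claims per category, e.g. the claims for category A would be split evenly among the 3 users who have category A.

Source: https://stackoverflow.com/questions/57398185/merge-2-data-frames-and-allocate-rows-from-a-non-join-column-equally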

3 Answers

gupuwyp2 #1

Here you go; the explanation is in the comments:

library(dplyr)

# `users` and `claims` are the two data frames from the question.

# Create a "user number": each user's position among the other users that
# have this category. When allocating claims we will then know, e.g., "this is
# user 2 out of 3 for category A, so assign the second third of the A claims".
users <- 
  users %>%
  arrange(Category) %>%
  group_by(Category) %>%
  mutate(user_number = row_number(), 
         total_users = n())

# Same thing for claims: this lets us identify the "second third of the A claims".
claims <- 
  claims %>%
  group_by(Category) %>%
  mutate(claim_number = row_number(),
         total_claims = n())

user_claims <- 
  users %>%
  # The full join gives all the claims of a category to every user in that category.
  # (dplyr >= 1.1.0 warns about this many-to-many match; passing
  # relationship = "many-to-many" silences the warning.)
  full_join(claims, by = "Category") %>%
  # Keep only the fraction of the claims that "belongs" to each user.
  filter(claim_number > total_claims * (user_number - 1) / total_users, 
         claim_number <= total_claims * user_number / total_users) %>%
  ungroup()
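To make the two inequalities in the filter() call concrete, here is a minimal base R sketch for a single category. The numbers are taken from the question's data (category A has 8 claims and 3 users), but the variable names `windows` and `claims_per_user` are only illustrative, and `claim_number` here is the position of a claim within its category, not the Claim id.

```r
# Illustrative single-category example: 8 claims shared by 3 users.
total_claims <- 8
total_users <- 3
claim_number <- seq_len(total_claims)

# For each user number, keep the claim numbers that fall inside that
# user's window, using the same two inequalities as the filter() call.
windows <- lapply(seq_len(total_users), function(user_number) {
  lower <- total_claims * (user_number - 1) / total_users
  upper <- total_claims * user_number / total_users
  claim_number[claim_number > lower & claim_number <= upper]
})
names(windows) <- paste0("user_", seq_len(total_users))

# Number of claims each user ends up with.
claims_per_user <- lengths(windows)
print(windows)
print(claims_per_user)
```

When the number of claims is not a multiple of the number of users, the windows differ in size by at most one claim, and every claim lands in exactly one window.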


bd1hkmkf #2

library(dplyr)

claim %>% 
  count(Category, name="Claims") %>% 
  left_join(user, ., by=c("Category")) %>% 
  add_count(Category) %>% 
  mutate(Claims = Claims / n) %>% 
  select(-n)

#> # A tibble: 8 x 3
#>   User  Category Claims
#>   <fct> <fct>     <dbl>
#> 1 John  A          2.67
#> 2 John  D          8   
#> 3 Will  A          2.67
#> 4 Will  E          5   
#> 5 Bea   P          6   
#> 6 Bea   E          5   
#> 7 Sarah A          2.67
#> 8 Sarah B          2

Data:

claim <- structure(list(Category = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
                                               3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 
                                               4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L), 
                                            .Label = c("A", "B", "D", "E", "P"), 
                                             class = "factor"), 
                        Claim = 1:34), 
                   class = "data.frame", row.names = c(NA, -34L))

user <- structure(list(User = structure(c(2L, 2L, 4L, 4L, 1L, 1L, 3L, 3L), 
                                       .Label = c("Bea", "John", "Sarah", "Will"), 
                                        class = "factor"), 
                       Category = structure(c(1L, 3L, 1L, 4L, 5L, 4L, 1L, 2L), 
                                           .Label = c("A", "B", "D", "E", "P"), 
                                            class = "factor")), 
                  class = "data.frame", row.names = c(NA, -8L))


0yg35tkg #3

Here is a data.table approach:

library(data.table)

Users <- data.table(User = rep(c("John","Will","Bea","Sarah"),each = 2), Category = c("A","D","A","E","P","E","A","B"))

set.seed(1)
Claims <- data.table(Category = sample(c("A","D","E","P"), replace = TRUE, 34), Claim = 1:34)

claims_joined <- merge(Users, Claims, by = "Category", allow.cartesian = TRUE)

claims_joined[, mod_base := uniqueN(User), by = .(Category)]
claims_joined <- claims_joined[, .(User = User[1L + (.GRP %% mod_base)][1]), by = .(Category, Claim)]

dcast(claims_joined, Category ~ User, fun.aggregate = length)
   Category Bea John Sarah Will
1:        A   0    2     3    3
2:        D   0   11     0    0
3:        E   3    0     0    4
4:        P   8    0     0    0

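In effect, you join every claim to every user in its category (merge() with allow.cartesian = TRUE; by default this is an inner join, not a full outer join), then use an index that increments with each claim. Taking that index modulo the number of users in the category picks a user for each claim in round-robin fashion.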
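The round-robin idea can be sketched on its own in base R, without the join. This is only an illustration under assumed inputs: `category_users` and `claim_ids` are hypothetical names holding the users and Claim ids of category A from the question, and the index here restarts at 1 for the category rather than using data.table's global `.GRP`.

```r
# Illustrative inputs: the 3 users and the 8 Claim ids of category A.
category_users <- c("John", "Will", "Sarah")
claim_ids <- c(1, 2, 13, 14, 15, 16, 17, 18)

# A running index per claim, taken modulo the number of users,
# cycles through the users in turn.
idx <- seq_along(claim_ids)
assigned <- category_users[1L + (idx - 1L) %% length(category_users)]

allocation <- data.frame(Claim = claim_ids, User = assigned)
counts <- table(factor(assigned, levels = category_users))
print(allocation)
print(counts)
```

As with any round-robin scheme, users earlier in the cycle receive the extra claim when the count does not divide evenly.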
