How to number/label a data table by group number from group_by?

Posted by yxyvkwin on 2023-04-27 in Other


I have a tbl_df on which I want to group_by(u, v), so that there is one group for each distinct integer combination of (u, v) observed in the data.

**EDIT:** this was subsequently resolved by the addition of group_indices() in dplyr 0.4.0 (that function is now deprecated).

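a) I then want to assign each distinct group an arbitrary distinct number label = 1, 2, 3... For example, the combination (u, v) == (2, 3) could get label 1, (1, 3) could get 2, and so on. How can I do this with a single mutate(), without a three-step summarize-and-self-join?
dplyr has a neat function n(), but that gives the number of elements within each group, not the overall number of the group. In data.table this would simply be called .GRP.
b) What I actually want to assign is a string/character label ('A', 'B', ...). Numbering the groups by integers is good enough, though, because I can then use integer_to_label(i) as shown below. Unless there is a clever way to merge the two steps? But don't sweat this part.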

library(dplyr)

set.seed(1234)

# Helper fn for mapping integer 1..26 to character label
integer_to_label <- function(i) { substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",i,i) }

df <- tibble::as_tibble(data.frame(u=sample.int(3,10,replace=TRUE), v=sample.int(4,10,replace=TRUE)))

# Want to label/number each distinct group of unique (u,v) combinations
df %>% group_by(u,v) %>% mutate(label = n()) # WRONG: n() is the number of elements within each group, not the number of the group itself

   u v
1  2 3
2  1 3
3  1 2
4  2 3
5  1 2
6  3 3
7  1 3
8  1 2
9  3 1
10 3 4

KLUDGE 1: could do df %>% group_by(u,v) %>% summarize(label = n()) , then self-join

Source: https://stackoverflow.com/questions/23026145/how-to-number-label-data-table-by-group-number-from-group-by

6 Answers


ui7jx7zq1#

For current dplyr versions (1.0.0 and later)

Since version 1.0.0, dplyr has a new function, cur_group_id():

df %>% 
    group_by(u, v) %>% 
    mutate(label = cur_group_id()) ...
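For part (b) of the question, the integer id can probably be turned into a letter inside the same mutate(). The following is only a minimal sketch, assuming dplyr 1.0.0 or later and no more than 26 groups. The column names label_id and label and the small data set are illustrative, not taken from the question:

```r
library(dplyr)

# Small illustrative data set (not the question's random sample)
df <- tibble(u = c(2L, 1L, 1L, 2L, 3L),
             v = c(3L, 3L, 2L, 3L, 1L))

labelled <- df %>%
  group_by(u, v) %>%
  mutate(label_id = cur_group_id(),      # integer id of the current group
         label = LETTERS[label_id]) %>%  # built-in LETTERS maps 1..26 to "A".."Z"
  ungroup()

labelled
```

Note that the ids follow the sorted order of the (u, v) keys, not the order in which the groups first appear, so here (1, 2) gets "A" even though it is the third row.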

For earlier dplyr versions (before 1.0.0; the function is deprecated but still available as of 1.0.10)

dplyr has a group_indices() function that you can use like this:

df %>% 
    mutate(label = group_indices(., u, v)) %>% 
    group_by(label) ...


wgeznvg72#

Another approach, using data.table, would be

require(data.table)
setDT(df)[,label:=.GRP, by = c("u", "v")]

This results in:

    u v label
 1: 2 1     1
 2: 1 3     2
 3: 2 1     1
 4: 3 4     3
 5: 3 1     4
 6: 1 1     5
 7: 3 2     6
 8: 2 3     7
 9: 3 2     6
10: 3 4     3
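As the output above suggests, .GRP with by numbers the groups in the order they first appear, unlike dplyr's sorted-key numbering. A small sketch with made-up data (the table dt is illustrative, and the data.table package is assumed to be installed):

```r
library(data.table)

# Illustrative data, not the question's random sample
dt <- data.table(u = c(2L, 1L, 1L, 2L, 3L),
                 v = c(3L, 3L, 2L, 3L, 1L))

# .GRP is the running group counter; with `by`, groups are numbered
# in the order in which they first appear in the data
dt[, label := .GRP, by = .(u, v)]

dt
```

Here (2, 3) is the first group seen, so it gets label 1 on both rows where it occurs.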


bcs8qyzn3#

Since dplyr 1.0.0, the function cur_group_id() has superseded the older function group_indices() for this purpose.
Call it on a grouped data frame:

df %>%
  group_by(u, v) %>%
  mutate(label = cur_group_id())

# A tibble: 10 x 3
# Groups:   u, v [6]
       u     v label
   <int> <int> <int>
 1     2     2     4
 2     2     2     4
 3     1     3     2
 4     3     2     6
 5     1     4     3
 6     1     2     1
 7     2     2     4
 8     2     4     5
 9     3     2     6
10     2     4     5


5jvtdoz24#

Updated answer

get_group_number = function(){
    i = 0
    function(){
        i <<- i+1
        i
    }
}
group_number = get_group_number()
df %>% group_by(u,v) %>% mutate(label = group_number())

You can also consider the following, slightly less readable version

group_number = (function(){i = 0; function() i <<- i+1 })()
df %>% group_by(u,v) %>% mutate(label = group_number())

Using the iterators package

library(iterators)

counter = icount()
df %>% group_by(u,v) %>% mutate(label = nextElem(counter))
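All three variants appear to rely on the same mechanism: the label expression is evaluated once per group, and each evaluation advances a counter that keeps its state between calls. A base-R sketch of that closure mechanism on its own (the names make_counter and next_id are hypothetical, chosen for illustration):

```r
# Factory that returns a counter function with its own private state
make_counter <- function() {
  i <- 0L
  function() {
    i <<- i + 1L   # `<<-` updates `i` in the enclosing environment
    i
  }
}

next_id <- make_counter()
first_three <- c(next_id(), next_id(), next_id())   # each call advances the counter

other <- make_counter()   # a fresh counter has independent state
restarted <- other()      # starts again from 1
```

Because the state persists, a counter has to be recreated before it is reused in a second pipeline, otherwise the numbering continues where the previous run stopped.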


f8rj6qna5#

Updating my answer with three different ways:
A) A neat non-dplyr solution using interaction(u,v):

> df$label <- factor(interaction(df$u,df$v, drop=T))
 [1] 1.3 2.3 2.2 2.4 3.2 2.4 1.2 1.2 2.1 2.1
 Levels: 2.1 1.2 2.2 3.2 1.3 2.3 2.4

> match(df$label, levels(df$label)[ rank(unique(df$label)) ] )
 [1] 1 2 3 4 5 4 6 6 7 7
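A simpler variant of the same idea, sketched in base R on made-up vectors (u, v, key and the two result names are illustrative): the factor codes give ids in level order, while match() against unique() gives ids in order of first appearance.

```r
u <- c(2L, 1L, 1L, 2L, 3L)
v <- c(3L, 3L, 2L, 3L, 1L)

# One factor level per observed (u, v) combination; unused combinations dropped
key <- interaction(u, v, drop = TRUE)

# Ids in level order (levels are ordered with the first factor varying fastest)
by_level <- as.integer(key)

# Ids in order of first appearance in the data
by_appearance <- match(key, unique(key))
```

Either vector can be assigned to df$label directly, so no grouped mutate() is needed for this route.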

B) Making Randy's neat quick-and-dirty generator-function answer more compact:

get_next_integer = function(){
  i = 0
  function(u,v){ i <<- i+1 }
}
get_integer = get_next_integer() 

df %>% group_by(u,v) %>% mutate(label = get_integer())

C) Here is also a one-liner using a generator function that abuses global variable assignment:

i <- 0
generate_integer <- function() { return(assign('i', i+1, envir = .GlobalEnv)) }

df %>% group_by(u,v) %>% mutate(label = generate_integer())

rm(i)


5tmbdcev6#

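I do not have enough reputation to comment, so I am posting an answer instead.
The solution using factor() is a nice one, but it has the drawback that group numbers are assigned only after factor() has sorted its levels alphabetically. The same behaviour occurs with dplyr's group_indices(). Perhaps you would like the group numbers to run from 1 to n following the current order of the groups. In that case, you can use: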

my_tibble %>% mutate(group_num = as.integer(factor(group_var, levels = unique(.$group_var))) )
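To make the difference concrete, here is a small base-R sketch on a made-up character vector (group_var and the two result names are illustrative only):

```r
group_var <- c("pear", "apple", "pear", "fig", "apple")

# Default factor(): levels are sorted, so ids follow alphabetical order
sorted_ids <- as.integer(factor(group_var))

# levels = unique(...): levels keep the order of first appearance
appearance_ids <- as.integer(factor(group_var, levels = unique(group_var)))
```

With the default, "pear" gets id 3 because it sorts last. With levels = unique(group_var) it gets id 1 because it appears first.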
