Select rows from a Dataframe based on a pattern formed by consecutive values in a column

Posted by f3temu5u on 2022-12-20 in Other

I'm average at R and would like some help with the following operation.
Suppose I have the following data frame:

>df
    ID   Label
    P1   M
    P1   S
    P2   M
    P2   M
    P2   S
    P3   M
    P3   S
    P3   M
    P4   S
    P4   M
    P5   M
    P5   M
    P5   S

For each ID, I would like to select the rows where the variable Label occurs in a specific sequence.
For the pattern "MS", the expected output is

    ID   Label
    P1   M
    P1   S
    P2   M
    P2   S
    P3   M
    P3   S

For the pattern "MMS", the expected output would be

    ID   Label
    P2   M
    P2   M
    P2   S
    P5   M
    P5   M
    P5   S

For the pattern "SM", the expected output is:

    ID   Label
    P3   S
    P3   M
    P4   S
    P4   M

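Please note that the data I'm working with has many rows, and the solution needs to work for patterns of arbitrary length (e.g., "MSS", "SM", "MMSSMS"). I would greatly appreciate your help.
EDIT: I have updated the question (the example data frame and the example output for the pattern "MMS"). I should add that I want the pattern matching to happen after grouping the data by the ID variable, so that patterns are found within each ID group. Sorry for not making this clear the first time.
FINAL EDIT: The answers from @akrun, @boski, and @tmfmnk all worked for me. The solutions from @boski and @akrun ran faster (about 2-10 seconds on 400k rows) than @tmfmnk's solution (about 29 seconds on 400k rows). I recommend that readers look at all three solutions.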

Source: https://stackoverflow.com/questions/54749165/select-rows-from-a-dataframe-based-on-a-pattern-formed-by-consecutive-values-in

3 Answers

1mrurvl1 #1

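One option is to compare the lead values and get the row indices, grouped by "ID":

library(data.table)
# Within each ID, find positions where Label and its next two values
# spell "M", "M", "S", then collect the global row indices (.I) of each run
i1 <- unique(setDT(df)[, lapply(which(Reduce(`&`, 
  Map(`==`, shift(Label, n = 0:2, type = "lead"), c("M", "M", "S")))), 
       function(i) .I[i:(i+2)]) , by = ID]$V1)
df[i1]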
#    ID Label
#1: P2     M
#2: P2     M
#3: P2     S
#4: P5     M
#5: P5     M
#6: P5     S

Data

df <- structure(list(ID = c("P1", "P1", "P2", "P2", "P2", "P3", "P3", 
"P3", "P4", "P4", "P5", "P5", "P5"), Label = c("M", "S", "M", 
"M", "S", "M", "S", "M", "S", "M", "M", "M", "S")), 
class = "data.frame", row.names = c(NA, -13L))
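
Since the question asks for patterns of arbitrary length, the same lead-comparison idea can be parameterised. What follows is only a sketch and not part of the original answer: `select_pattern_dt`, `pat` and `k` are hypothetical names, and it assumes single-character labels in a column called `Label`. It indexes `.I` directly instead of going through `lapply()`, which should also cope with more than one match inside a single ID.

```r
library(data.table)

df <- data.frame(
  ID = c("P1", "P1", "P2", "P2", "P2", "P3", "P3", "P3", "P4", "P4", "P5", "P5", "P5"),
  Label = c("M", "S", "M", "M", "S", "M", "S", "M", "S", "M", "M", "M", "S"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows that take part in a match of `pattern`
# within each ID.
select_pattern_dt <- function(df, pattern) {
  dt <- as.data.table(df)                       # copy, leaves `df` untouched
  pat <- strsplit(pattern, "", fixed = TRUE)[[1]]
  k <- length(pat)
  idx <- dt[, {
    # One lead-shifted copy of Label per pattern position (0, 1, ..., k - 1)
    leads <- shift(Label, n = seq_len(k) - 1L, type = "lead")
    # shift() returns a plain vector when `n` has length one
    if (!is.list(leads)) leads <- list(leads)
    # A start position matches when every shifted copy equals its pattern letter
    starts <- which(Reduce(`&`, Map(`==`, leads, pat)))
    # Expand each start into the k global row indices it covers
    .I[rep(starts, each = k) + seq_len(k) - 1L]
  }, by = ID]$V1
  dt[sort(unique(idx))]
}

select_pattern_dt(df, "MMS")
```

Calling it with "MS" or "SM" reuses the same code path; only the number of shifted copies changes.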

jv2fixgn #2

You can try using gregexpr(). First paste all the labels together and find the starting positions of the pattern you are looking for.

> df
   ID Label
1  P1     M
2  P1     S
3  P2     M
4  P2     M
5  P2     S
6  P3     M
7  P3     S
8  P3     S
9  P4     S
10 P4     M
11 P5     M
12 P5     M
13 P5     S

EDIT

My previous solution did not retrieve the whole pattern (just the start).

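pattern <- "MS"
# Paste all labels into one string and find where the pattern starts
starts <- gregexpr(pattern = pattern, paste(df$Label, collapse = ""))[[1]]
positions <- as.vector(sapply(starts, function(x) {
  # Row positions covered by this match
  s <- seq(x, x + nchar(pattern) - 1)
  # Keep the match only if all of its rows share one ID
  if (all(df$ID[s] == df$ID[x])) {
    return(s)
  } else {
    return(rep(NA, nchar(pattern)))
  }
}))
positions <- positions[which(!is.na(positions))]

df[positions, ]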
   ID Label
1  P1     M
2  P1     S
4  P2     M
5  P2     S
6  P3     M
7  P3     S
12 P5     M
13 P5     S

pattern="MMS"
   ID Label
3  P2     M
4  P2     M
5  P2     S
11 P5     M
12 P5     M
13 P5     S

pattern="SM"
   ID Label
9  P4     S
10 P4     M

unftdfkk #3

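A base R solution to the original question could be:

# Pattern length (the variable shadows nchar(), but calls still find the function)
nchar <- nchar("MS")
# Start positions of every match in the labels pasted into one string
x <- grepRaw("MS", paste(df$Label, collapse = ""), all = TRUE)
# Expand each start into the row positions covered by the match
y <- rep(x, each = nchar) + 0:(nchar - 1)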

df[1:nrow(df) %in% y, ]

  ID Label
1 P1     M
2 P1     S
4 P2     M
5 P2     S
6 P3     M
7 P3     S

nchar <- nchar("SM")
x <- grepRaw("SM", paste(df$Label, collapse = ""), all = TRUE)
y <- rep(x, each = nchar) + 0:(nchar - 1)

df[1:nrow(df) %in% y, ]

   ID Label
2  P1     S
3  P2     M
5  P2     S
6  P3     M
9  P4     S
10 P4     M

Or written with dplyr:

nchar <- nchar("MS")
df %>%
 filter(row_number() %in% c(rep(grepRaw("MS", paste(Label, collapse = ""), all = TRUE), 
            each = nchar) + 0:(nchar - 1)))

   ID Label
1  P1     M
2  P1     S
3  P2     M
4  P2     S
5  P3     M
6  P3     S
7  P3     M
8  P4     S
9  P5     M
10 P5     S

To also address the edit to the question:

nchar <- nchar("MS")
df %>%
 group_by(ID) %>%
 filter(row_number() %in% c(rep(grepRaw("MS", paste(Label, collapse = ""), all = TRUE), 
            each = nchar) + 0:(nchar - 1)))

  ID    Label
  <fct> <fct>
1 P1    M    
2 P1    S    
3 P2    M    
4 P2    S    
5 P3    M    
6 P3    S    
7 P5    M    
8 P5    S
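
To reuse the grouped version for other patterns, the pattern can be passed in as an argument. A hedged sketch: `filter_pattern` is a hypothetical wrapper name, `fixed = TRUE` is an addition so that the pattern is treated literally rather than as a regular expression, and the data are those from the question. Like the code above, it relies on single-character labels and on `grepRaw()` reporting non-overlapping matches.

```r
library(dplyr)

df <- data.frame(
  ID = c("P1", "P1", "P2", "P2", "P2", "P3", "P3", "P3", "P4", "P4", "P5", "P5", "P5"),
  Label = c("M", "S", "M", "M", "S", "M", "S", "M", "S", "M", "M", "M", "S"),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper around the grouped filter, with the pattern as a parameter
filter_pattern <- function(data, pattern) {
  k <- nchar(pattern)
  data %>%
    group_by(ID) %>%
    filter(row_number() %in%
             (rep(grepRaw(pattern, paste(Label, collapse = ""),
                          all = TRUE, fixed = TRUE),
                  each = k) + 0:(k - 1))) %>%
    ungroup()
}

filter_pattern(df, "MMS")
```

The result is an ungrouped tibble with rows in their original order, so it can be passed on to further dplyr steps.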
