R match between two comma-separated strings

Posted by oalqel3c on 2023-03-31 in Other


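I am trying to find an elegant way to find matches between two character columns in a data frame. The tricky part is that either string can contain a comma-separated list, and if any member of one list matches any member of the other list, the whole entry should count as a match. I'm not sure how well I've explained that, so here are sample data and the expected output:
Alt1: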

AT
A
G
CGTCC,AT
CGC

Alt2:

AA
A
GG
AT,GGT
CG

Expected match for each row:

Row 1 = none
Row 2 = A
Row 3 = none
Row 4 = AT
Row 5 = none

Non-working solutions:
First attempt: merge the entire data frames on the required columns, then match the alt columns shown above:

match1 = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end",  "ref")))
matches = unique(match1[unlist(sapply(match1$Alt1, grep, match1$Alt2, fixed=TRUE)),])

Second approach, using the findOverlaps function from VariantAnnotation/GRanges:

findOverlaps(ranges(vcf1), ranges(vcf2))

Any suggestions would be greatly appreciated! Thanks!

Solution: thanks to @Marat Talipov's answer below, the following solution works for comparing two comma-separated strings:

> ##read in edited kaviar vcf and human ref
> ref <- readVcfAsVRanges("ref.vcf.gz", humie_ref)
Warning message:
In .vcf_usertag(map, tag, ...) :
  ScanVcfParam ‘geno’ fields not present: ‘AD’

> ##rename chromosomes to match with vcf files
> ref <- renameSeqlevels(ref, c("1"="chr1"))

> ##################################
> ## Gather VCF files to process  ## 
> ##################################
> ##data frame *.vcf.gz files in directory path
> vcf_path <- data.frame(path=list.files(vcf_dir, pattern="\\.vcf\\.gz$", full.names=TRUE))

> ##read in everything but sample data for speediness
> vcf_param = ScanVcfParam(samples=NA)
> vcf <- readVcfAsVRanges("test.vcf.gz", humie_ref, param=vcf_param)

> #################
> ## Match SNP's ##
> #################
> ##create data frames of info to match on
> vcf.df = data.frame(chr = as.character(seqnames(vcf)), start = start(vcf), end = end(vcf),
+                     ref = as.character(ref(vcf)), alt = alt(vcf), stringsAsFactors = FALSE)
> ref.df = data.frame(chr = as.character(seqnames(ref)), start = start(ref), end = end(ref),
+                     ref = as.character(ref(ref)), alt = alt(ref), stringsAsFactors = FALSE)
> 
> ##merge based on all positional fields except vcf
> col_match = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end", "ref")))

> library(stringi)
> ##split each alt column by comma and bind together
> M1 <- stri_list2matrix(sapply(col_match$alt.x,strsplit,','))
> M2 <- stri_list2matrix(sapply(col_match$alt.y,strsplit,','))
> M <- rbind(M1,M2)

> ##compare results
> result <- apply(M,2,function(z) unique(na.omit(z[duplicated(z)])))

> ##add results column to col_match df for checking/subsetting
> col_match$match = result
> head(col_match)
   chr    start      end ref alt.x alt.y match
1 chr1 39998059 39998059   A     G     G     G
2 chr1 39998059 39998059   A     G     G     G
3 chr1 39998084 39998084   C     A     A     A
4 chr1 39998084 39998084   C     A     A     A
5 chr1 39998085 39998085   G     A     A     A
6 chr1 39998085 39998085   G     A     A     A
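
One caveat with assigning `result` directly: apply() returns a plain vector only when every column yields exactly one match, and a list when the counts differ, so the type of the `match` column depends on the data. Below is a minimal sketch of a more predictable way to flag and subset matching rows, assuming a merged data frame shaped like `col_match`. The toy values and the names `shared`, `has_match` and `matched` are made up for illustration and are not part of the original solution.

```r
# Hypothetical stand-in for the merged data frame above (values are made up).
col_match <- data.frame(
    chr   = "chr1",
    start = c(100L, 200L, 300L),
    alt.x = c("G", "A,T", "C"),
    alt.y = c("G", "T,C", "A"),
    stringsAsFactors = FALSE
)

# Shared alleles per row, always returned as a list, so a row may hold
# zero, one, or several matches without changing the result type.
shared <- Map(
    function(x, y) intersect(strsplit(x, ",", fixed = TRUE)[[1]],
                             strsplit(y, ",", fixed = TRUE)[[1]]),
    col_match$alt.x, col_match$alt.y
)

# Flag rows with at least one shared allele, then subset on the flag.
col_match$has_match <- unname(lengths(shared) > 0)
matched <- col_match[col_match$has_match, ]
```

Here `matched` keeps the first two toy rows (shared alleles "G" and "T") and drops the third.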


Source: https://stackoverflow.com/questions/28590469/r-match-between-two-comma-separated-strings

4 Answers


uqzxnwby1#

If the input lists are of equal length and you want to compare the list elements pairwise, you could use the following solution:

library(stringi)

M1 <- stri_list2matrix(sapply(Alt1,strsplit,','))
M2 <- stri_list2matrix(sapply(Alt2,strsplit,','))
M <- rbind(M1,M2)

result <- apply(M,2,function(z) unique(na.omit(z[duplicated(z)])))

Sample input:

Alt1 <- list('AT','A','G','CGTCC,AT','CGC','GG,CC')
Alt2 <- list('AA','A','GG','AT,GGT','CG','GG,CC')

Output:

# [[1]]
# character(0)
# 
# [[2]]
# [1] "A"
# 
# [[3]]
# character(0)
# 
# [[4]]
# [1] "AT"
# 
# [[5]]
# character(0)
# 
# [[6]]
# [1] "GG" "CC"
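
Note that the duplicated() approach also reports a value that is repeated within a single string, even when the other string does not contain it. If that matters for your data, a set intersection avoids it. This is a minimal base-R sketch, not part of the answer above; the helper names `split_alleles` and `shared` and the extra seventh row ("A,A" vs "C") are hypothetical, added only to show the difference.

```r
# Sample data from the answer, plus one hypothetical extra row.
Alt1 <- list('AT', 'A', 'G', 'CGTCC,AT', 'CGC', 'GG,CC', 'A,A')
Alt2 <- list('AA', 'A', 'GG', 'AT,GGT', 'CG', 'GG,CC', 'C')

# Split one comma-separated string into its members.
split_alleles <- function(x) strsplit(x, ",", fixed = TRUE)[[1]]

# Members present in both strings; character(0) when there are none.
shared <- function(x, y) intersect(split_alleles(x), split_alleles(y))

result <- Map(shared, Alt1, Alt2)
result[[4]]  # "AT"
result[[7]]  # character(0), whereas duplicated() would report "A" here
```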


iyfjxgzm2#

Sticking with the stringi package, you could do something like this, using the Alt1 and Alt2 data from Marat's answer.

library(stringi)

f <- function(x, y) {
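    # unlist() instead of simplify = TRUE: the matrix form pads shorter rows with "",
    # and those padding strings could be reported as duplicates
    ssf <- unlist(stri_split_fixed(c(x, y), ","))
    if(any(sd <- stri_duplicated(ssf))) ssf[sd] else NA_character_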
}

Map(f, Alt1, Alt2)
# [[1]]
# [1] NA
# 
# [[2]]
# [1] "A"
# 
# [[3]]
# [1] NA
# 
# [[4]]
# [1] "AT"
# 
# [[5]]
# [1] NA
# 
# [[6]]
# [1] "GG" "CC"

Or in base R, we could use scan() to split the strings on commas.

g <- function(x, y, sep = ",") {
    s <- scan(text = c(x, y), what = "", sep = sep, quiet = TRUE)
    s[duplicated(s)]
}
Map(g, Alt1, Alt2)
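
The output of g() is not shown above. To get something close to the "Row n = ..." form asked for in the question, the per-row results can be collapsed into one string each. This is a rough sketch under a few assumptions: the inputs are plain character vectors rather than lists, and the helper `label` and the "none" placeholder are my own choices for illustration.

```r
# The question's sample data as character vectors (as in data frame columns).
Alt1 <- c("AT", "A", "G", "CGTCC,AT", "CGC")
Alt2 <- c("AA", "A", "GG", "AT,GGT", "CG")

# Same function as above: split both strings and keep repeated members.
g <- function(x, y, sep = ",") {
    s <- scan(text = c(x, y), what = "", sep = sep, quiet = TRUE)
    s[duplicated(s)]
}

# Collapse each row's matches into one string; "none" marks an empty result.
label <- function(m) if (length(m) > 0) paste(m, collapse = ",") else "none"

matches <- vapply(Map(g, Alt1, Alt2), label, character(1), USE.NAMES = FALSE)
data.frame(Alt1, Alt2, match = matches)
```

vapply() with character(1) guarantees exactly one string per row, so the result can be stored as an ordinary data frame column.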


bvjxkvbb3#

You could do something like this:

Alt1 <- list('AT','A','G',c('CGTCC','AT'),'CGC')
Alt2 <- list('AA','A','GG',c('AT','GGT'),'CG')
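# make sure each comma-separated entry is stored as a character vector inside the list

matchlist <- vector("list", length(Alt1))
for (i in seq_along(Alt1)) {
  # elements shared by row i of both lists
  common <- intersect(Alt1[[i]], Alt2[[i]])
  # one label per row; ifelse() would return one value per element of Alt1[[i]]
  matchlist[[i]] <- if (length(common) > 0) {
    paste("Row", i, "=", paste(common, collapse = ","))
  } else {
    paste("Row", i, "= none")
  }
}
print(matchlist)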
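
This answer needs each entry stored as a character vector, while the question has comma-separated strings. One way to get from one form to the other is strsplit(), sketched here with the question's sample data. The names `alt1_raw` and `alt2_raw` are placeholders for the actual data frame columns.

```r
# Comma-separated strings as they would appear in the data frame columns.
alt1_raw <- c("AT", "A", "G", "CGTCC,AT", "CGC")
alt2_raw <- c("AA", "A", "GG", "AT,GGT", "CG")

# strsplit() returns a list with one character vector per input string,
# which is the structure the loop above expects.
Alt1 <- strsplit(alt1_raw, ",", fixed = TRUE)
Alt2 <- strsplit(alt2_raw, ",", fixed = TRUE)

str(Alt1[[4]])
```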


ubbxdtey4#

library(stringr)

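Example data frame

df <- data.frame(col1 = c("apple,banana,orange", "grape,pear", "cherry,kiwi"), col2 = c("banana,kiwi", "grape,pear", "orange,apple"), stringsAsFactors = FALSE)

Function to find the matching values

find_matches <- function(x, y) {
  x_vec <- str_split(x, ",")[[1]]
  y_vec <- str_split(y, ",")[[1]]
  intersect(x_vec, y_vec)
}

Apply the function to the columns and store the result in a new column (SIMPLIFY = FALSE keeps one list element per row, even when every row has the same number of matches)

df$matches <- mapply(find_matches, df$col1, df$col2, SIMPLIFY = FALSE, USE.NAMES = FALSE)

View the result

df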
