How to add a column to a data.table in R based on interval matching? [duplicate]

Posted by mpbci0fu on 2023-05-26 in Other

This question already has answers here:

Overlap join with start and end positions (5 answers)
Closed 3 days ago.
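I have two data tables, A and B. Table A has two columns, "chrom" and "pos", and B holds a set of intervals read from a BED file. I want to add a new column named "select_status" to data.table A. If a row's "pos" falls within any interval in B on the same "chrom", the corresponding value of "select_status" should be TRUE; otherwise it should be FALSE.
Here is an example that illustrates the data structure: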

library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# I need to add a column select_status to A and set it to TRUE if pos falls in an interval of B
# I want something like this, but it does not work

A[, select_status := any(pos >= B$start & pos <= B$end & chrom == B$chrom)]

A[, select_status := sapply(.SD, function(x) any(x >= B$start & x <= B$end)), .SDcols = c("pos"), by = .(chrom)]

A[is.na(select_status), select_status := FALSE]

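My attempts do not work because they do not compare each position only against the rows of B on the matching chromosome, so the position chr3 399 also ends up set to TRUE.
I know I could use apply to iterate over A row by row and use each row as a filter on B to get this result, but that is slow when the data has many rows, so I would like to know whether there is a more concise way.
This is the result I expect: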
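For reference, the row-by-row approach described above might look like the following minimal sketch. The helper name `in_any_interval` is hypothetical, and both interval ends are treated as inclusive, following the comparison used in the attempts above. It gives the desired column on the sample data, but it performs one pass over B for every row of A, which is the cost the question wants to avoid.

```r
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# Hypothetical helper: checks one (chrom, pos) pair against every interval
# in B, so the total work is roughly nrow(A) * nrow(B) comparisons.
in_any_interval <- function(ch, p) {
  any(B$chrom == ch & B$start <= p & p <= B$end)
}

# mapply calls the helper once per row of A; USE.NAMES = FALSE keeps the
# result an unnamed logical vector.
A[, select_status := mapply(in_any_interval, chrom, pos, USE.NAMES = FALSE)]
```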

A
   chrom pos select_status
1:  chr1 100         FALSE
2:  chr2 200          TRUE
3:  chr3 300          TRUE
4:  chr3 391          TRUE
5:  chr3 399         FALSE

avwztpqn1#

Here is one approach you could consider:

library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))

B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

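# Expand every interval in B into the positions it covers.
# Note: this pools intervals from all chromosomes and ignores chrom, so it
# is only correct when positions cannot match an interval on another chromosome.
X_Val <- unlist(Map(seq, B$start, B$end))
A[, select_status := pos %in% X_Val]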

A
   chrom pos select_status
1:  chr1 100         FALSE
2:  chr2 200          TRUE
3:  chr3 300          TRUE
4:  chr3 391          TRUE
5:  chr3 399         FALSE
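
If the chromosome has to be taken into account, as the question asks, a non-equi update join in data.table is one possible alternative. The following is a minimal sketch rather than a definitive solution: it assumes both interval ends are inclusive (matching the `pos >= start & pos <= end` comparison in the question), which may need adjusting for BED files that use half-open intervals.

```r
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# Start with FALSE everywhere, then flip to TRUE only for rows that match.
A[, select_status := FALSE]

# Non-equi update join: a row of A matches a row of B when chrom is equal
# and start <= pos <= end. Matching rows of A are updated by reference.
A[B, on = .(chrom, pos >= start, pos <= end), select_status := TRUE]

print(A)
```

Because the join condition includes `chrom`, a position is only compared against intervals on its own chromosome, and no interval has to be expanded into individual positions.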
