R: Add a column to a data frame if a position falls within an interval

Posted by yc0p9oo0 on 2023-04-03 in Other

I have 2 files:
"query.tab"

grp   pos
1   10
1   45
2   6
3   12

"data.tab"

grp   start   end   info
1   1   15   blue
1   23   60   red
2   1   40   green
3   20   30   black

I am trying to add $info from the file "data" to the file "query", but only if
1. $grp in "query" matches $grp in "data"
2. $pos from query.tab falls between $start and $end from data.tab.
The goal is to get:

grp  pos   info
   1    10    blue
   1    45    red
   2    6     green
   3    12    NA

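(Note: for non-overlapping positions, $info can be 'NA' or blank, it doesn't matter. It should not happen anyway.)
So far I have been using findOverlaps(), but I am having trouble understanding how to work with its output: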

library(IRanges)

query =data.frame(grp = as.numeric(c("1", "1", "2", "3")), pos = as.numeric(c("10", "45", "6", "12")))
data = data.frame(grp=as.numeric(c("1", "1", "2", "3")), start=as.numeric(c("1", "23", "1", "20")), end=as.numeric(c("15", "60", "40", "30")), info=c("blue", "red", "green", "black"))

query.ir <- IRanges(start = query$pos, end = query$pos, names = query$grp)
data.ir <- IRanges(start = data$start, end = data$end, names = data$grp)

o <- findOverlaps(query.ir, data.ir, type = "within")
o
Hits object with 7 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           3
  [2]         1           1
  [3]         2           2
  [4]         3           3
  [5]         3           1
  [6]         4           3
  [7]         4           1
  -------

queryLength: 4 / subjectLength: 4

Can I retrieve the $info field from this output? Or am I on the wrong track?
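One way to read the Hits object, sketched under some assumptions: findOverlaps() compares ranges only, so the names passed to IRanges() do not restrict matches to the same $grp, which is why 7 hits are reported. The sketch below pulls the row indices out with queryHits() and subjectHits() and keeps only the pairs whose $grp agrees. The helper names q, s and same.grp are made up for illustration, and stringsAsFactors = FALSE is added so that $info is a character column on older R versions too.

```r
library(IRanges)  # Bioconductor package, as used in the question

query <- data.frame(grp = c(1, 1, 2, 3), pos = c(10, 45, 6, 12))
data  <- data.frame(grp = c(1, 1, 2, 3),
                    start = c(1, 23, 1, 20),
                    end = c(15, 60, 40, 30),
                    info = c("blue", "red", "green", "black"),
                    stringsAsFactors = FALSE)

# Each query position becomes a width-1 range
query.ir <- IRanges(start = query$pos, end = query$pos)
data.ir  <- IRanges(start = data$start, end = data$end)

o <- findOverlaps(query.ir, data.ir, type = "within")

# Row indices into `query` and `data` for every overlapping pair
q <- queryHits(o)
s <- subjectHits(o)

# Keep only the pairs that also belong to the same group
same.grp <- query$grp[q] == data$grp[s]

# Start with NA everywhere, then fill in the matched rows
query$info <- NA_character_
query$info[q[same.grp]] <- data$info[s[same.grp]]

query
```

If a position fell inside several intervals of the same group, the last matching interval would overwrite the earlier ones in this sketch.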

Answer 1, by vjhs03f7

Based on the expected output you posted, I think this will work. It could also be condensed, but I prefer this version to avoid any confusion:

#merge two data.frame to get info for all groups and positions
df <- merge(query.tab,data.tab, by = "grp")

#get the rows that are not duplicated but may be non-overlapping
#first non-duplicates
#second non-overlapping
#third remove info and replace by NA as it's non-overlapping rows
df.nas <- df[!(duplicated(df[,c(1,2)]) | duplicated(df[,c(1,2)], fromLast = TRUE)), ]
df.nas <- df.nas[df.nas$pos>df.nas$end | df.nas$pos<df.nas$start, ]
df.nas$info <- NA

#only keep the rows that are overlapping (position between start and end)
df.cnd <- df[df$pos<=df$end & df$pos>=df$start, ]

#merge overlapped and non-overlapped data.frames
df.mrg <- rbind(df.cnd, df.nas)

#remove start and end columns and sort based on group and position
df.final <- df.mrg[with(df.mrg,order(grp, pos)),c(1,2,5)]

#output:
df.final

#   grp pos  info 
# 1   1  10  blue 
# 4   1  45   red 
# 5   2   6 green 
# 6   3  12  <NA>

Data:

read.table(text='grp   pos
       1   10
       1   45
       2   6
       3   12', header=TRUE, quote='"') -> query.tab

read.table(text='grp   start   end   info
       1   1   15   blue
       1   23   60   red
       2   1   40   green
       3   20   30   black', header=TRUE, quote='"') -> data.tab
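A shorter base R variant of the same idea, as a sketch rather than a definitive solution: pair every query row with the intervals of its group, keep the pairs where $pos lies inside [start, end], then left-join the result back onto the query so that unmatched positions get NA. Unlike the step-by-step version, this also keeps a query row whose group has several intervals but whose position falls in none of them. The names pairs, hits and result are chosen for illustration, and the data frames are rebuilt here so that the block runs on its own.

```r
query.tab <- data.frame(grp = c(1, 1, 2, 3), pos = c(10, 45, 6, 12))
data.tab  <- data.frame(grp = c(1, 1, 2, 3),
                        start = c(1, 23, 1, 20),
                        end = c(15, 60, 40, 30),
                        info = c("blue", "red", "green", "black"),
                        stringsAsFactors = FALSE)

# All query/interval combinations that share the same group
pairs <- merge(query.tab, data.tab, by = "grp")

# Keep the combinations where the position lies inside the interval
inside <- pairs$pos >= pairs$start & pairs$pos <= pairs$end
hits <- unique(pairs[inside, c("grp", "pos", "info")])

# Left join: every query row is kept, unmatched ones get info = NA
result <- merge(query.tab, hits, by = c("grp", "pos"), all.x = TRUE)
result <- result[order(result$grp, result$pos), ]
rownames(result) <- NULL

result
```

A position that falls inside more than one interval of its group would appear once per matching interval in result.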
