How can I use data.table fifelse with vectors in its arguments?

Posted by flmtquvp on 2023-05-26 in Other

Suppose I have this data.frame:

DF <- data.frame(one=c(1, NA, NA, 1, NA, NA), two=c(NA,1,NA, NA, NA,1), 
         three=c(NA,NA, 1, NA, 1,NA))

one    two  three         output
  1     NA    NA             one
 NA      1    NA             two
 NA     NA     1           three
  1     NA    NA             one  
 NA     NA     1           three
 NA      1    NA             two

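The columns are mutually exclusive: each row has exactly one non-NA value.
I need to generate the output column shown above: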

output=c("one","two","three","one","three", "two")

I tried data.table's fifelse, but it fails:

with(DF,fifelse(one==1, "one", fifelse(two==1,"two", "three", na="three"), 
   na=fifelse(two==1,"two", "three", na="three")))

Error in fifelse(one == 1, "one", fifelse(two == 1, "two", "three", na = "three"),  : 
  Length of 'na' is 6 but must be 1

It seems that fifelse does not accept a vector for the na argument.
dplyr's if_else works fine here:

with(DF,if_else(one==1, "one", if_else(two==1,"two", "three", missing="three"), 
   missing=if_else(two==1,"two", "three", missing="three")))

How can I get the same output with data.table?
Any other simple alternative is welcome too. In base R I can do

apply(DF,1, function(x) which(!is.na(x)))

and then replace those numbers with the corresponding column names.
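
The index-to-name step described above can be sketched as follows. This is only a minimal illustration, assuming every row has exactly one non-NA entry so that apply returns a plain integer vector (with zero or several non-NA values in a row it would return a list instead). The names idx and output are illustrative and not part of the original question.

```r
DF <- data.frame(one   = c(1, NA, NA, 1, NA, NA),
                 two   = c(NA, 1, NA, NA, NA, 1),
                 three = c(NA, NA, 1, NA, 1, NA))

# Position of the single non-NA value in each row
idx <- apply(DF, 1, function(x) which(!is.na(x)))

# Use those positions to index the column names
output <- names(DF)[idx]
```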

jfewjypa #1

Another data.table alternative:

setDT(DF)  # set() can only add a new column to a data.table, not to a plain data.frame
for (col in names(DF)) set(DF, which(DF[[col]] == 1), j = "output", value = col)

deyfvvtc #2

If each row has exactly one non-NA value, you can try max.col:

> names(DF)[max.col(!is.na(DF))]
[1] "one"   "two"   "three" "one"   "three" "two"

Or col + na.omit (though this may be slow if speed matters):

> names(DF)[na.omit(c(t(col(DF) * DF)))]
[1] "one"   "two"   "three" "one"   "three" "two"

Benchmark:

library(microbenchmark)
microbenchmark(
    f1 = names(DF)[max.col(!is.na(DF))],
    f2 = names(DF)[na.omit(c(t(col(DF) * DF)))]
)

which gives

Unit: microseconds
 expr   min     lq    mean median    uq    max neval
   f1  28.5  51.45  92.343  64.40  91.8 1532.5   100
   f2 300.7 527.65 634.755 595.35 691.5 2405.4   100

ct2axkht #3

fifelse is not the best tool here; I suggest fcase, which is easier:

data.table

library(data.table)
as.data.table(DF)[, fcase(one == 1, "one", two == 1, "two", three == 1, "three")]
# [1] "one"   "two"   "three" "one"   "three" "two"
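
If you would rather stay with fifelse, one possible workaround is to build tests that can never be NA, so the scalar-only na argument is not needed at all. The sketch below assumes, as in the question, that any non-NA entry counts as a hit (it does not check for the value 1) and that every row has exactly one non-NA entry; a row with all NA values would fall through to "three".

```r
library(data.table)

DF <- data.frame(one   = c(1, NA, NA, 1, NA, NA),
                 two   = c(NA, 1, NA, NA, NA, 1),
                 three = c(NA, NA, 1, NA, 1, NA))

# !is.na() always returns TRUE or FALSE, never NA,
# so the nested fifelse calls do not need `na` at all
output <- with(DF, fifelse(!is.na(one), "one",
                   fifelse(!is.na(two), "two", "three")))
```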

dplyr

The dplyr analogue is case_when:

library(dplyr)
with(DF, case_when(one == 1 ~ "one", two == 1 ~ "two", three == 1 ~ "three"))
# [1] "one"   "two"   "three" "one"   "three" "two"

base R

Both the data.table and dplyr implementations assume the column names are known in advance. A base-R approach that does not need them:

colnames(DF)[apply(DF, 1, which.max)]
# [1] "one"   "two"   "three" "one"   "three" "two"

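(By the way, which.max could just as well be which.min; really we are only looking for the one non-NA value.)
If you have other columns that should not be considered, you will need to subset DF inside apply(DF, ...) so that it only looks at the desired columns.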
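
The column subsetting mentioned above might look like this. It is only a sketch: the extra id column and the cols vector are hypothetical additions for illustration, and it again assumes each row has exactly one non-NA value among the selected columns.

```r
DF <- data.frame(one   = c(1, NA, NA, 1, NA, NA),
                 two   = c(NA, 1, NA, NA, NA, 1),
                 three = c(NA, NA, 1, NA, 1, NA),
                 id    = 101:106)  # hypothetical column that must be ignored

# Only these columns take part in the lookup
cols <- c("one", "two", "three")

# which.max skips NA values, so it returns the position of the single non-NA entry
output <- cols[apply(DF[cols], 1, which.max)]
```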
