Is there a way to avoid the for loop here?

Posted by t2a7ltrp on 2023-04-18 in Other

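I have a character variable that stores numbers from 0 to 5, with a varying number of values per row. I want to create one dummy variable per number (0 to 5) showing whether that number is present in a given row. I was able to achieve this with: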

library(data.table)

dataset <- 
  data.table(
    'char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
  )

for(i in c(0:5)){
  
  dataset[grepl(i, char), c(paste0('Idx_', i)) := 1]
  
}

which results in:

      char Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
1: 1 2 3 0     1     1     1     1    NA    NA
2:   1 5 0     1     1    NA    NA    NA     1
3:   1 2 0     1     1     1    NA    NA    NA
4:     1 0     1     1    NA    NA    NA    NA
5: 1 2 4 0     1     1     1    NA     1    NA

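Since my dataset is fairly large, and I know it is generally a good idea to avoid for loops, I am curious whether this can be done without one. I tried various combinations of .SD, apply, and "by = 1:nrow(dataset)", but none of them worked for me.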


vbopmzt1 #1

I would suggest a small modification of your current approach that is somewhat faster (loops are not always bad in R):

for (i in 0:5) {
  set(dataset, grep(i, dataset$char, fixed=TRUE), j=paste0('Idx_', i), 1L)
}

Another option:

dataset[, unlist(strsplit(char, " ")), by = .I
        ][, dcast(
              .SD, 
              I ~ paste0("idx_", V1), 
              fun.aggregate = \(x) if (length(x)) 1L else NA_integer_
            )]

#        I idx_0 idx_1 idx_2 idx_3 idx_4 idx_5
#    <int> <int> <int> <int> <int> <int> <int>
# 1:     1     1     1     1     1    NA    NA
# 2:     2     1     1    NA    NA    NA     1
# 3:     3     1     1     1    NA    NA    NA
# 4:     4     1     1    NA    NA    NA    NA
# 5:     5     1     1     1    NA     1    NA

s4chpxco #2

This would be the functional approach:

library(data.table)

dataset <- 
  data.table(
    'char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
  )

dataset[,(paste0('Idx_', 1:5)) := lapply(1:5, \(x) ifelse(grepl(x, char), 1, NA))]

dataset
#>       char Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
#> 1: 1 2 3 0     1     1     1    NA    NA
#> 2:   1 5 0     1    NA    NA    NA     1
#> 3:   1 2 0     1     1    NA    NA    NA
#> 4:     1 0     1    NA    NA    NA    NA
#> 5: 1 2 4 0     1     1    NA     1    NA
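
The call above covers only 1 to 5, while the question also asks for 0. A minimal sketch of the same lapply pattern over 0:5, assuming every value is a single digit as in the question's data. The use of fixed = TRUE and of integer 1L / NA_integer_ values are choices made for this sketch, not part of the answer:

```r
library(data.table)

dataset <- data.table(
  char = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0')
)

digits <- 0:5  # assumption: every value of interest is a single digit

# one new column per digit: 1L where the digit occurs in char, NA otherwise
dataset[, (paste0('Idx_', digits)) := lapply(
  digits,
  function(d) ifelse(grepl(as.character(d), char, fixed = TRUE), 1L, NA_integer_)
)]

dataset
```

Because this is substring matching, it would also flag 1 in a row that only contains 11, so it is only appropriate when the values are single digits.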

rjzwgtxy #3

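If we have multi-digit numbers rather than single digits, grepl will match 1 and 11 in the same way. To avoid this, we can split on spaces (tstrsplit), reshape from wide to long (melt), and then reshape back from long to wide (dcast) with fun.aggregate; see the example: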

#example data with 11 and 23
d <- data.table(char = c('1 2 3 0', 
                         '11 5 0', 
                         '1 23 0', 
                         '1 0',
                         '1 2 4 0'))

# get the maximum number of columns
colMax <- max(stringr::str_count(d$char, " ")) + 1

d[, paste0("c", seq.int(colMax)) := tstrsplit(char, split = " ", type.convert = TRUE) 
        ][, melt(.SD, id.vars = "char") 
          ][ !is.na(value), dcast(.SD, char ~ value, 
                                  fun.aggregate = \(x){ as.integer(any(!is.na(x))) }) ]

#       char 0 1 2 3 4 5 11 23
# 1:     1 0 1 1 0 0 0 0  0  0
# 2: 1 2 3 0 1 1 1 1 0 0  0  0
# 3: 1 2 4 0 1 1 1 0 1 0  0  0
# 4:  1 23 0 1 1 0 0 0 0  0  1
# 5:  11 5 0 1 0 0 0 0 1  1  0
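
If the set of values to test is known in advance, the same whole-token comparison can also be sketched without reshaping: split each string once and test membership with %in%. The `values` vector and the `has_` column prefix below are hypothetical names chosen for illustration, not part of the answer:

```r
library(data.table)

d <- data.table(char = c('1 2 3 0', '11 5 0', '1 23 0', '1 0', '1 2 4 0'))

values <- c(0, 1, 2, 11, 23)  # hypothetical set of values to flag

# split each string on spaces once; tokens are compared as whole strings,
# so "11" never counts as a match for "1"
tokens <- strsplit(d$char, " ", fixed = TRUE)

# one 0/1 integer column per value of interest
d[, (paste0("has_", values)) := lapply(
  values,
  function(v) as.integer(vapply(tokens, function(tk) as.character(v) %in% tk, logical(1)))
)]

d
```

Unlike the dcast version, this keeps the original row order and only creates columns for the values listed in `values`.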

ubof19bj #4

Here is a base R solution. If your data.frame is very large, you can use the parallel package and replace the outer lapply with parallel::parLapply.

# I use a normal data frame instead
dataset <- data.frame('char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))

# reading from the inside out: first split the strings on whitespace, convert to
# numeric, and then test each digit in 0:5 for membership, giving one column per
# digit in the new data frame
# (the inner lapply always returns a list, even when all rows have equal length)
do.call(rbind, lapply(lapply(strsplit(dataset[, "char"], "\\s"), as.numeric), \(numbers){
  
  # use %in% to see which of the digits 0:5 occur in the respective row
  seq(0, 5) %in% numbers
  
})) -> matched_res

# fix colnames 
colnames(matched_res) <- paste0("ind_", 0:5)

# bind
cbind(dataset, matched_res)

#      char ind_0 ind_1 ind_2 ind_3 ind_4 ind_5
# 1 1 2 3 0  TRUE  TRUE  TRUE  TRUE FALSE FALSE
# 2   1 5 0  TRUE  TRUE FALSE FALSE FALSE  TRUE
# 3   1 2 0  TRUE  TRUE  TRUE FALSE FALSE FALSE
# 4     1 0  TRUE  TRUE FALSE FALSE FALSE FALSE
# 5 1 2 4 0  TRUE  TRUE  TRUE FALSE  TRUE FALSE
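
The parallel variant mentioned at the top of this answer could look roughly like the sketch below. The worker count of 2 and the intermediate names `numbers_list` and `rows` are assumptions made for illustration; whether this is actually faster depends on the data size, since starting workers and shipping data to them has a cost of its own.

```r
library(parallel)

dataset <- data.frame(char = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))

# split and convert up front; only the membership test is sent to the workers
numbers_list <- lapply(strsplit(dataset[, "char"], "\\s"), as.numeric)

cl <- makeCluster(2)  # assumption: 2 workers; adjust to the machine
rows <- parLapply(cl, numbers_list, function(numbers) 0:5 %in% numbers)
stopCluster(cl)

# one logical row per input row, one column per digit
matched_res <- do.call(rbind, rows)
colnames(matched_res) <- paste0("ind_", 0:5)

cbind(dataset, matched_res)
```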

0yg35tkg #5

A tidyverse approach, just for the record (not trying to compete on speed here...):

library(tidyverse)

df <- tibble('char' = c('1 2 3 0', '1 5 0', '1 2 0', '1 0', '1 2 4 0'))

df |> 
  mutate(row = row_number(), .before = everything()) |> 
  separate_longer_delim(char, delim = " ") |> 
  arrange(char) |> 
  pivot_wider(
    names_from = char, 
    names_prefix = "Idx_",
    values_from = char, 
    values_fn = \(x) 1
  ) |> 
  select(!row) |> 
  mutate(char = df$char, .before = everything())
#> # A tibble: 5 × 7
#>   char    Idx_0 Idx_1 Idx_2 Idx_3 Idx_4 Idx_5
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 0     1     1     1     1    NA    NA
#> 2 1 5 0       1     1    NA    NA    NA     1
#> 3 1 2 0       1     1     1    NA    NA    NA
#> 4 1 0         1     1    NA    NA    NA    NA
#> 5 1 2 4 0     1     1     1    NA     1    NA

Created on 2023-04-13 with reprex v2.0.2
