How can I efficiently recode multiple columns in R?

avwztpqn  posted on 2022-12-20 in Other

I need to recode some data. First, imagine the raw data looks like this:

A data.frame: 6 × 5
 col1    col2    col3    col4    col5
<chr>   <chr>   <chr>   <chr>   <chr>
   s1  414234  244575  539645  436236
   s2      NA  512342  644252  835325
   s3      NA      NA  816747  475295
   s4      NA      NA      NA  125429
   s5      NA      NA      NA      NA
   s6  617465  844526      NA  194262

Second, it is transformed into

A data.frame: 6 × 5
 col1    col2    col3    col4    col5
<chr>   <int>   <int>   <int>   <int>
   s1       4       2       5       4
   s2      NA       5       6       8
   s3      NA      NA       8       4
   s4      NA      NA      NA       1
   s5      NA      NA      NA      NA
   s6       6       8      NA       1

This is because I recode everything based on its first digit. Third, after recoding (see the recode pattern in the MWE below), it should look like this:

A data.frame: 6 × 5
 col1    col2    col3    col4    col5
<chr>   <int>   <int>   <int>   <int>
   s1       3       1       3       3
   s2      NA       3       4       5
   s3      NA      NA       5       3
   s4      NA      NA      NA       1
   s5      NA      NA      NA      NA
   s6       4       5      NA       1

Fourth, if every column except the first is empty, the whole row should be removed, i.e.

A data.frame: 5 × 5
 col1    col2    col3    col4    col5
<chr>   <int>   <int>   <int>   <int>
   s1       3       1       3       3
   s2      NA       3       4       5
   s3      NA      NA       5       3
   s4      NA      NA      NA       1
   s6       4       5      NA       1

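This is the final data.
Steps one and two are easy to implement, but I am struggling with steps three and four because I am new to R (see the MWE below). For step three, I tried using mutate on multiple columns, but got Error in UseMethod("mutate"): no applicable method for 'mutate' applied to an object of class "c('integer', 'numeric')". Step four is easy to do in Python with thresh, but I am not sure whether R has an equivalent.
How can this be done? I also work with a large amount of data, so an efficient solution would be highly appreciated.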

library(dplyr)

df <- data.frame(
    col1 = c("s1", "s2", "s3", "s4", "s5", "s6"),
    col2 = c("414234", NA, NA, NA, NA, "617465"),
    col3 = c("244575", "512342", NA, NA, NA, "844526"),
    col4 = c("539645", "644252", "816747", NA, NA, NA),
    col5 = c("436236", "835325", "475295", "125429", NA, "194262")
    )

n = ncol(df)

for (i in colnames(df[2:n])) {
    df[, i] = strtoi(substr(df[, i], 1, 1))
}

for (i in colnames(df[2:n])) {
    df[, i] %>% mutate(i=recode(i, "0": 1, "1": 1, "2": 1, "3": 2, "4": 3, "5": 3, "6": 4, "7": 5, "8": 5))
}
z9ju0rcb  #1

A base R approach:

# cut out just the numeric columns
df2 <- as.matrix(df[, -1])
# first digits
df2[] <- substr(df2, 1, 1)
mode(df2) <- 'numeric'
# recode
df2[] <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)[df2+1]
# write back into the original data frame
df[, -1] <- df2
# remove rows with NAs only
df <- df[apply(df[, -1], 1, \(x) !all(is.na(x))), ]

df
#   V1 V2 V3 V4 V5
# 1 s1  3  1  3  3
# 2 s2 NA  3  4  5
# 3 s3 NA NA  5  3
# 4 s4 NA NA NA  1
# 6 s6  4  5 NA  1

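As you can see, there is no need to work column by column: the operations can be applied to the whole matrix at once, which is more efficient.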
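The two ideas in this answer, recoding by indexing into a lookup vector and dropping all-NA rows, can be seen in isolation on a toy matrix. The following is only a sketch: `digits`, `lookup`, and `thresh` are illustrative names, and `rowSums(!is.na(...)) >= thresh` is one possible stand-in for the pandas `thresh` argument mentioned in the question, not an official equivalent. Since the recode pattern covers only digits 0 to 8, a first digit of 9 would index past the end of the lookup vector and come back as NA.

```r
# Toy matrix of first digits (NA marks a missing value); filled column-wise
digits <- matrix(c(4, NA, NA,
                   2,  5, NA,
                   5,  6, NA),
                 nrow = 3,
                 dimnames = list(NULL, c("col2", "col3", "col4")))

# Position k + 1 holds the new code for digit k (digits 0 to 8)
lookup <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)

# Index the lookup vector with the whole matrix at once;
# an NA index returns NA, so missing values stay missing
recoded <- digits
recoded[] <- lookup[digits + 1]

# Keep rows with at least `thresh` non-missing values
# (`thresh` is a hypothetical name borrowed from pandas)
thresh <- 1
keep <- rowSums(!is.na(recoded)) >= thresh
result <- recoded[keep, , drop = FALSE]
result
```

Compared with `apply(..., 1, ...)`, `rowSums()` avoids calling an R function once per row, which may help on large data.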

jpfvwuh4  #2

You can do this with a combination of tidyverse packages. In R we generally avoid for loops unless we really need them; vectorising is almost always preferable.

library(dplyr)
library(stringr) # for str_sub
library(purrr)   # for negate

mat = matrix(c(     "s1",     "s2",     "s3",     "s4",     "s5",     "s6",
                    "414234",       NA,       NA,       NA,       NA, "617465", 
                    "244575", "512342",       NA,       NA,       NA, "844526",
                    "539645", "644252", "816747",       NA,       NA,       NA,
                    "436236", "835325", "475295", "125429",       NA, "194262"),
             nrow=6,
             ncol=5
)

df <- as.data.frame(mat)

## Step 1: Extract first character of each element
df <- mutate(df, across(V2:V5, str_sub, 1, 1))
head(df)
#>   V1   V2   V3   V4   V5
#> 1 s1    4    2    5    4
#> 2 s2 <NA>    5    6    8
#> 3 s3 <NA> <NA>    8    4
#> 4 s4 <NA> <NA> <NA>    1
#> 5 s5 <NA> <NA> <NA> <NA>
#> 6 s6    6    8 <NA>    1

## Step 3: Recode
df <- mutate(df, 
             across(V2:V5, 
             recode,
             `0` = "1", `1` = "1", `2` = "1", `3` = "2", 
             `4` = "3", `5` = "3", `6` = "4", `7` = "5", `8` = "5"
             ))

## Step 2: convert all columns to numeric
df <- mutate(df, across(V2:V5, as.numeric))

head(df)
#>   V1 V2 V3 V4 V5
#> 1 s1  3  1  3  3
#> 2 s2 NA  3  4  5
#> 3 s3 NA NA  5  3
#> 4 s4 NA NA NA  1
#> 5 s5 NA NA NA NA
#> 6 s6  4  5 NA  1

## Step 4: drop rows where every value in V2:V5 is missing
## By purrr::negate()-ing is.na, we keep only rows where
## at least one value is not missing

df <- filter(df, if_any(V2:V5, negate(is.na)))
df
#>   V1 V2 V3 V4 V5
#> 1 s1  3  1  3  3
#> 2 s2 NA  3  4  5
#> 3 s3 NA NA  5  3
#> 4 s4 NA NA NA  1
#> 5 s6  4  5 NA  1

Created on 2022-12-13 with reprex v2.0.2
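Newer dplyr releases (1.1.0 and later) deprecate passing extra arguments through `across()`, as in `across(V2:V5, str_sub, 1, 1)`. A possible variant of the same pipeline with explicit lambdas is sketched below. It assumes R 4.1 or later for the `\(x)` syntax, uses a smaller made-up data frame, and swaps `recode()` for a named lookup vector (`lookup` is an illustrative name), so it is a variation on this answer rather than the answerer's own code.

```r
library(dplyr)

# Small stand-in for the data in the question
df <- data.frame(
  V1 = c("s1", "s2", "s5"),
  V2 = c("414234", NA, NA),
  V3 = c("244575", "512342", NA)
)

# Named lookup: first digit (as a string) -> new code
lookup <- c(`0` = 1, `1` = 1, `2` = 1, `3` = 2, `4` = 3,
            `5` = 3, `6` = 4, `7` = 5, `8` = 5)

out <- df |>
  # take the first character and look it up by name; NA stays NA
  mutate(across(V2:V3, \(x) unname(lookup[substr(x, 1, 1)]))) |>
  # keep rows where at least one of V2:V3 is not missing
  filter(if_any(V2:V3, \(x) !is.na(x)))

out
```

Indexing the named vector does the first-digit recode and the numeric conversion in one step, so the separate `as.numeric` pass is not needed here.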

aiazj4mn  #3

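This one uses some more involved math: it extracts the first digit arithmetically, via integer division by a power of ten.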

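library(dplyr)  # mutate, filter
library(tidyr)  # pivot_longer, pivot_wider
library(purrr)  # map_dbl

df |>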
    pivot_longer(col2:col5, values_to = "val", names_to = "col") |>
    mutate(val = map_dbl(as.integer(val),
                         ~c(1, 1, 1, 2, 3, 3, 4, 5, 5)[.x %/% 10^trunc(log10(.x)) +1])) |>
    filter(!is.na(val)) |>
    pivot_wider(values_from = val, names_from = col )

##> # A tibble: 5 × 5
##>   col1   col2  col3  col4  col5
##>   <chr> <dbl> <dbl> <dbl> <dbl>
##> 1 s1        3     1     3     3
##> 2 s2       NA     3     4     5
##> 3 s3       NA    NA     5     3
##> 4 s4       NA    NA    NA     1
##> 5 s6        4     5    NA     1
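
The arithmetic inside `map_dbl()` is vectorised already, so it can be tried on a plain vector. Below is a minimal sketch, assuming strictly positive integers; `first_digit` is a hypothetical helper name, and `floor()` is used in place of `trunc()`, which is equivalent for positive inputs.

```r
# floor(log10(x)) is the number of digits minus one, so integer division
# by that power of ten leaves only the leading digit.
# Assumes x > 0: log10(0) is -Inf and negative values give NaN.
first_digit <- function(x) x %/% 10^floor(log10(x))  # hypothetical helper

x <- c(414234L, 244575L, 835325L, 7L, NA)

# Arithmetic route
by_math <- first_digit(x)

# String route used in the other answers, for comparison
by_string <- as.integer(substr(as.character(x), 1, 1))

by_math
by_string
```

Because `%/%` and `log10()` work on whole vectors, the same expression could be applied to a column directly, without calling a function once per element.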
