Splitting a concatenated text column into binary columns in R (dplyr) when the full list of categories is unknown

qhhrdooz posted on 2023-03-20 in Other

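I have seen similar questions, but I have not found a good solution that uses tidyr and dplyr. I have a column of concatenated text categories, and I do not know the complete list of them. The data is large, so I cannot identify every category in this column by inspection. For each ID, I need to split the categories apart and create a corresponding binary column for each one, indicating whether that category applies to the ID.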

library(dplyr)

df <- data.frame(id = c(1, 2, 3, 4, 5),
                 number = c("a,b,d", "e,a", "c", "", "k,t"))
df %>% glimpse()

Rows: 5
Columns: 2
$ id     <dbl> 1, 2, 3, 4, 5
$ number <chr> "a,b,d", "e,a", "c", "", "k,t"

The output I want looks like this:

id a b c d e k t
1  1 1 0 1 0 0 0
2  1 0 0 0 1 0 0
3  0 0 1 0 0 0 0
4  0 0 0 0 0 0 0
5  0 0 0 0 0 1 1

Thanks in advance. I hope the question is clear.

Source: https://stackoverflow.com/questions/75759618/spliting-a-concatenated-text-column-into-binary-columns-in-r-dplyr-when-a-list

2 Answers

n3schb8v #1

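This approach uses separate_longer_delim from tidyr to split the text categories, then borrows a convenient dummy-coding function, step_dummy from recipes, to do the one-hot encoding. In the output below, the empty string for id 4 shows up as its own column, number_X.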

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(dplyr)
library(tidyr)

df <- data.frame(id=c(1,2,3,4,5),
                 number=c("a,b,d","e,a","c","","k,t"))

df %>%
  separate_longer_delim(number, delim = ',') %>%
  recipe(~.) %>%
  step_dummy(number, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  group_by(id) %>%
  summarize(across(everything(), max))
#> # A tibble: 5 × 9
#>      id number_X number_a number_b number_c number_d number_e number_k number_t
#>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1     1        0        1        1        0        1        0        0        0
#> 2     2        0        1        0        0        0        1        0        0
#> 3     3        0        0        0        1        0        0        0        0
#> 4     4        1        0        0        0        0        0        0        0
#> 5     5        0        0        0        0        0        0        1        1

Created on 2023-03-16 with reprex v2.0.2
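If loading recipes is not desirable, roughly the same table can be sketched with dplyr and tidyr alone. This is a swapped-in technique (pivot_wider instead of step_dummy), not the answer's method. It assumes tidyr 1.3.0 or later for separate_longer_delim, and the names `long`, `wide`, `result`, and `present` are made up for illustration. Empty strings are filtered out before pivoting, and the ids are joined back afterwards so that id 4 is kept as a row of zeros.

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = c(1, 2, 3, 4, 5),
                 number = c("a,b,d", "e,a", "c", "", "k,t"))

# One row per (id, category) pair; drop empty strings and repeated pairs
long <- df %>%
  separate_longer_delim(number, delim = ",") %>%
  filter(number != "") %>%
  distinct(id, number) %>%
  mutate(present = 1L)

# One column per category, sorted by name; missing pairs are filled with 0
wide <- long %>%
  pivot_wider(names_from = number, values_from = present,
              values_fill = 0L, names_sort = TRUE)

# Join back to the full id list so ids with no category are kept as zeros
result <- df %>%
  select(id) %>%
  left_join(wide, by = "id") %>%
  mutate(across(-id, ~ coalesce(.x, 0L)))

result
```

The column names here are the bare category values rather than number_a, number_b, and so on, which is closer to the layout asked for in the question.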

kqhtkvqz #2

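First extract the unique categories with base R's strsplit, then test for the presence of each category by running grepl over the rows inside lapply, and convert the result with as.data.frame: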

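df <- data.frame(id = c(1, 2, 3, 4, 5),
                 number = c("a,b,d", "e,a", "c", "", "k,t"))

# Collect every distinct category that appears in the column
categories <- unique(unlist(strsplit(df$number, ",")))

# For each category, flag the rows whose string matches it
data <- as.data.frame(lapply(categories,
                             function(x) as.numeric(grepl(x, df$number))))
names(data) <- categories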

#> data
#  a b d e c k t
#1 1 1 1 0 0 0 0
#2 1 0 0 1 0 0 0
#3 0 0 0 0 1 0 0
#4 0 0 0 0 0 0 0
#5 0 0 0 0 0 1 1
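One caveat with the grepl step: it matches patterns anywhere in the string, so a category such as "a" would also be flagged for a row containing only "ab". That does not matter for the single-letter example data, but it might for real category names. A minimal base R sketch of an exact-match variant follows. The names `pieces`, `flags`, and `result` are made up for illustration, and sorting the categories and attaching the id column are additions that are not part of the answer above.

```r
df <- data.frame(id = c(1, 2, 3, 4, 5),
                 number = c("a,b,d", "e,a", "c", "", "k,t"))

# Split each row once; the empty string yields character(0)
pieces <- strsplit(df$number, ",", fixed = TRUE)
categories <- sort(unique(unlist(pieces)))

# For each category, check exact membership in each row's pieces.
# The result is a matrix with one row per id and one column per category.
flags <- vapply(
  categories,
  function(category) {
    as.integer(vapply(pieces, function(p) category %in% p, logical(1)))
  },
  integer(nrow(df))
)

# Attach the id column so each row can be traced back to its source
result <- cbind(df["id"], as.data.frame(flags))
result
```

Because %in% compares whole elements, longer category names that contain shorter ones are not confused with each other.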
