Extracting the distinct hashtags ("#") from text stored in a data frame with R

wlsrxk51  posted on 2023-04-03 in R

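I have a data frame containing some tweets, and I want to use the unnest_tokens() function from the tidytext package to extract the hashtags from the tweets, producing a tokenized data frame with one row per hashtag.
My data has only 3 columns: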

  1. Fecha: the date of the tweet, a POSIXct variable.
  2. Usuario: the ID of the user who posted the tweet, a numeric variable.
  3. Texto: the text of the tweet, a character variable.

otros_numerales_numeral_petro  <- Numeral_Petro_sin_emojis %>% 
unnest_tokens(output = "hashtag", input = "Texto", token = "tweets") %>%
filter(str_starts(hashtag, "#"))

However, when I run the code, I get this error:
Error: ! Support for token = "tweets" was deprecated in tidytext 0.4.0 and is now defunct.
Can anyone help me fix this?

42fyovps

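Yes, the token = "tweets" option was deprecated late last year because of changes in upstream dependencies. It sounds like you don't actually want to tokenize the text, but rather extract all of the hashtags. I would do it like this: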

library(tidyverse)
library(rtweet)
bunny_tweets <- 
  search_tweets("#rabbits", n = 20, include_rts = FALSE) %>%
  filter(!possibly_sensitive, lang == "en")

bunny_tweets %>%
  mutate(hashtags = str_extract_all(full_text, "#\\S+")) %>%
  unnest(hashtags) %>%
  select(id, hashtags, full_text)
#> # A tibble: 142 × 3
#>         id hashtags          full_text                                          
#>      <dbl> <chr>             <chr>                                              
#>  1 1.64e18 #Animate          "This awesome comic deserves more attention!\n \n#…
#>  2 1.64e18 #Doujinshi        "This awesome comic deserves more attention!\n \n#…
#>  3 1.64e18 #rabbits          "This awesome comic deserves more attention!\n \n#…
#>  4 1.64e18 #april            "New baby bunny spotted! #april #rabbits\nBlack ba…
#>  5 1.64e18 #rabbits          "New baby bunny spotted! #april #rabbits\nBlack ba…
#>  6 1.64e18 #LFDIE            "Trust me! You'll get addicted to this story!\n \n…
#>  7 1.64e18 #rabbits          "Trust me! You'll get addicted to this story!\n \n…
#>  8 1.64e18 #huacheng         "Trust me! You'll get addicted to this story!\n \n…
#>  9 1.64e18 #digitalanimation "I've been completely addicted to ONEPIECE and Mar…
#> 10 1.64e18 #rabbits          "I've been completely addicted to ONEPIECE and Mar…
#> # … with 132 more rows

Created on 2023-04-01 with reprex v2.0.2
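
The same idea can be applied to a data frame shaped like the one in the question. The following is only a minimal sketch: the `tweets` tibble is made-up sample data standing in for Numeral_Petro_sin_emojis, and the pattern "#\\w+" is my own substitution for "#\\S+" so that trailing punctuation such as "," or "!" is not captured as part of the hashtag. Whether that pattern suits your data is an assumption you should check.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical sample data with the same three columns as in the question
tweets <- tibble(
  Fecha   = as.POSIXct(c("2023-03-01 10:00:00",
                         "2023-03-01 11:30:00",
                         "2023-03-02 09:15:00"), tz = "UTC"),
  Usuario = c(101, 102, 103),
  Texto   = c("Hoy en #Bogota, con #Petro!",
              "Sin etiquetas aqui",
              "Otra vez #Petro y #reforma.")
)

hashtags <- tweets %>%
  # "#" followed by word characters; str_extract_all() returns a list column
  mutate(hashtag = str_extract_all(Texto, "#\\w+")) %>%
  # one row per hashtag; tweets without any hashtag are dropped by default
  unnest(hashtag) %>%
  select(Fecha, Usuario, hashtag)

# The distinct hashtags and how often each one appears
hashtag_counts <- hashtags %>%
  count(hashtag, sort = TRUE)
```

If tweets without hashtags should be kept as rows with NA, unnest(hashtag, keep_empty = TRUE) does that. Note that hashtags differing only in case (for example "#Petro" and "#petro") are counted separately here; wrapping the column in str_to_lower() before count() would merge them.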
