Screen names from Twitter to a data frame - R

Posted by y53ybaqx on 2022-12-06

I am downloading all the tweets (using the rtweet package, version 0.7.0) whose text mentions the user @sernac (a Chilean government entity), and then extracting all the usernames (screen names) from the body of each tweet with the following code.

library(rtweet)
library(stringr)

Tweets <- search_tweets("@sernac", n = 50000, include_rts = FALSE)
Names <- str_extract_all(Tweets$text, "(?<=^|\\s)@[^\\s]+")

This gives me a list with one element per tweet, each holding every screen name found in that tweet's text.
The first question is: how do I get a data frame with the following structure?
| X1 | X2 | X3 | X4 | X5 | ... | Xn |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| @sernac | @vtrchile | NA | NA | NA | NA | NA |
| @username | @playstation | @taylorswitft | @elonmusk | @instagram | NA | NA |
| @username2 | @username5 | @selenagomez | @username2 | @username3 | @FIFA | @xbox |
| @username4 | @ebay | NA | NA | NA | NA | NA |
The number of columns should equal the maximum number of elements in any object of the list.
I tried the following code, but it only returns 4 columns, even though the longest object in the list has 9 elements.

df <- data.frame(matrix(unlist(Names), nrow=length(Names), byrow = T))

After this, I need to left join this table with a cluster table that I created. The join should first use the first column of the new data frame against the cluster data frame. If there is no match, it should try a second left join using the second column, and so on through the remaining columns until a match is found or the columns are exhausted.
Here is an example of the cluster data frame I created and the final result I want:

CLUSTER DATA FRAME

| screen_name | cluster |
| ------------ | ------------ |
| @sernac | Gov |
| @playstation | Videogames |
| @walmart | Supermarket |
| @SelenaGomez | Celebrity |
| @elonmusk | Celebrity |
| @xbox | Videogames |
| @ebay | Ecommerce |

FINAL RESULT

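| X1 | X2 | X3 | X4 | X5 | ... | Xn | cluster |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| @sernac | @vtrchile | NA | NA | NA | NA | NA | Gov |
| @username | @playstation | @taylorswitft | @elonmusk | @instagram | NA | NA | Videogames |
| @username2 | @username5 | @selenagomez | @username2 | @username3 | @FIFA | @xbox | Celebrity |
| @username4 | @ebay | NA | NA | NA | NA | NA | Ecommerce |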

I have tried to explain this as clearly as I can. English is not my first language, so I can give more detail in the comments if needed.

Answer 1 (6uxekuva)

I would approach this differently.
First, if you want to download as many tweets as possible, set n = Inf and retryonratelimit = TRUE:

Tweets <-  search_tweets("@sernac", 
                         n = Inf, 
                         include_rts = FALSE, 
                         retryonratelimit = TRUE)

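Second, there is no need to extract the screen names from the tweet text, because this information is already available in the entities column.
One way to extract the mentions is with lapply. You can then build a data frame that keeps only the useful columns and converts the screen names to lower case for matching.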

library(dplyr)

mentions <- lapply(Tweets$entities, function(x) x$user_mentions) %>%
  bind_rows(.id = "tweet_number") %>%
  select(tweet_number, screen_name) %>%
  mutate(screen_name_lc = tolower(screen_name))

head(mentions)

  tweet_number    screen_name screen_name_lc
1            1 mundo_pacifico mundo_pacifico
2            1       OIMChile       oimchile
3            1   subtel_chile   subtel_chile
4            1 ReclamosSubtel reclamossubtel
5            1         SERNAC         sernac
6            2 mundo_pacifico mundo_pacifico

Next, add a column with the lower-case screen name to the cluster data:

library(stringr)  # provides str_replace()

cluster_df <- cluster_df %>% 
  mutate(screen_name_lc = str_replace(screen_name, "@", "") %>% 
         tolower())

Now we can join the data frames on the screen_name_lc column:

mentions_clusters <- mentions %>% 
  left_join(cluster_df, 
            by = "screen_name_lc") %>% 
  select(tweet_number, screen_name = screen_name.x, cluster)

head(mentions_clusters)

  tweet_number    screen_name cluster
1            1 mundo_pacifico    <NA>
2            1       OIMChile    <NA>
3            1   subtel_chile    <NA>
4            1 ReclamosSubtel    <NA>
5            1         SERNAC     Gov
6            2 mundo_pacifico    <NA>
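
The question asks for a single cluster per tweet, taken from the first mention that has a match. In the long format this comes down to picking the first non-missing cluster within each tweet_number. A minimal sketch, assuming the rows of each tweet are still in mention order; the toy data frame `mentions_clusters_demo` and the result name `tweet_clusters` are made up for illustration:

```r
library(dplyr)

# Toy stand-in for mentions_clusters (hypothetical data, one row per mention)
mentions_clusters_demo <- data.frame(
  tweet_number = c("1", "1", "2", "2", "2", "3"),
  screen_name  = c("sernac", "vtrchile", "username",
                   "playstation", "elonmusk", "username4"),
  cluster      = c("Gov", NA, NA, "Videogames", "Celebrity", NA),
  stringsAsFactors = FALSE
)

# For each tweet, keep the first non-NA cluster in row (mention) order.
# Subsetting an empty vector with [1] yields NA, so tweets without any
# match keep NA as their cluster.
tweet_clusters <- mentions_clusters_demo %>%
  group_by(tweet_number) %>%
  summarise(cluster = cluster[!is.na(cluster)][1], .groups = "drop")
```

This plays the role of the cascading left joins described in the question, without having to repeat the join once per column.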

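This "long" format is easier to work with in later analysis than the "wide" format, and you can still group by tweet using the tweet_number column.
Data for cluster_df: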

cluster_df <- structure(list(screen_name = c("@sernac", "@playstation", "@walmart", 
"@SelenaGomez", "@elonmusk", "@xbox", "@ebay"), cluster = c("Gov", 
"Videogames", "Supermarket", "Celebrity", "Celebrity", "Videogames", 
"Ecommerce"), screen_name_lc = c("sernac", "playstation", "walmart", 
"selenagomez", "elonmusk", "xbox", "ebay")), class = "data.frame", row.names = c(NA, 
-7L))
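
If the wide layout from the question is still needed, it helps to see why the `matrix(unlist(Names), ...)` attempt falls short: `unlist()` flattens the list, so the per-tweet boundaries are lost and the values are simply poured into a grid. A minimal base R sketch, assuming a list of character vectors like the one `str_extract_all()` returns (the toy `Names` below is made up), pads every element with NA up to the longest length before binding the rows:

```r
# Toy stand-in for the list returned by str_extract_all() (hypothetical data)
Names <- list(
  c("@sernac", "@vtrchile"),
  c("@username", "@playstation", "@elonmusk"),
  character(0),
  "@ebay"
)

# The longest element decides the number of columns
max_len <- max(lengths(Names))

# Indexing past the end of a vector yields NA, so x[seq_len(max_len)]
# pads each element to the same length
padded <- lapply(Names, function(x) x[seq_len(max_len)])

# Bind the padded vectors as rows and name the columns X1..Xn explicitly
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- paste0("X", seq_len(max_len))
```

Each row of `wide` corresponds to one tweet, and tweets with fewer mentions are filled with NA on the right.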
