I am downloading all the tweets (using the rtweet package, version 0.7.0) whose text mentions the user @sernac (a Chilean government entity). I then extract all the usernames (screen names) from the body of each tweet using the following code.
library(rtweet)
library(stringr)
Tweets <- search_tweets("@sernac", n = 50000, include_rts = FALSE)
Names <- str_extract_all(Tweets$text, "(?<=^|\\s)@[^\\s]+")
This gives me a list object containing every screen name found in each tweet's text.
The first question is: how do I get a data frame with the following structure?
| X1 | X2 | X3 | X4 | X5 | ... | Xn |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| @sernac | @vtrchile | NA | NA | NA | NA | NA |
| @username | @playstation | @taylorswitft | @elonmusk | @instagram | NA | NA |
| @username2 | @username5 | @selenagomez | @username2 | @username3 | @FIFA | @xbox |
| @username4 | @ebay | NA | NA | NA | NA | NA |
Here the number of columns equals the maximum number of elements in any object in the list.
I tried the following code, but it only returns 4 columns, even though the largest object in the list has 9 elements.
df <- data.frame(matrix(unlist(Names), nrow=length(Names), byrow = T))
After this, I need to perform a left join between this table and a cluster table that I created. The join should first be made between the first column of the newly created data frame and the cluster data frame. If there is no match, a second left join should be performed using the second column, and so on through the remaining columns until a match is found or the columns are exhausted.
This is an example of the cluster data frame I created and the final desired result:
CLUSTER DATA FRAME
screen_name | cluster |
---|---|
@sernac | Gov |
@playstation | Videogames |
@walmart | Supermarket |
@SelenaGomez | Celebrity |
@elonmusk | Celebrity |
@xbox | Videogames |
@ebay | Ecommerce |
FINAL RESULT
X1 | X2 | X3 | X4 | X5 | ... | Xn | cluster |
---|---|---|---|---|---|---|---|
@sernac | @vtrchile | NA | NA | NA | NA | NA | Gov |
@username | @playstation | @taylorswitft | @elonmusk | @instagram | NA | NA | Videogames |
@username2 | @username5 | @selenagomez | @username2 | @username3 | @FIFA | @xbox | Celebrity |
@username4 | @ebay | NA | NA | NA | NA | NA | Ecommerce |
I have tried to explain myself as well as I can. English is not my first language, so I can provide more detail in the comments.
1 Answer
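I would approach this differently.

First, if you want to download as many tweets as possible, set `n = Inf` and `retryonratelimit = TRUE` in the call to `search_tweets()`.

Second, there is no need to extract the screen names from the tweet text, because this information is available in the `entities` column. One way to extract the mentions is to use `lapply`. You can then create a data frame containing only the useful columns, and convert the screen names to lower case for matching. Next, add a column with the lower-case screen names to the cluster data.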
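The extraction steps described above can be sketched roughly as follows. This is a minimal sketch, not the answer's original code: it assumes the rtweet 1.0 or later layout, in which each element of the `entities` list-column holds a `user_mentions` data frame with a `screen_name` column (without the leading `@`). A small mock `Tweets` object stands in for the real download so the example runs offline; `um()` and `mentions_df` are hypothetical names.

```r
# Real download (commented out: it needs Twitter API credentials):
# library(rtweet)
# Tweets <- search_tweets("@sernac", n = Inf, include_rts = FALSE,
#                         retryonratelimit = TRUE)

# Hypothetical helper that builds a mock `user_mentions` data frame.
um <- function(x) data.frame(screen_name = x, stringsAsFactors = FALSE)

# Mock with the ASSUMED shape of the `entities` list-column (rtweet >= 1.0).
Tweets <- data.frame(
  text = c("@sernac @vtrchile hola", "@username4 @eBay hi", "no mentions"),
  stringsAsFactors = FALSE
)
Tweets$entities <- list(
  list(user_mentions = um(c("sernac", "vtrchile"))),
  list(user_mentions = um(c("username4", "eBay"))),
  list(user_mentions = um(NA_character_))
)

# One character vector of mentioned screen names per tweet.
mentions <- lapply(Tweets$entities, function(x) x$user_mentions$screen_name)

# Long format: one row per (tweet, mention) pair.
mentions_df <- data.frame(
  tweet_number = rep(seq_along(mentions), lengths(mentions)),
  screen_name  = unlist(mentions),
  stringsAsFactors = FALSE
)

# Drop tweets without mentions, then add "@" and a lower-case matching key.
mentions_df <- mentions_df[!is.na(mentions_df$screen_name), ]
mentions_df$screen_name    <- paste0("@", mentions_df$screen_name)
mentions_df$screen_name_lc <- tolower(mentions_df$screen_name)
```

The `@` prefix is added only because the cluster table in the question stores screen names with it; if the cluster table omitted the prefix, that line could be dropped.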
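Now we can join the data frames on the `screen_name_lc` column. This "long" format is easier to use for subsequent analysis than the "wide" format, and you can still group by tweet using the `tweet_number` column.

Here `cluster_df` holds the cluster data from the question.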
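The join, and the per-tweet cluster lookup the question asks for, might look like the sketch below. It is an illustration under assumptions rather than the answer's original code: base R `match()` stands in for a left join (dplyr's `left_join()` would be an alternative), `first_cluster` is a hypothetical name, and `mentions_df` is rebuilt by hand from the question's example rows so the block runs on its own.

```r
# Cluster table from the question, plus a lower-case matching key.
cluster_df <- data.frame(
  screen_name = c("@sernac", "@playstation", "@walmart", "@SelenaGomez",
                  "@elonmusk", "@xbox", "@ebay"),
  cluster     = c("Gov", "Videogames", "Supermarket", "Celebrity",
                  "Celebrity", "Videogames", "Ecommerce"),
  stringsAsFactors = FALSE
)
cluster_df$screen_name_lc <- tolower(cluster_df$screen_name)

# Mentions per tweet, copied from the question's example table.
mentions <- list(
  c("@sernac", "@vtrchile"),
  c("@username", "@playstation", "@taylorswitft", "@elonmusk", "@instagram"),
  c("@username2", "@username5", "@selenagomez", "@username2", "@username3",
    "@FIFA", "@xbox"),
  c("@username4", "@ebay")
)
mentions_df <- data.frame(
  tweet_number = rep(seq_along(mentions), lengths(mentions)),
  screen_name  = unlist(mentions),
  stringsAsFactors = FALSE
)
mentions_df$screen_name_lc <- tolower(mentions_df$screen_name)

# Left-join-style lookup: rows keep their order, unmatched names get NA.
idx <- match(mentions_df$screen_name_lc, cluster_df$screen_name_lc)
mentions_df$cluster <- cluster_df$cluster[idx]

# For each tweet, keep the cluster of the first mention that matched.
first_cluster <- vapply(
  split(mentions_df$cluster, mentions_df$tweet_number),
  function(x) x[!is.na(x)][1],
  character(1)
)
```

With the question's example data, `first_cluster` is Gov, Videogames, Celebrity and Ecommerce for tweets 1 to 4, which matches the desired final result. A tweet with no matching mention would get `NA`.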