Using the file posts.csv from this repo's data folder (a sample of 10,000 public posts made on Facebook by members of the US Congress since 2017), solve the following problem with dplyr:
- Do not take into account posts with zero likes
- Compute the comments-to-likes ratio of every post (i.e. comments_count / likes_count) and store it in a column `clr`
- For every `screen_name`, compute `normaliser_based_on_even_months = max(clr) - min(clr)`, i.e. the maximum minus the minimum `clr` value of posts by that `screen_name`; however, when computing `max(clr) - min(clr)` for each `screen_name`, only take into account posts made in even months, i.e. posts made in February, April, June, August, October and December (see the sketch after this list)
- Set all `normaliser_based_on_even_months` that have a value of zero to NA, or delete them
- Then create a column `normalised_clr` which stores the `clr` of all posts from the original data frame (other than the zero-like posts deleted in the first step) divided by the `normaliser_based_on_even_months` of the associated screen name. The only exception are posts from screen names whose `normaliser_based_on_even_months` was zero and was deleted/set to NA before -- for these posts, simply set the value in `normalised_clr` to NA as well, or drop the posts from the final data frame.
- In other words, the value of the `normalised_clr` column for a single post/row `i` (written by politician `p`) can be computed as `normalised_clr_{i,p} = clr_i / normaliser_based_on_even_months_p` for all observations with a non-NA `normaliser_based_on_even_months` (no loop is needed for this; dplyr allows computing it in a vectorised way)
- Keep only those rows with `normalised_clr` > 0
- Arrange the data frame according to `normalised_clr` in ascending order
- Print out only `screen_name` and `normalised_clr` for the first 10 rows, i.e. the 10 posts with the lowest `normalised_clr`
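For the even-month step above, here is a minimal dplyr sketch of that single computation. It is not the asker's code (which appears further below), and the file path and column handling are assumptions based on the task description:

library(dplyr)
library(lubridate)

# Sketch only: path assumed to be the repo's posts.csv
posts <- read.csv("posts.csv", stringsAsFactors = FALSE) %>%
  filter(likes_count > 0) %>%                          # drop zero-like posts
  mutate(clr   = comments_count / likes_count,         # comments-to-likes ratio
         month = month(ymd(date)))                     # month number 1-12

# max(clr) - min(clr) per screen_name, computed over even-month posts alone
even_month_normalisers <- posts %>%
  filter(month %% 2 == 0) %>%                          # February, April, ..., December
  group_by(screen_name) %>%
  summarise(normaliser_based_on_even_months = max(clr) - min(clr))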
I have written an R program for this. However, my task is to convert it into a SQLite query. Here is what I did:
# Load the database packages
library(DBI)
library(RSQLite)

# Create the database
posts_db <- dbConnect(RSQLite::SQLite(), "C:/Users/HP/Documents/posts.sqlite")

# Read the CSV into R and write it to the database
data <- read.csv("C:/Users/HP/Documents/posts.csv",
                 stringsAsFactors = FALSE)
dbWriteTable(posts_db, "posts", data, overwrite = TRUE)
dbListFields(posts_db, "posts")
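As a quick sanity check on the import (an added sketch, not part of the original post), the table written to SQLite can be compared against the data frame read from the CSV:

# Sketch: the row count in the SQLite table should match the CSV
nrow(data)
dbGetQuery(posts_db, "SELECT COUNT(*) AS n_rows FROM posts")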
dbGetQuery(posts_db,"WITH
cte1 AS (SELECT screen_name, comments_count*1.0/likes_count AS clr,
strftime('%m', date) AS month FROM posts WHERE likes_count>0),
cte2 AS (SELECT (MAX(clr) - MIN(clr)) AS normaliser_based_on_even_months
FROM cte1
WHERE month % 2 = 0
GROUP BY screen_name),
cte3 AS (SELECT screen_name, clr, normaliser_based_on_even_months,
clr/normaliser_based_on_even_months AS normalized_clr FROM cte1, cte2
WHERE normaliser_based_on_even_months>0)
SELECT screen_name, normalized_clr FROM cte3
WHERE normalized_clr>0
ORDER BY normalized_clr")
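One way to see the size problem the answer below points out (this check is an added sketch, not part of the original post) is to wrap the same CTEs in a COUNT(*) before printing anything:

# Sketch: count how many rows the joined CTEs return; the unconstrained
# "FROM cte1, cte2" is a cross join of the post-level rows with the
# per-screen_name aggregates, so this number is far larger than 10,000.
dbGetQuery(posts_db, "WITH
  cte1 AS (SELECT screen_name, comments_count*1.0/likes_count AS clr,
           strftime('%m', date) AS month FROM posts WHERE likes_count>0),
  cte2 AS (SELECT (MAX(clr) - MIN(clr)) AS normaliser_based_on_even_months
           FROM cte1
           WHERE month % 2 = 0
           GROUP BY screen_name),
  cte3 AS (SELECT screen_name, clr, normaliser_based_on_even_months,
           clr/normaliser_based_on_even_months AS normalized_clr FROM cte1, cte2
           WHERE normaliser_based_on_even_months>0)
  SELECT COUNT(*) AS n_rows FROM cte3 WHERE normalized_clr>0")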
Here is the output:
screen_name normalized_clr
1 repmarkpocan 0.0002546821
2 repmarkpocan 0.0002690018
3 CongressmanRalphAbraham 0.0002756995
4 CongressmanRalphAbraham 0.0002912010
5 derek.kilmer 0.0003549631
6 RepJimMcGovern 0.0003664136
7 CongresswomanNorton 0.0003687929
8 derek.kilmer 0.0003749212
9 RepJimMcGovern 0.0003870155
10 CongresswomanNorton 0.0003895286
Sample Data
For reference, here is the R code that produces the result I want the SQLite query to generate:
# Load the packages used below
library(dplyr)
library(lubridate)

posts <- read.csv("C:/Users/HP/Documents/posts.csv")
# Remove rows with zero likes
posts <- posts %>% filter(likes_count > 0)
# Create 'clr', the comments-to-likes ratio
posts <- posts %>% mutate(clr = comments_count / likes_count)
# Compute the maximum minus the minimum `clr` value of posts by that `screen_name`,
# however, only taking into account __the posts made in even months, i.e. posts made
# in February, April, June, August, October, December__ when computing
# `max(clr) - min(clr)` for each `screen_name`. Code from
# https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group
posts$date <- ymd(posts$date)
posts$date <- month(posts$date)
posts_normaliser <- posts %>%
  group_by(screen_name) %>%
  mutate(normaliser_based_on_even_months = case_when(date %% 2 == 0 ~ (max(clr) - min(clr))))
# Set all `normaliser_based_on_even_months` that have a value of zero to NA or delete them
posts_normaliser <- posts_normaliser %>% filter(normaliser_based_on_even_months > 0)
# Afterwards create a column `normalised_clr` which stores the `clr` of all posts from
# the original data frame (other than those with zero likes which were deleted in the
# first step) divided by the `normaliser_based_on_even_months` of the associated screen name.
merged_df <- merge(posts, posts_normaliser)
merged_df <- merged_df %>%
  group_by(screen_name) %>%
  mutate(normalised_clr = clr / normaliser_based_on_even_months)
# Keep only those rows with `normalised_clr` > 0
merged_df <- merged_df %>% filter(normalised_clr > 0)
# Arrange the data frame according to `normalised_clr` in ascending order
merged_df <- merged_df %>% arrange(normalised_clr)
# Print out only `screen_name` and `normalised_clr` for the first 10 rows,
# i.e. the posts with the 10 lowest `normalised_clr`
merged_df[1:10, c("screen_name", "normalised_clr")]
Here is the R output:
> merged_df[1:10, c("screen_name", "normalised_clr")]
# A tibble: 10 × 2
# Groups: screen_name [5]
screen_name normalised_clr
<chr> <dbl>
1 CongresswomanSheilaJacksonLee 0.00214
2 CongresswomanSheilaJacksonLee 0.00218
3 CongresswomanSheilaJacksonLee 0.00277
4 RepMullin 0.00342
5 SenDuckworth 0.00342
6 CongresswomanSheilaJacksonLee 0.00357
7 replahood 0.00477
8 SenDuckworth 0.00488
9 SenDuckworth 0.00505
10 RepSmucker 0.00516
I keep checking the statements, but I cannot figure out why the outputs differ.
1 Answer
If you look at the output of your SQL query, the result is 4,543,128 rows out of the original 10,000, which suggests that your join is exploding. Since everything you are joining is either (a) the original rows or (b) aggregated to unique `screen_name`s, I think we can use that column. I modified your query to do a couple of things:
- add `screen_name` to `cte2`
- join on it inside `cte3`
- `LIMIT 10`, since you said everything should go into a single query and the "lowest 10" is part of the constraints.
This produces: