Unable to write a SQLite query

q8l4jmvw  posted on 2023-01-21  in SQLite

Using the file posts.csv from the data folder of this repo (a sample of 10,000 public Facebook posts made by members of the US Congress since 2017), solve the following tasks with dplyr:

  • Disregard posts with zero likes
  • Compute each post's comment-to-like ratio (i.e. comments_count / likes_count) and store it in a column clr
  • For each screen_name, compute normaliser_based_on_even_months = max(clr) - min(clr), i.e. the maximum minus the minimum clr value of posts by that screen_name; however, only take into account posts made in even months, i.e. posts made in February, April, June, August, October, December, when computing max(clr) - min(clr) for each screen_name
  • Set all normaliser_based_on_even_months with a value of zero to NA, or delete them
  • Afterwards, create a column normalised_clr which stores the clr of every post in the original data frame (other than the zero-like posts deleted in the first step) divided by the normaliser_based_on_even_months of the associated screen name. The only exception are posts from screen names whose normaliser_based_on_even_months was zero and was deleted/set to NA before -- for those posts, set normalised_clr to NA as well, or drop them from the final data frame
  • In other words, the value of normalised_clr for a single post/row i (written by politician p) can be computed as: normalised_clr_{i,p} = clr_{i} / normaliser_based_on_even_months_{p}, for all observations where a non-NA normaliser_based_on_even_months exists (but there is no need to use a loop for this; dplyr can compute it in a vectorised way)
  • Keep only rows with normalised_clr > 0
  • Arrange the data frame by normalised_clr in ascending order
  • Print only screen_name and normalised_clr for the first 10 rows, i.e. the 10 posts with the lowest normalised_clr
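Before looking at code, the formula in the second-to-last-but-two bullet can be checked with a tiny worked example (all numbers invented, plain Python just for the arithmetic):

```python
# Toy posts for one politician p (invented numbers): (comments, likes, month)
posts_p = [(1, 8, 2), (5, 8, 4), (2, 8, 7)]

# clr per post (posts with zero likes would already have been dropped)
clrs = [c / l for c, l, _ in posts_p]                       # [0.125, 0.625, 0.25]

# normaliser_based_on_even_months uses even-month posts only (months 2 and 4 here)
even = [clr for clr, (_, _, m) in zip(clrs, posts_p) if m % 2 == 0]
normaliser = max(even) - min(even)                          # 0.625 - 0.125 = 0.5

# normalised_clr divides every post's clr by that politician's normaliser
normalised = [clr / normaliser for clr in clrs]
print(normalised)  # [0.25, 1.25, 0.5]
```

Note that the odd-month post still receives a normalised_clr; the even-month restriction applies only to computing the normaliser itself.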

I have written an R program for this. However, my task is to convert it into a SQLite query. Here is what I did:

# Create database
library(DBI)        # dbConnect(), dbWriteTable(), dbGetQuery()
library(RSQLite)

posts_db <- dbConnect(RSQLite::SQLite(), "C:/Users/HP/Documents/posts.sqlite")

# Reading the first file into R
data <- read.csv("C:/Users/HP/Documents/posts.csv",
    stringsAsFactors = FALSE)

dbWriteTable(posts_db, "posts", data, overwrite = TRUE)

dbListFields(posts_db, "posts")

dbGetQuery(posts_db,"WITH 
cte1 AS (SELECT screen_name, comments_count*1.0/likes_count AS clr, 
strftime('%m', date) AS month FROM posts WHERE likes_count>0), 
cte2 AS (SELECT (MAX(clr) - MIN(clr)) AS normaliser_based_on_even_months 
FROM cte1 
WHERE month % 2 = 0
GROUP BY screen_name),
cte3 AS (SELECT screen_name, clr, normaliser_based_on_even_months,
clr/normaliser_based_on_even_months AS normalized_clr FROM cte1, cte2 
WHERE normaliser_based_on_even_months>0)
SELECT screen_name, normalized_clr FROM cte3 
WHERE normalized_clr>0 
ORDER BY normalized_clr")

Here is the output:

screen_name normalized_clr
1             repmarkpocan   0.0002546821
2             repmarkpocan   0.0002690018
3  CongressmanRalphAbraham   0.0002756995
4  CongressmanRalphAbraham   0.0002912010
5             derek.kilmer   0.0003549631
6           RepJimMcGovern   0.0003664136
7      CongresswomanNorton   0.0003687929
8             derek.kilmer   0.0003749212
9           RepJimMcGovern   0.0003870155
10     CongresswomanNorton   0.0003895286
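One SQLite detail the query relies on: strftime('%m', date) returns a two-character string such as '02', and the comparison month % 2 = 0 only works because SQLite coerces that text to a number. A quick check using Python's built-in sqlite3 module (the same engine RSQLite wraps):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# strftime('%m', ...) yields zero-padded text, not an integer
month = con.execute("SELECT strftime('%m', '2017-02-14')").fetchone()[0]
print(month)  # '02'

# SQLite coerces the text to a number for the modulo, so '02' % 2 evaluates to 0
even = con.execute("SELECT strftime('%m', '2017-02-14') % 2 = 0").fetchone()[0]
print(even)   # 1 (true)
```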

Sample Data
For reference, here is the R code that produces the result I want the SQLite query to reproduce:

library(dplyr)
library(lubridate)  # ymd(), month()

posts <- read.csv("C:/Users/HP/Documents/posts.csv")

#Remove rows with zero likes
posts <- posts %>% filter(likes_count > 0)

#create 'clr' which is the comment to like ratio
posts <- posts %>% mutate(clr = comments_count / likes_count) 

#Compute the maximum minus the minimum `clr` value of posts by that `screen_name`, however, only taking into account __the posts made in even months, i.e. posts made in February, April, June, August, October, December__ when computing `max(clr) - min(clr)` for each `screen_name`. Code from https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group

posts$date <- ymd(posts$date)
posts$date <- month(posts$date)  # replace each date with its month number (1-12)

#Note: max(clr) and min(clr) here are evaluated over all of the group's posts;
#the case_when condition only decides which rows (even months) receive the value
posts_normaliser <- posts %>% group_by(screen_name) %>% mutate(normaliser_based_on_even_months = case_when(date%%2==0 ~ (max(clr) - min(clr))))

#Set all `normaliser_based_on_even_months` that have a value of zero to NA or delete them
posts_normaliser <- posts_normaliser %>% filter(normaliser_based_on_even_months > 0)

#Afterwards create a column `normalised_clr` which stores the `clr` of all posts from the original data frame (other than those with zero likes which were deleted in the first step) divided by the `normaliser_based_on_even_months` of the associated screen name. 

merged_df <- merge(posts, posts_normaliser)

merged_df <- merged_df %>% group_by(screen_name)%>% mutate(normalised_clr = clr / normaliser_based_on_even_months)

#Keep only those rows with `normalised_clr` > 0

merged_df <- merged_df %>% filter(normalised_clr > 0)

#Arrange the data frame according to `normalised_clr` in ascending order

merged_df <- merged_df %>% arrange(normalised_clr)

#Print out only `screen_name` and `normalised_clr` for the first 10 rows, i.e. the posts with the 10 lowest `normalised_clr`

merged_df[1:10, c("screen_name", "normalised_clr")]

Here is the R output:

> merged_df[1:10, c("screen_name", "normalised_clr")]
# A tibble: 10 × 2
# Groups:   screen_name [5]
   screen_name                   normalised_clr
   <chr>                                  <dbl>
 1 CongresswomanSheilaJacksonLee        0.00214
 2 CongresswomanSheilaJacksonLee        0.00218
 3 CongresswomanSheilaJacksonLee        0.00277
 4 RepMullin                            0.00342
 5 SenDuckworth                         0.00342
 6 CongresswomanSheilaJacksonLee        0.00357
 7 replahood                            0.00477
 8 SenDuckworth                         0.00488
 9 SenDuckworth                         0.00505
10 RepSmucker                           0.00516

I keep checking the statements over and over, but I don't understand why the outputs differ.

tkclm6bt

If you look at the output of your SQL query, it returns 4,543,128 rows from the original 10,000, which indicates that your join is exploding. Since every table you are joining is either (a) the original rows or (b) a summary with one row per unique screen_name, I think we can join on that column.
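A minimal illustration of the explosion, on invented toy data via Python's built-in sqlite3 (table and column names are made up): writing FROM cte1, cte2 with no condition relating the two tables is a cross join, so every post row pairs with every summary row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE posts(screen_name TEXT, clr REAL);
    INSERT INTO posts VALUES ('a', 0.1), ('a', 0.3), ('b', 0.2), ('b', 0.6);
    -- one summary row per screen_name, like cte2
    CREATE TABLE summary(screen_name TEXT, norm REAL);
    INSERT INTO summary VALUES ('a', 0.2), ('b', 0.4);
""")

# Comma join with no join condition: 4 posts x 2 summary rows = 8 rows
cross = con.execute("SELECT COUNT(*) FROM posts, summary").fetchone()[0]
print(cross)   # 8

# Joining on screen_name keeps exactly one row per post
joined = con.execute("""
    SELECT COUNT(*) FROM posts p
    JOIN summary s ON p.screen_name = s.screen_name
""").fetchone()[0]
print(joined)  # 4
```

With 10,000 posts and hundreds of screen names, the same mechanism inflates the result into the millions.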


  • Add screen_name to cte2
  • Update the join inside cte3
  • Add LIMIT 10, since you said "all in a single query" and "the lowest 10" is part of the constraints
WITH
cte1 AS (
  SELECT screen_name, comments_count*1.0/likes_count AS clr,
    strftime('%m', date) AS month
  FROM posts WHERE likes_count > 0
),
cte2 AS (
  SELECT screen_name, (MAX(clr) - MIN(clr)) AS normaliser_based_on_even_months
  FROM cte1
  WHERE month % 2 = 0
  GROUP BY screen_name
),
cte3 AS (
  SELECT cte1.screen_name, clr, normaliser_based_on_even_months,
    clr/normaliser_based_on_even_months AS normalized_clr
  FROM cte1
    LEFT JOIN cte2 on cte1.screen_name = cte2.screen_name
  WHERE normaliser_based_on_even_months > 0
)
SELECT screen_name, normalized_clr
FROM cte3
  WHERE normalized_clr > 0
ORDER BY normalized_clr
LIMIT 10

This produces:

screen_name normalized_clr
1    SenDuckworth    0.002318900
2       RepMullin    0.003415301
3    SenDuckworth    0.003425708
4  repmikecoffman    0.003861004
5    SenDuckworth    0.004173566
6    SenDuckworth    0.004880627
7    SenDuckworth    0.005035819
8    SenDuckworth    0.005051889
9    SenDuckworth    0.005112120
10        RepBost    0.005600815
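The corrected join can also be sanity-checked end to end on a tiny hand-made table (Python's built-in sqlite3 here; the data are invented and deliberately use power-of-two like counts so the ratios are exact):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE posts(screen_name TEXT, comments_count INT, likes_count INT, date TEXT);
    INSERT INTO posts VALUES
      ('a', 1, 8, '2017-02-01'),  -- even month, clr = 0.125
      ('a', 5, 8, '2017-04-01'),  -- even month, clr = 0.625
      ('a', 2, 8, '2017-03-01'),  -- odd month,  clr = 0.25
      ('b', 3, 8, '2017-05-01'),  -- only odd months: no normaliser, dropped
      ('a', 1, 0, '2017-02-02');  -- zero likes, dropped in cte1
""")

rows = con.execute("""
WITH
cte1 AS (
  SELECT screen_name, comments_count*1.0/likes_count AS clr,
         strftime('%m', date) AS month
  FROM posts WHERE likes_count > 0
),
cte2 AS (
  SELECT screen_name, MAX(clr) - MIN(clr) AS normaliser  -- 0.625 - 0.125 = 0.5 for 'a'
  FROM cte1 WHERE month % 2 = 0
  GROUP BY screen_name
),
cte3 AS (
  SELECT cte1.screen_name, clr/normaliser AS normalized_clr
  FROM cte1 LEFT JOIN cte2 ON cte1.screen_name = cte2.screen_name
  WHERE normaliser > 0
)
SELECT screen_name, normalized_clr FROM cte3
WHERE normalized_clr > 0
ORDER BY normalized_clr LIMIT 10
""").fetchall()
print(rows)  # [('a', 0.25), ('a', 0.5), ('a', 1.25)]
```

One row per surviving post comes back, exactly as in the dplyr pipeline: 'b' disappears because it has no even-month posts, and the zero-like row never enters cte1.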
