R and SQL: Using the LEAD function by group

rjee0c15 posted on 2023-03-15 in Other

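I have the following R code. It simulates random coin-flip data for different students, counts the three-flip sequences observed across all students, and computes the conditional probability of the third result given the first two: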

library(dplyr)
library(tidyverse)

ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)

my_data =  my_data[order(my_data$student_id),]

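# Build every run of three consecutive flips within each student, then count them.
# Note: dplyr 1.1.0 and later deprecate summarize() calls that return more than
# one row per group; reframe() is the replacement there.
my_data = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)), .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence)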

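# Split each sequence into its first two flips and its third flip, then
# compute P(third | first two) within each two-flip prefix.
final = my_data %>%
  mutate(two_seq = substr(Sequence, 1, 2)) %>%
  group_by(two_seq) %>%
  mutate(third = substr(Sequence, 3, 3)) %>%
  group_by(two_seq, third) %>%
  summarize(sums = sum(n)) %>%
  mutate(prob = sums / sum(sums))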

# FINAL RESULT (Note: I think this is correct because the pairs of conditional probabilities add to 1)
# A tibble: 8 x 4
# Groups:   two_seq [4]
  two_seq third  sums  prob
  <chr>   <chr> <int> <dbl>
1 HH      H       112 0.514
2 HH      T       106 0.486
3 HT      H       108 0.537
4 HT      T        93 0.463
5 TH      H        97 0.5  
6 TH      T        97 0.5  
7 TT      H        93 0.497
8 TT      T        94 0.503
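
For readers less familiar with `lead()`, the same calculation can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `conditional_probs` and the tiny example input are hypothetical, and each student's flips are assumed to be stored in the order they occurred.

```python
from collections import Counter


def conditional_probs(flips_by_student):
    """Return {(two_seq, third): (count, prob)} over per-student windows of 3 flips."""
    counts = Counter()
    for flips in flips_by_student.values():
        # Sliding window of length 3 inside one student only, which is what
        # coin_result, lead(coin_result) and lead(coin_result, 2) produce per group.
        for i in range(len(flips) - 2):
            window = "".join(flips[i:i + 3])
            counts[(window[:2], window[2])] += 1

    # Total number of windows that share the same first two flips.
    totals = Counter()
    for (two_seq, _third), n in counts.items():
        totals[two_seq] += n

    # P(third | two_seq) = count(two_seq, third) / count(two_seq)
    return {key: (n, n / totals[key[0]]) for key, n in counts.items()}


# Hypothetical toy input: student 1 flipped six times, student 2 three times.
example = {1: list("HHTHHH"), 2: list("HHH")}
result = conditional_probs(example)
for key in sorted(result):
    print(key, result[key])
```

Windows never cross from one student to the next, which is the role `group_by(student_id)` plays in the R pipeline.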

**My question:** I am trying to convert the R code above to (Netezza) SQL.

I did some research and found the LEAD() function, which might be able to do this.
Here is my attempt:

SELECT Sequence, COUNT(*) as count
    FROM (
      SELECT
        STUDENT_ID,
        coin_result ||
          LEAD(coin_result, 1) OVER (PARTITION BY STUDENT_ID ORDER BY STUDENT_ID) ||
          LEAD(coin_result, 2) OVER (PARTITION BY STUDENT_ID ORDER BY STUDENT_ID) ||
          LEAD(coin_result, 3) OVER (PARTITION BY STUDENT_ID ORDER BY STUDENT_ID) AS Sequence
      FROM my_data
    ) subquery
    WHERE Sequence NOT LIKE '%NULL%'
    GROUP BY Sequence;

I am not sure whether I did this correctly, and I also do not know how to add the conditional probability column.

Can someone show me how to proceed?
Thanks!

wfveoks0


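The excellent dbplyr package has a show_query function that displays the SQL dbplyr generates from dplyr code. I built the example with SQLite, so you may need to adapt it to Netezza's SQL dialect. I changed the first summarise to mutate to make it work, but the result is the same.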

library(tidyverse)

ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, my_data)

tbl(con, "my_data") %>% 
  group_by(student_id) %>%
  mutate(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2))) %>%
  ungroup() %>% 
  filter(!is.na(Sequence)) %>%
  count(Sequence) %>% 
  mutate(two_seq = substr(Sequence, 1, 2)) %>%
  group_by(two_seq) %>%
  mutate(third = substr(Sequence, 3, 3)) %>%
  group_by(two_seq, third) %>%
  summarize(sums = sum(n)) %>%
  mutate(prob = sums / sum(sums)) %>% 
  show_query()

Output:

<SQL>
SELECT `two_seq`, `third`, `sums`, `sums` / SUM(`sums`) OVER (PARTITION BY `two_seq`) AS `prob`
FROM (SELECT `two_seq`, `third`, SUM(`n`) AS `sums`
FROM (SELECT `Sequence`, `n`, `two_seq`, SUBSTR(`Sequence`, 3, 1) AS `third`
FROM (SELECT `Sequence`, `n`, SUBSTR(`Sequence`, 1, 2) AS `two_seq`
FROM (SELECT `Sequence`, COUNT(*) AS `n`
FROM (SELECT `student_id`, `coin_result`, CONCAT_WS('', `coin_result`, LEAD(`coin_result`, 1, NULL) OVER (PARTITION BY `student_id`), LEAD(`coin_result`, 2.0, NULL) OVER (PARTITION BY `student_id`)) AS `Sequence`
FROM `my_data`)
WHERE (NOT(((`Sequence`) IS NULL)))
GROUP BY `Sequence`)))
GROUP BY `two_seq`, `third`)
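
The generated query may need a few adjustments before it matches the R result. Three things stand out:

- `LEAD` has no `ORDER BY` inside `OVER`, so the row order within each student is not guaranteed.
- `CONCAT_WS` skips NULL arguments in the dialects that provide it, so incomplete sequences at the end of a student's rows would not be filtered out by the `IS NULL` check.
- `sums / SUM(sums)` divides two integers, which truncates in SQLite.

Below is a minimal sketch of a hand-written variant, run through Python's standard `sqlite3` module so it can be tried without a database server. It is untested on Netezza. The `flip_no` ordering column, the `conditional_probs_sql` helper, and the toy rows are all assumptions introduced for illustration; the original table has no ordering column, and one would be needed for `LEAD` to be deterministic.

```python
import sqlite3

# The || operator returns NULL if any operand is NULL, so windows that run past
# the end of a student's rows are dropped by the IS NOT NULL filter.
QUERY = """
WITH seqs AS (
    SELECT coin_result
           || LEAD(coin_result, 1) OVER (PARTITION BY student_id ORDER BY flip_no)
           || LEAD(coin_result, 2) OVER (PARTITION BY student_id ORDER BY flip_no)
           AS sequence
    FROM my_data
),
counts AS (
    SELECT SUBSTR(sequence, 1, 2) AS two_seq,
           SUBSTR(sequence, 3, 1) AS third,
           COUNT(*) AS sums
    FROM seqs
    WHERE sequence IS NOT NULL
    GROUP BY two_seq, third
)
SELECT two_seq, third, sums,
       -- 1.0 * forces floating-point division instead of integer division
       1.0 * sums / SUM(sums) OVER (PARTITION BY two_seq) AS prob
FROM counts
ORDER BY two_seq, third
"""


def conditional_probs_sql(rows):
    """rows: iterable of (student_id, flip_no, coin_result) tuples (flip_no is hypothetical)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE my_data (student_id INTEGER, flip_no INTEGER, coin_result TEXT)"
    )
    con.executemany("INSERT INTO my_data VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


# Hypothetical toy data: student 1 flipped six times, student 2 three times.
rows = [(1, i, c) for i, c in enumerate("HHTHHH")]
rows += [(2, i, c) for i, c in enumerate("HHH")]
sql_result = conditional_probs_sql(rows)
for row in sql_result:
    print(row)
```

Compared with the attempt in the question, this uses offsets 1 and 2 only (three flips per sequence, as in the R code), orders by a per-student sequence column rather than by `STUDENT_ID`, and filters with `IS NOT NULL` instead of `NOT LIKE '%NULL%'`.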
