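I have the following R code. It simulates random coin-flip data for different students, counts the three-flip sequences obtained across all students, and computes the conditional probabilities (the third result given the first two):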
library(dplyr)
library(tidyverse)

# Simulate 1000 coin flips assigned at random to 100 students
ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)
my_data = my_data[order(my_data$student_id),]

# Build every run of three consecutive flips per student, then count each sequence
my_data = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)), .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence)

# Split each sequence into its first two flips and the third, then compute P(third | first two)
final = my_data %>%
  mutate(two_seq = substr(Sequence, 1, 2)) %>%
  group_by(two_seq) %>%
  mutate(third = substr(Sequence, 3, 3)) %>%
  group_by(two_seq, third) %>%
  summarize(sums = sum(n)) %>%
  mutate(prob = sums / sum(sums))
# FINAL RESULT (Note: I think this is correct because the pairs of conditional probabilities add to 1)
# A tibble: 8 x 4
# Groups:   two_seq [4]
  two_seq third  sums  prob
  <chr>   <chr> <int> <dbl>
1 HH      H       112 0.514
2 HH      T       106 0.486
3 HT      H       108 0.537
4 HT      T        93 0.463
5 TH      H        97 0.5
6 TH      T        97 0.5
7 TT      H        93 0.497
8 TT      T        94 0.503
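In symbols, the `prob` column is the share of each third outcome within its two-flip prefix. The notation below is introduced here for illustration and does not appear in the code: n_{abc} is the count of sequence abc (the `sums` column), and X_1, X_2, X_3 are the three flips.

```latex
\hat{P}(X_3 = c \mid X_1 X_2 = ab) = \frac{n_{abc}}{n_{abH} + n_{abT}},
\qquad \text{e.g.} \quad
\hat{P}(X_3 = H \mid X_1 X_2 = HH) = \frac{112}{112 + 106} \approx 0.514
```

Because the two values for a given prefix share the same denominator, each pair sums to 1 by construction.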
**My question:** I am trying to convert the R code above to (Netezza) SQL.
I did some research and found the LEAD() function, which might be able to do this.
Here is my attempt:
SELECT Sequence, COUNT(*) as count
FROM (
    SELECT
        STUDENT_ID,
        coin_result ||
        LEAD(coin_result, 1) OVER (PARTITION BY STUDENT_ID ORDER BY STUDENT_ID) ||
        LEAD(coin_result, 2) OVER (PARTITION BY STUDENT_ID ORDER BY STUDENT_ID) ||
        LEAD(coin_result, 3) OVER (PARTITION BY STUDENT_ID ORDER BY STUDENT_ID) AS Sequence
    FROM my_data
) subquery
WHERE Sequence NOT LIKE '%NULL%'
GROUP BY Sequence;
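I'm not sure whether I did this correctly, and I'm also not sure how to add the conditional probability column.
Can someone show me how to proceed?
Thanks!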
1 Answer
wfveoks01#
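For comparison with the dbplyr translation described below, here is a hand-written sketch of the same logic. It has not been run on Netezza, and it assumes a hypothetical `toss_order` column that records the order of each student's flips. That column is needed because a SQL table has no inherent row order, and `ORDER BY STUDENT_ID` inside a `STUDENT_ID` partition leaves the order undefined. The sketch uses only two `LEAD` offsets, to match the three-flip sequences in the R code. It filters with `IS NOT NULL` because, in standard SQL, concatenating a NULL yields NULL rather than the text 'NULL'.

```sql
-- Sketch only: toss_order is an assumed column, not part of the original my_data.
WITH seqs AS (
    -- One candidate three-flip sequence per row; NULL when fewer than
    -- two flips follow the current one for that student.
    SELECT
        student_id,
        coin_result
          || LEAD(coin_result, 1) OVER (PARTITION BY student_id ORDER BY toss_order)
          || LEAD(coin_result, 2) OVER (PARTITION BY student_id ORDER BY toss_order) AS sequence
    FROM my_data
),
counts AS (
    -- Count each (first two flips, third flip) combination.
    SELECT
        SUBSTR(sequence, 1, 2) AS two_seq,
        SUBSTR(sequence, 3, 1) AS third,
        COUNT(*) AS sums
    FROM seqs
    WHERE sequence IS NOT NULL
    GROUP BY SUBSTR(sequence, 1, 2), SUBSTR(sequence, 3, 1)
)
SELECT
    two_seq,
    third,
    sums,
    -- Multiply by 1.0 to avoid integer division.
    sums * 1.0 / SUM(sums) OVER (PARTITION BY two_seq) AS prob
FROM counts
ORDER BY two_seq, third;
```

The window `SUM` in the final step divides each count by the total for its `two_seq` prefix, which mirrors the grouped `mutate(prob = sums / sum(sums))` in the R code.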
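The amazing dbplyr package has a `show_query` function that shows the SQL dbplyr generates from dplyr code. I made an example with SQLite, so you may need to adapt it to Netezza's SQL dialect. I changed the first `summarise` to `mutate` to make it work, but the result is the same. Output: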