对SQL和/或Pandas中的连续出现进行计数

h5qlskok 于 2023-05-05 发布在其他

关注(0)|答案(3)|浏览(135)

我想查一下每个学生有多少次连续缺课。我有一个名为class_absenses的表，其中包含以下字段：
id，student_id，present（0代表缺席，1代表出席），date
我想返回一个students_id的列表，以及他们从今天到过去某个历史日期（比如t-30）的连续缺席次数。
如果这可以在SQL中完成，那就太好了！否则，我会将其加载到pandas dataframe中，并对旋转/分组开放以完成此操作。
谢谢你！
样品输入

id      student_id      present      date
0       1               0            4-28-2023
1       1               0            4-27-2023
2       1               1            4-26-2023
3       2               0            4-28-2023
4       2               1            4-27-2023
5       2               0            4-26-2023
6       3               1            4-28-2023
7       3               0            4-27-2023
8       3               0            4-26-2023

输出

student_id     ConsecutiveAbsense
1              2
2              1
3              0

pandas

来源：https://stackoverflow.com/questions/76149140/count-consecutive-occurrences-in-sql-and-or-pandas

3条答案

按热度按时间

thigvfpy1#

假设日期已经排序，使用自定义的groupby.agg：

out = (
 df.groupby('student_id')['present']
   .agg(lambda s: s.eq(0).cummin().sum())
)

输出：

student_id
1    2
2    1
3    0
Name: present, dtype: int64

作为DataFrame：

out = (
 df.groupby('student_id', as_index=False)
   .agg(ConsecutiveAbsense=('present', lambda s: s.eq(0).cummin().sum()))
)

输出：

student_id  ConsecutiveAbsense
0           1                   2
1           2                   1
2           3                   0

赞(0）回复(0）举报 2023-05-05

kqhtkvqz2#

如果这可以在SQL中完成，那就太好了！
在SQL中，这是一个缺口和孤岛问题。你想要每个学生最近一次缺席的时间长度（即最近一次的孤岛）。
下面是一种方法：

select *
from (
    select student_id, 
        count(*) - 1 streak_length,
        row_number() over(partition by student_id order by grp) rn
    from (
        select a.*,
            sum(presence) over(partition by student_id order by date desc) grp
        from class_absences a
    ) a
    group by student_id, grp
    having count(*) > 1
) a
where rn = 1

最内部的子查询使用窗口和定义岛，窗口和随存在的每一天（grp）递增。下一级使用group by和having聚合包含缺勤（即多个记录）的岛，并为每个学生（row_number）枚举它们-外部子查询过滤每个学生的最新条纹。
您没有标记RDBMS。这是大多数数据库都应该支持的标准SQL。另一方面，一些数据库具有使查询更整洁的扩展。

赞(0）回复(0）举报 2023-05-05

5t7ly7z53#

您可以使用条件max window函数查找学生出席的最后一个日期，然后使用该日期过滤掉前面的行：

select student_id, count(*)- case when max(present) = 0 then 0 else 1 end ConsecutiveAbsense -- or simply count(*)-max(present)
from
  (
   select *,
    max(case when present=1 then date_ else '1900-01-01' end) over (partition by student_id) last_present_date
   from tbl_name
  ) T
where date_ >= last_present_date
group by student_id
order by student_id

where date_ >= last_present_date将包括学生出席的最后日期，因此我们使用count(*)-case...排除该行，这只是为了包括计数为0的学生。当所有当前值都为0时，不需要排除该行，因此使用case表达式来处理。
输入数据的输出：

student_id  consecutiveabsense
1           2
2           1
3           0

Demo

赞(0）回复(0）举报 2023-05-05

我来回答

对SQL和/或Pandas中的连续出现进行计数

3条答案

相关问题

热门标签

最新问答