sql - Getting values from the first row, the last row, and an aggregate in MySQL window functions

n1bvdmb6  posted on 2021-07-26  in Java

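For marketing-related analysis, I need to provide data about the first and last touchpoint as well as the total number of interactions with our website.
A simplified version of the interaction table looks like this: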

create table interaction (
    id varchar(36) primary key,
    session_id varchar(36) not null,
    timestamp timestamp(3) not null,
    utm_source varchar(255) null,
    utm_medium varchar(255) null
)

Our current approach is as follows:

with interaction_ordered as (
    select *, 
           row_number() over (partition by session_id order by timestamp asc) as row_num_asc,
           row_number() over (partition by session_id order by timestamp desc) as row_num_desc
    from interaction
)

select first_interaction.session_id as session_id,
       first_interaction.timestamp as session_start,
       timestampdiff(SECOND, first_interaction.timestamp, last_interaction.timestamp) as session_duration,
       count(*) as interaction_count,
       first_interaction.utm_source as first_touchpoint,
       last_interaction.utm_source as last_touchpoint,
       last_interaction.utm_medium as last_medium
from interaction_ordered as interaction
join interaction_ordered as first_interaction using (session_id)
join interaction_ordered as last_interaction using (session_id)
where first_interaction.row_num_asc = 1 and last_interaction.row_num_desc = 1
group by session_id
having session_start between ? - interval 1 day and ? + interval 1 day

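At the moment, we observe that the runtime grows roughly linearly with the size of our data, which will soon become infeasible.
Another idea is: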

select session_id,
       min(timestamp) as session_start,
       timestampdiff(
           SECOND,
           min(timestamp),
           max(timestamp)
       ) as session_duration,
       count(*) as interaction_count,
       first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint,
       first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
       first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
from interaction
group by session_id
having session_start between ? - interval 1 day and ? + interval 1 day

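However, in our experiments we never saw the second query finish, so we cannot be 100% sure that it produces the same result.
We tried indexes on timestamp and on (session_id, timestamp), but according to EXPLAIN this did not change the query plan.
Is there a fast way to retrieve individual attributes from the first and last entry per session id, plus the count per session id? Note that in our real example there are more attributes like utm_source and utm_medium that we are interested in.
Edit
Sample data: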

insert into interaction values
('a', 'session_1', '2020-06-15T12:00:00.000', 'search.com', 'search'),
('b', 'session_1', '2020-06-15T12:01:00.000', null, null),
('c', 'session_1', '2020-06-15T12:01:30.000', 'social.com', 'social'),
('d', 'session_1', '2020-06-15T12:02:00.250', 'ads.com', 'ads'),

('e', 'session_2', '2020-06-15T14:00:00.000', null, null),
('f', 'session_2', '2020-06-15T14:12:00.000', null, null),
('g', 'session_2', '2020-06-15T14:25:00.000', 'social.com', 'social'),

('h', 'session_3', '2020-06-16T12:05:00.000', 'ads.com', 'ads'),
('i', 'session_3', '2020-06-16T12:05:01.000', null, null),

('j', 'session_4', '2020-06-15T12:00:00.000', null, null),
('k', 'session_5', '2020-06-15T12:00:00.000', 'search.com', 'search');

Expected result:

session_id, session_start, session_duration, interaction_count, first_touchpoint, last_touchpoint, last_medium
session_1, 2020-06-15T12:00:00.000, 120, 4, search.com, ads.com, ads
session_2, 2020-06-15T14:00:00.000, 1500, 3, null, social.com, social
session_3, 2020-06-16T12:05:00.000, 1, 2, ads.com, null, null
session_4, 2020-06-15T12:00:00.000, 0, 1, null, null, null
session_5, 2020-06-15T12:00:00.000, 0, 1, search.com, search.com, search

I noticed that my second query does not produce the expected result: last_touchpoint and last_medium are filled with the first value instead. I tried both first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint, and last_value(utm_source) over (partition by session_id order by timestamp range between unbounded preceding and unbounded following) as last_touchpoint,

ih99xse1  #1

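The only way to make the query scalable is to filter with a where clause. If I assume that a session never lasts longer than one day, I can widen the time range of the calculation by one day on each side and use window functions. The result looks like this: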

select s.*
from (select i.*,
             min(timestamp) over (partition by session_id) as session_start,
             count(*) over (partition by session_id) as interaction_count,
             first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint,
             first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
             first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
      from interaction i
      where timestamp between ? - interval 2 day and ? + interval 2 day
     ) s
where timestamp = session_start and
      session_start between ? - interval 1 day and ? + interval 1 day;

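Your use of first_value() should return an error: it violates the "full group by" rules (the ONLY_FULL_GROUP_BY SQL mode) that are enabled by default in MySQL 8+. It is not surprising that syntactically incorrect code does not work.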
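To try the pattern from this answer on the sample data without a MySQL server, here is a minimal sketch that runs essentially the same query through Python's built-in sqlite3 module. Several things here are assumptions and not part of the answer: SQLite (3.25 or newer, for window functions) stands in for MySQL, timestamps are stored as fixed-width ISO text so that string comparison matches time order, the two `?` placeholders become the hypothetical parameters `range_start` and `range_end`, and the names `SAMPLE_ROWS`, `FILTERED_QUERY`, `iso` and `sessions_started_between` are made up for illustration. The interval arithmetic is done in Python because SQLite has no `interval` syntax.

```python
import sqlite3
from datetime import datetime, timedelta

# Sample data copied from the question.
SAMPLE_ROWS = [
    ("a", "session_1", "2020-06-15T12:00:00.000", "search.com", "search"),
    ("b", "session_1", "2020-06-15T12:01:00.000", None, None),
    ("c", "session_1", "2020-06-15T12:01:30.000", "social.com", "social"),
    ("d", "session_1", "2020-06-15T12:02:00.250", "ads.com", "ads"),
    ("e", "session_2", "2020-06-15T14:00:00.000", None, None),
    ("f", "session_2", "2020-06-15T14:12:00.000", None, None),
    ("g", "session_2", "2020-06-15T14:25:00.000", "social.com", "social"),
    ("h", "session_3", "2020-06-16T12:05:00.000", "ads.com", "ads"),
    ("i", "session_3", "2020-06-16T12:05:01.000", None, None),
    ("j", "session_4", "2020-06-15T12:00:00.000", None, None),
    ("k", "session_5", "2020-06-15T12:00:00.000", "search.com", "search"),
]

# Inner query: restrict the scanned rows first, then compute the window values
# on every remaining row. Outer query: keep one row per session (its first one).
FILTERED_QUERY = """
select session_id, session_start, interaction_count,
       first_touchpoint, last_touchpoint, last_medium
from (select i.*,
             min(timestamp) over (partition by session_id) as session_start,
             count(*) over (partition by session_id) as interaction_count,
             first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint,
             first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
             first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
      from interaction i
      where timestamp between :scan_from and :scan_to
     ) s
where timestamp = session_start
  and session_start between :start_from and :start_to
order by session_id
"""


def iso(moment):
    # Same fixed-width text format as the sample data, so text comparison follows time order.
    return moment.isoformat(timespec="milliseconds")


def sessions_started_between(range_start, range_end):
    """Return one summary row per session that started within one day of the range."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table interaction ("
        " id text primary key, session_id text not null, timestamp text not null,"
        " utm_source text, utm_medium text)"
    )
    conn.executemany("insert into interaction values (?, ?, ?, ?, ?)", SAMPLE_ROWS)
    day = timedelta(days=1)
    params = {
        # Scan two days around the range so sessions of up to one day stay complete.
        "scan_from": iso(range_start - 2 * day),
        "scan_to": iso(range_end + 2 * day),
        "start_from": iso(range_start - day),
        "start_to": iso(range_end + day),
    }
    rows = conn.execute(FILTERED_QUERY, params).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in sessions_started_between(datetime(2020, 6, 15), datetime(2020, 6, 15)):
        print(row)
```

Note that the outer filter `timestamp = session_start` returns more than one row for a session whose earliest timestamp is shared by several interactions; the sample data contains no such ties.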

bqjvbblv  #2

WITH cte AS ( SELECT *,
                     FIRST_VALUE(utm_source) OVER (PARTITION BY session_id ORDER BY `timestamp` ASC) first_touchpoint,
                     FIRST_VALUE(utm_source) OVER (PARTITION BY session_id ORDER BY `timestamp` DESC) last_touchpoint,
                     FIRST_VALUE(utm_medium) OVER (PARTITION BY session_id ORDER BY `timestamp` DESC) last_medium
              FROM interaction
            )
SELECT session_id,
       MIN(`timestamp`) session_start,
       TIMESTAMPDIFF(SECOND, MIN(`timestamp`), MAX(`timestamp`)) session_duration,
       COUNT(*) interaction_count,
       ANY_VALUE( first_touchpoint ) first_touchpoint,
       ANY_VALUE( last_touchpoint ) last_touchpoint,
       ANY_VALUE( last_medium ) last_medium
FROM cte
GROUP BY session_id;

fiddle
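As a rough check of the query above against the expected result from the question, the following sketch runs a close variant of it on the sample data using Python's sqlite3 module. The assumptions are mine, not the answerer's: SQLite (3.25 or newer) replaces MySQL; `min()` replaces `ANY_VALUE()`, which SQLite may not provide, and this is safe only because each window value is constant within a session; the duration is computed in Python instead of with `TIMESTAMPDIFF`; and the names `SAMPLE_ROWS`, `CTE_QUERY` and `summarize_sessions` are hypothetical.

```python
import sqlite3
from datetime import datetime

# Sample data copied from the question.
SAMPLE_ROWS = [
    ("a", "session_1", "2020-06-15T12:00:00.000", "search.com", "search"),
    ("b", "session_1", "2020-06-15T12:01:00.000", None, None),
    ("c", "session_1", "2020-06-15T12:01:30.000", "social.com", "social"),
    ("d", "session_1", "2020-06-15T12:02:00.250", "ads.com", "ads"),
    ("e", "session_2", "2020-06-15T14:00:00.000", None, None),
    ("f", "session_2", "2020-06-15T14:12:00.000", None, None),
    ("g", "session_2", "2020-06-15T14:25:00.000", "social.com", "social"),
    ("h", "session_3", "2020-06-16T12:05:00.000", "ads.com", "ads"),
    ("i", "session_3", "2020-06-16T12:05:01.000", None, None),
    ("j", "session_4", "2020-06-15T12:00:00.000", None, None),
    ("k", "session_5", "2020-06-15T12:00:00.000", "search.com", "search"),
]

# Step 1 (CTE): attach the first/last values to every row with window functions.
# Step 2: collapse to one row per session; min() stands in for ANY_VALUE().
CTE_QUERY = """
with cte as (
    select *,
           first_value(utm_source) over (partition by session_id order by timestamp asc) as first_touchpoint,
           first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
           first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
    from interaction
)
select session_id,
       min(timestamp) as session_start,
       max(timestamp) as session_end,
       count(*) as interaction_count,
       min(first_touchpoint) as first_touchpoint,
       min(last_touchpoint) as last_touchpoint,
       min(last_medium) as last_medium
from cte
group by session_id
order by session_id
"""


def summarize_sessions():
    """Return (session_id, start, duration_s, count, first, last, last_medium) per session."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table interaction ("
        " id text primary key, session_id text not null, timestamp text not null,"
        " utm_source text, utm_medium text)"
    )
    conn.executemany("insert into interaction values (?, ?, ?, ?, ?)", SAMPLE_ROWS)
    summary = []
    for session_id, start, end, count, first, last, medium in conn.execute(CTE_QUERY):
        # Whole seconds between first and last interaction, truncated toward zero.
        elapsed = datetime.fromisoformat(end) - datetime.fromisoformat(start)
        summary.append((session_id, start, int(elapsed.total_seconds()), count, first, last, medium))
    conn.close()
    return summary


if __name__ == "__main__":
    for row in summarize_sessions():
        print(row)
```

Unlike the question's second attempt, the window functions here run before the grouping step, so they still see every interaction of a session.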
