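For a marketing-related analysis, I need to provide data about the first and last touchpoint as well as the total number of interactions with our website.
A simplified version of our interaction table looks like this: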
create table interaction (
    id varchar(36) primary key,
    session_id varchar(36) not null,
    timestamp timestamp(3) not null,
    utm_source varchar(255) null,
    utm_medium varchar(255) null
)
Our current approach looks like this:
with interaction_ordered as (
    select *,
           row_number() over (partition by session_id order by timestamp asc) as row_num_asc,
           row_number() over (partition by session_id order by timestamp desc) as row_num_desc
    from interaction
)
select first_interaction.session_id as session_id,
       first_interaction.timestamp as session_start,
       timestampdiff(SECOND, first_interaction.timestamp, last_interaction.timestamp) as session_duration,
       count(*) as interaction_count,
       first_interaction.utm_source as first_touchpoint,
       last_interaction.utm_source as last_touchpoint,
       last_interaction.utm_medium as last_medium
from interaction_ordered as interaction
join interaction_ordered as first_interaction using (session_id)
join interaction_ordered as last_interaction using (session_id)
where first_interaction.row_num_asc = 1 and last_interaction.row_num_desc = 1
group by session_id
having session_start between ? - interval 1 day and ? + interval 1 day
Currently, we observe that the runtime grows roughly linearly with the amount of data, which will soon become infeasible.
Another idea is:
select session_id,
       min(timestamp) as session_start,
       timestampdiff(
           SECOND,
           min(timestamp),
           max(timestamp)
       ) as session_duration,
       count(*) as interaction_count,
       first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint,
       first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
       first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
from interaction
group by session_id
having session_start between ? - interval 1 day and ? + interval 1 day
However, in our experiments we never saw the second query finish, so we cannot be 100% sure that it produces the same results.
We tried indexes on timestamp and on (session_id, timestamp), but according to EXPLAIN this did not change the query plan.
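For reference, the two indexes mentioned above could be created roughly as sketched here. The index names are hypothetical, and the EXPLAIN target is a simplified aggregate rather than the full query, so this only illustrates how the plan was inspected.

```sql
-- Index names are made up for illustration; only the column lists
-- (timestamp) and (session_id, timestamp) come from the description above.
create index idx_interaction_timestamp
    on interaction (timestamp);
create index idx_interaction_session_timestamp
    on interaction (session_id, timestamp);

-- Prefixing a statement with EXPLAIN shows which index, if any,
-- the optimizer picks. This is a reduced stand-in for the real query.
explain
select session_id,
       min(timestamp) as session_start,
       max(timestamp) as session_end,
       count(*) as interaction_count
from interaction
group by session_id;
```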
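Is there a fast way to retrieve individual attributes from the first and the last entry per session_id, together with the count per session_id? Note that in our real-world example there are more attributes like utm_source and utm_medium that we are interested in.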
Edit
Sample data:
insert into interaction values
('a', 'session_1', '2020-06-15T12:00:00.000', 'search.com', 'search'),
('b', 'session_1', '2020-06-15T12:01:00.000', null, null),
('c', 'session_1', '2020-06-15T12:01:30.000', 'social.com', 'social'),
('d', 'session_1', '2020-06-15T12:02:00.250', 'ads.com', 'ads'),
('e', 'session_2', '2020-06-15T14:00:00.000', null, null),
('f', 'session_2', '2020-06-15T14:12:00.000', null, null),
('g', 'session_2', '2020-06-15T14:25:00.000', 'social.com', 'social'),
('h', 'session_3', '2020-06-16T12:05:00.000', 'ads.com', 'ads'),
('i', 'session_3', '2020-06-16T12:05:01.000', null, null),
('j', 'session_4', '2020-06-15T12:00:00.000', null, null),
('k', 'session_5', '2020-06-15T12:00:00.000', 'search.com', 'search');
Expected result:
session_id, session_start, session_duration, interaction_count, first_touchpoint, last_touchpoint, last_medium
session_1, 2020-06-15T12:00:00.000, 120, 4, search.com, ads.com, ads
session_2, 2020-06-15T14:00:00.000, 1500, 3, null, social.com, social
session_3, 2020-06-16T12:05:00.000, 1, 2, ads.com, null, null
session_4, 2020-06-15T12:00:00.000, 0, 1, null, null, null
session_5, 2020-06-15T12:00:00.000, 0, 1, search.com, search.com, search
I noticed that my second query does not produce the expected result: last_touchpoint and last_medium are filled with the first value instead. I tried both
first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
and
last_value(utm_source) over (partition by session_id order by timestamp range between unbounded preceding and unbounded following) as last_touchpoint,
2 Answers
ih99xse1 1#
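The only way to make the query scalable is to use a where clause. If I assume that sessions never last longer than one day, then I can widen the time range for the calculation by one day and use window functions. That leads to a query that filters on the widened range first and then applies the window functions.
Your use of first_value() should return an error: it violates the "full group by" rule (the ONLY_FULL_GROUP_BY SQL mode) that is enabled by default in MySQL 8+. It is not surprising that syntactically incorrect code does not work.
bqjvbblv 2#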
Fiddle
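The first answer describes its approach only in prose: filter with a where clause on a widened time range, then use window functions. A minimal sketch of that idea follows. It is not the answerer's original query, and it assumes MySQL 8+, sessions that never last longer than one day, and an index on interaction (timestamp) so that the range filter can prune rows. The four placeholders are assumed to be bound as range start, range end, range start, range end.

```sql
-- Sketch under the assumptions stated above, not a verified solution.
select session_id,
       session_start,
       timestampdiff(second, session_start, session_end) as session_duration,
       interaction_count,
       first_touchpoint,
       last_touchpoint,
       last_medium
from (
    select session_id,
           min(timestamp) over w_all as session_start,
           max(timestamp) over w_all as session_end,
           count(*)       over w_all as interaction_count,
           first_value(utm_source) over w_asc  as first_touchpoint,
           first_value(utm_source) over w_desc as last_touchpoint,
           first_value(utm_medium) over w_desc as last_medium,
           row_number() over w_asc as row_num
    from interaction
    -- The session filter below allows starts within one day of the range.
    -- With sessions of at most one day, widening by one more day on each
    -- side keeps every row of every session that can pass that filter.
    where timestamp >= ? - interval 2 day
      and timestamp <= ? + interval 2 day
    window w_all  as (partition by session_id),
           w_asc  as (partition by session_id order by timestamp),
           w_desc as (partition by session_id order by timestamp desc)
) as per_row
-- Every row of a session carries the same window results, so keep one.
where row_num = 1
  and session_start between ? - interval 1 day and ? + interval 1 day;
```

Because no group by is mixed with the window functions, the query does not run into the ONLY_FULL_GROUP_BY restriction. If two interactions of a session share the same timestamp, which of them counts as first or last is not deterministic; adding id as a tie-breaker to the order by clauses would be one way to pin that down.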