Distinct users in Cassandra: how to do that?

Posted by xesrikrc on 2021-06-07 in Kafka

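I am developing a big data application in Scala.
I use Kafka, Spark (Kafka streams), and Cassandra as storage.
I have an application outside Spark that queries Cassandra to display statistics, such as the number of downloads.
I have a problem with user statistics.
I need to count unique users over a period (1 day, 6 days, 7 days, a month, or any other span) by publisher_id, by publisher_id + app_id, or even across all publisher_ids.
I need to count on the fly because I don't know in advance which period the user will choose.
The raw session data is: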

CREATE TABLE tests2.raw_sessions (
    date_event timeuuid,
    year int,
    month int,
    day int,
    hour int,
    publisher_id uuid,
    app_id text,
    user_id text,
    session_id text,
    PRIMARY KEY (date_event, year, month, day, hour, publisher_id, app_id, user_id, session_id)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC, publisher_id ASC, app_id ASC, user_id ASC, session_id ASC);

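I created multiple tables and tried a lot of things in Cassandra. I tried the DISTINCT keyword, but it only works with static columns (mine are not static) and with partition key columns (whereas I need to filter by date, publisher id, and app_id).
I am considering a Postgres database, but that is not really optimal with Kafka streams, is it?
What approach should I use to solve this problem?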

scala apache-kafka apache-spark cassandra-2.0

Source: https://stackoverflow.com/questions/40861416/distinct-user-in-cassandra-how-to-do-that

2 Answers

#1 to94eoyn

Do you absolutely need exact counts? If not, an estimation data structure such as HyperLogLog would help a lot.
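To make the HyperLogLog suggestion concrete, here is a minimal sketch in Scala. It is an illustration under assumptions rather than a production implementation: the class name `HyperLogLog`, the precision parameter `p`, and the choice of two seeded `MurmurHash3` hashes are not from the original post. The relevant property is that two sketches merge by taking the register-wise maximum, so sketches kept per day could be combined for whatever period the user picks.

```scala
import scala.util.hashing.MurmurHash3

// Minimal HyperLogLog sketch (illustrative only).
// p index bits give m = 2^p registers; the standard error is roughly 1.04 / sqrt(m).
final class HyperLogLog(val p: Int) {
  require(p >= 4 && p <= 16, "p must be between 4 and 16")
  private val m: Int = 1 << p
  private val registers = new Array[Byte](m)

  // Assumption: build a 64-bit hash from two 32-bit MurmurHash3 hashes with different seeds.
  private def hash64(s: String): Long = {
    val hi = MurmurHash3.stringHash(s, 17)
    val lo = MurmurHash3.stringHash(s, 31)
    (hi.toLong << 32) | (lo.toLong & 0xffffffffL)
  }

  def add(userId: String): Unit = {
    val h = hash64(userId)
    val idx = (h >>> (64 - p)).toInt // top p bits select a register
    val rest = h << p                // remaining 64 - p bits, left-aligned
    // rank = 1-based position of the first 1-bit in the remaining bits
    val rank = math.min(java.lang.Long.numberOfLeadingZeros(rest) + 1, 64 - p + 1)
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  // Register-wise max: the merged sketch equals the sketch of the union of both inputs.
  def merge(other: HyperLogLog): HyperLogLog = {
    require(other.p == p, "sketches must use the same precision")
    val out = new HyperLogLog(p)
    for (i <- 0 until m)
      out.registers(i) = math.max(registers(i), other.registers(i)).toByte
    out
  }

  def estimate: Double = {
    val alpha = m match {
      case 16 => 0.673
      case 32 => 0.697
      case 64 => 0.709
      case _  => 0.7213 / (1.0 + 1.079 / m)
    }
    val sum = registers.map(r => math.pow(2.0, -r)).sum
    val raw = alpha * m * m / sum
    val zeros = registers.count(_ == 0)
    // Small-range correction: fall back to linear counting while registers are still empty.
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}
```

With `p = 12` the sketch uses 4096 one-byte registers per counter, regardless of how many users it has seen. Adding the same `user_id` repeatedly never changes the estimate.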

#2 efzxgjgh

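In Cassandra data modeling, data duplication is very useful. Cassandra is a write-optimized database, and writes are cheap. When modeling data, always think in terms of the individual queries you need to serve.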

Unique users list for a period by publisher_id

If you look at the requirement, it breaks down into three tasks.

1. Unique users by publisher id for a period of a day.
2. Unique users by publisher id for a period of a month.
3. Unique users by publisher id for a period of a year.

A better approach is to create three different tables:

CREATE TABLE users_by_year (
    year int,
    month int,
    day int,
    hour int,
    publisher_id uuid,
    app_id text,
    user_id text,
    session_id text,
    PRIMARY KEY ((year, publisher_id), user_id)
) WITH CLUSTERING ORDER BY (user_id DESC);

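CREATE TABLE users_by_month (
    year int,
    month int,
    day int,
    hour int,
    timestamp int,
    publisher_id uuid,
    app_id text,
    user_id text,
    session_id text,
    -- year is part of the partition key so that the same month of
    -- different years does not end up in one partition
    PRIMARY KEY ((year, month, publisher_id), user_id)
) WITH CLUSTERING ORDER BY (user_id DESC);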

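CREATE TABLE users_by_day (
    year int,
    month int,
    day int,
    hour int,
    timestamp int,
    publisher_id uuid,
    app_id text,
    user_id text,
    session_id text,
    -- year and month are part of the partition key so that the same
    -- day number of different months does not end up in one partition
    PRIMARY KEY ((year, month, day, publisher_id), user_id)
) WITH CLUSTERING ORDER BY (user_id DESC);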

These tables will hold the unique users per publisher_id for each year, month, and day.
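One caveat for arbitrary periods such as "6 days": a user who is active on two different days has a row in both daily partitions, so per-day counts cannot simply be added up. The user_id sets of the days in the range have to be unioned before counting. The sketch below shows this with an in-memory stand-in for the per-day table; the object `UniqueUsers`, the type `DayTable`, and the use of `LocalDate` as the day key are assumptions made for illustration, not part of the answer.

```scala
import java.time.LocalDate
import java.util.UUID

object UniqueUsers {
  // Hypothetical in-memory stand-in for a per-day table:
  // key = (day, publisher_id), value = the distinct user_ids stored in that partition.
  type DayTable = Map[(LocalDate, UUID), Set[String]]

  // Distinct users for one publisher over [from, to], both ends inclusive.
  def uniqueUsers(table: DayTable, publisher: UUID, from: LocalDate, to: LocalDate): Int = {
    val days = Iterator.iterate(from)(_.plusDays(1)).takeWhile(d => !d.isAfter(to))
    // Union the per-day sets; a user seen on several days is counted once.
    days.flatMap(d => table.getOrElse((d, publisher), Set.empty[String])).toSet.size
  }
}
```

In a real application each per-day set would come from one partition read, so a range of N days costs N partition reads plus the union on the client side.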

Unique users filtered by publisher_id

CREATE TABLE users_by_publisherid (
    year int,
    month int,
    day int,
    hour int,
    timestamp int,
    publisher_id uuid,
    app_id text,
    user_id text,
    session_id text,
    PRIMARY KEY (publisher_id, user_id)
) WITH CLUSTERING ORDER BY (user_id DESC);

This table will hold the unique users for each publisher_id.
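The reason this table only ever holds unique users is that a Cassandra INSERT with an existing primary key overwrites the row instead of adding a second one. The toy model below mimics that behavior for the (publisher_id, user_id) key; the class `UsersByPublisher` and its methods are hypothetical names used only for illustration, and a mutable map stands in for the table.

```scala
import java.util.UUID
import scala.collection.mutable

// Toy model of users_by_publisherid: one entry per (publisher_id, user_id) primary key.
// As with a Cassandra INSERT, writing an existing key overwrites the row.
final class UsersByPublisher {
  // primary key -> latest session_id written for that key
  private val rows = mutable.Map.empty[(UUID, String), String]

  def insert(publisherId: UUID, userId: String, sessionId: String): Unit =
    rows((publisherId, userId)) = sessionId

  // Comparable in spirit to counting the rows of one partition.
  def uniqueUsers(publisherId: UUID): Int =
    rows.keysIterator.count(_._1 == publisherId)
}
```

Writing every session of a user into the table is therefore safe: repeated sessions by the same user update one row, and the row count of a partition is the number of distinct users for that publisher.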
