Cassandra table design

Posted by vjrehmav on 2021-06-15 in Cassandra

I have the following data:

select * from test;

 department | employee | batch_number | hash
------------+----------+--------------+-------
 dep1       | Bart     |            1 | hash1
 dep1       | Bart     |            1 | hash2
 dep1       | Lisa     |            3 | hash3
 dep1       | Lisa     |            4 | hash4
 dep1       | John     |            5 | hash5
 dep1       | Lucy     |            6 | hash6
 dep1       | Bart     |            7 | hash7
 dep1       | Bart     |            7 | hash8

I want to query this data with a where clause on batch_number, an ordering on batch_number, and an in predicate on employee.
In a relational database, this would look like:

select * from test 
  where department='dep1' 
  and employee in ('Bart','Lucy','John') 
  and batch_number >= 2 
  order by batch_number desc 
  limit 3;

 department | employee | batch_number | hash
------------+----------+--------------+-------
 dep1       | Bart     |            7 | hash7
 dep1       | Bart     |            7 | hash8
 dep1       | Lucy     |            6 | hash6

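I'm having trouble modelling a table for this query in Cassandra. department will be my partition key, and hash must be part of the primary key. But I'm struggling with the clustering keys and/or (SSTable-attached) secondary indexes.
Since I want to order on batch_number, I tried making it a clustering key: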

CREATE TABLE keyspace.test(
    department      TEXT,
    batch_number    INT,
    hash            TEXT,
    employee        TEXT,
    PRIMARY KEY ((department), batch_number, hash)
) WITH CLUSTERING ORDER BY (batch_number DESC);
CREATE INDEX tst_emp ON keyspace.test (employee);

But this doesn't allow an in predicate on my indexed column:

select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="IN predicates on non-primary-key columns (employee) is not yet supported"

So I tried adding the employee column as a clustering key:

CREATE TABLE keyspace.test(
    department      TEXT,
    batch_number    INT,
    hash            TEXT,
    employee        TEXT,
    PRIMARY KEY ((department), batch_number, hash, employee)
) WITH CLUSTERING ORDER BY (batch_number DESC);

But this fails because I then can't use a non-EQ relation on batch_number:

select * from keyspace.test where department='dep1' and batch_number > 1 and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "employee" cannot be restricted (preceding column "batch_number" is restricted by a non-EQ relation)"

But whenever I put employee before batch_number, I lose the ability to order on batch_number:

CREATE TABLE keyspace.test(
    department      TEXT,
    employee        TEXT,
    batch_number    INT,
    hash            TEXT,
    PRIMARY KEY ((department), employee, batch_number, hash)
);

select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John') ORDER BY batch_number DESC;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"

So what table design would allow such a query? Can this be done in Cassandra at all?
Edit:
Other queries I'd like to be able to run on this table include:

select * from keyspace.test where department='X' and batch_number=Y

and

delete from keyspace.test where department='X'

q9rjltbz1#

Use a materialized view, which lets you re-arrange the data:

CREATE MATERIALIZED VIEW mv_test AS 
SELECT
   department,
   batch_number,
   employee,
   hash 
FROM
   test 
WHERE
   department IS NOT NULL 
   AND batch_number IS NOT NULL 
   AND employee IS NOT NULL 
   AND hash IS NOT NULL 
PRIMARY KEY (department, employee, batch_number, hash)
WITH CLUSTERING ORDER BY (batch_number DESC);

I can then run the following query:

SELECT * FROM mv_test 
WHERE
   department = 'dep1' 
   AND employee IN 
   (
      'Bart',
      'Lisa'
   )
   AND batch_number > 3;

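The results come back in clustering order:

department | employee | batch_number | hash
------------+----------+--------------+-------
       dep1 |     Bart |            7 | hash7
       dep1 |     Bart |            7 | hash8
       dep1 |     Lisa |            4 | hash4

The > clause is an inequality (range) relation. IN, although it takes multiple values, is still an exact-match restriction, which is why I believe you can filter on the keys without any problem. Since batch_number is the last column you want to filter on, any kind of where clause is allowed on it. I assume you will always supply department.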
Note that materialized views affect performance, specifically write performance. Read performance, however, is better than with ALLOW FILTERING.
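The key-restriction reasoning above can be modelled in a few lines of Python. This is a deliberately simplified sketch, not Cassandra's actual validation code: the function name `check_clustering_restrictions` and the 'EQ' / 'IN' / 'RANGE' labels are made up for illustration, and it assumes the partition key is fully specified, with no secondary index and no ALLOW FILTERING.

```python
def check_clustering_restrictions(clustering_columns, restrictions):
    """Return None if the restrictions look acceptable, else a reason string.

    clustering_columns: clustering column names in PRIMARY KEY order.
    restrictions: maps a column name to 'EQ', 'IN' or 'RANGE' (hypothetical labels).
    """
    range_column = None         # first column restricted by a non-EQ relation
    unrestricted_column = None  # first column left without any restriction
    for column in clustering_columns:
        kind = restrictions.get(column)
        if kind is None:
            if unrestricted_column is None:
                unrestricted_column = column
            continue
        if range_column is not None:
            return (f"{column} cannot be restricted: preceding column "
                    f"{range_column} is restricted by a non-EQ relation")
        if unrestricted_column is not None:
            return (f"{column} cannot be restricted: preceding column "
                    f"{unrestricted_column} is not restricted")
        if kind == "RANGE":
            range_column = column
    return None


# Key layout of the view: employee, batch_number, hash.
view_result = check_clustering_restrictions(
    ["employee", "batch_number", "hash"],
    {"employee": "IN", "batch_number": "RANGE"},
)

# Key layout of the question's second attempt: batch_number, hash, employee.
second_attempt_result = check_clustering_restrictions(
    ["batch_number", "hash", "employee"],
    {"batch_number": "RANGE", "employee": "IN"},
)
```

Under this model the view's layout passes because the range relation sits on the last restricted column, while the second attempt from the question is rejected for the same reason the server reported.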
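Update:
The order specified at the end of the materialized view is on batch_number. However, rows are ordered first by department, then by employee, and only then by batch_number, so a global ordering on batch_number across several employees is not guaranteed. As far as I know, there is no way around this. A different database solution may be preferable.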
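To make the ordering caveat concrete, here is a small Python sketch that models it on the question's sample rows. It assumes the view sorts employee ascending (the default when no direction is given), batch_number descending, then hash ascending. The helper names `view_order` and `global_order` are illustrative only, and no database is involved.

```python
# Sample rows from the question that match
# department = 'dep1' AND employee IN ('Bart', 'Lucy', 'John') AND batch_number >= 2.
# Tuple layout: (department, employee, batch_number, hash)
matching = [
    ("dep1", "Bart", 7, "hash7"),
    ("dep1", "Bart", 7, "hash8"),
    ("dep1", "John", 5, "hash5"),
    ("dep1", "Lucy", 6, "hash6"),
]


def view_order(rows):
    """Assumed clustering order of the view: employee ASC, batch_number DESC, hash ASC."""
    return sorted(rows, key=lambda r: (r[1], -r[2], r[3]))


def global_order(rows):
    """What the relational ORDER BY batch_number DESC asks for (stable for ties)."""
    return sorted(rows, key=lambda r: -r[2])


first_three_from_view = view_order(matching)[:3]
first_three_wanted = global_order(matching)[:3]
```

With these rows, taking the first three in view order picks John's batch 5 ahead of Lucy's batch 6, whereas the relational query returns Lucy's row. A LIMIT applied in view order can therefore drop rows that belong in the answer.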
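Update 2:
As noted in the Apache mailing list thread (see the comments below), materialized views are not considered production-ready. DataStax, however, considers them usable provided its best practices are followed. Personally, I have had no problems with materialized views. Admittedly, that was on a simple single-datacenter cluster; since the best practices mention more complex setups, views may well break down in those cases.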


w1e3prcc2#

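You can create a secondary index on employee and, if you want, even remove it from the primary key. You will have to stop using IN, but you can split the query into one query per employee and combine the results on the client side.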

CREATE TABLE tk.test_good(
    department      TEXT,
    batch_number    INT,
    employee        TEXT,
    hash            TEXT,
    PRIMARY KEY ((department), batch_number, hash)
) WITH CLUSTERING ORDER BY (batch_number DESC);

CREATE INDEX IF NOT EXISTS employee_idx ON tk.test_good ( employee );

select * from tk.test_good where department='dep1' and employee='Bart' and batch_number >= 2 limit 3;
select * from tk.test_good where department='dep1' and employee='Lucy' and batch_number >= 2 limit 3;
select * from tk.test_good where department='dep1' and employee='John' and batch_number >= 2 limit 3;

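The drawback of this approach is that the index may grow too large. I don't know the size of your data set, though, so I'll leave that for you to judge.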
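The client-side combination step could look roughly like the Python sketch below. It assumes each per-employee query returns its rows already sorted by batch_number descending (the table's clustering order) and that the rows have been converted to plain dicts. The hard-coded lists stand in for real driver results, and `merge_top` is a hypothetical helper name.

```python
import heapq
from itertools import islice


def merge_top(per_employee_rows, limit):
    """Merge lists that are each sorted by batch_number DESC; keep the first `limit` rows."""
    merged = heapq.merge(
        *per_employee_rows,
        key=lambda row: row["batch_number"],
        reverse=True,  # inputs are sorted largest-first
    )
    return list(islice(merged, limit))


# Stand-ins for the three result sets; a real client would fetch these with its driver.
bart = [
    {"employee": "Bart", "batch_number": 7, "hash": "hash7"},
    {"employee": "Bart", "batch_number": 7, "hash": "hash8"},
]
lucy = [{"employee": "Lucy", "batch_number": 6, "hash": "hash6"}]
john = [{"employee": "John", "batch_number": 5, "hash": "hash5"}]

top3 = merge_top([bart, lucy, john], limit=3)
```

Keeping `limit 3` on each individual query is enough here, because the overall top three can never contain more than three rows from a single employee.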
