dse4.0.1:配置单元计数与cassandra计数不同

rsl1atfo  于 2021-06-03  发布在  Hadoop
关注(0)|答案(3)|浏览(274)

我们正在运行DataStaxEnterprise4.0.1,在向cassandra中插入行,然后查询hive中的 COUNT(1) .
设置:dse 4.0.01,cassandra 2.0,hive,全新集群。在cassandra中插入10000行,然后:

cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000;

 count
-------
 10000

(1 rows)

cqlsh:pageviews>

但是从Hive:

hive> select count(1) from pageviews_v1 limit 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002
Kill Command = /usr/bin/dse hadoop job  -kill job_201403272330_0002
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2014-03-27 23:38:22,129 Stage-1 map = 0%,  reduce = 0%
<snip>
2014-03-27 23:38:49,324 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.31 sec
MapReduce Total cumulative CPU time: 11 seconds 310 msec
Ended Job = job_201403272330_0002
MapReduce Jobs Launched:
Job 0: Map: 4  Reduce: 1   Cumulative CPU: 11.31 sec   HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 310 msec
OK
1723
Time taken: 38.634 seconds, Fetched: 1 row(s)

只有1723排。我很困惑。cql3列族定义为:

CREATE TABLE pageviews_v1 (
  website text,
  date text,
  created timestamp,
  browser_id text,
  ip text,
  referer text,
  user_agent text,
  PRIMARY KEY ((website, date), created, browser_id)
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND
  bloom_filter_fp_chance=0.001000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=1.000000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='NONE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
  compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};

在Hive里我们有:

CREATE EXTERNAL TABLE pageviews_v1(
  website string COMMENT 'from deserializer',
  date string COMMENT 'from deserializer',
  created timestamp COMMENT 'from deserializer',
  browser_id string COMMENT 'from deserializer',
  ip string COMMENT 'from deserializer',
  referer string COMMENT 'from deserializer',
  user_agent string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe'
STORED BY
  'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1',
  'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua')
LOCATION
  'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1'
TBLPROPERTIES (
  'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner',
  'cassandra.ks.name'='pageviews',
  'cassandra.cf.name'='pageviews_v1',
  'auto_created'='true')

其他人也有类似的经历吗?

mpgws1up

mpgws1up1#

根据本文档,可能是配置单元表上的一致性设置。

pinkon5k

pinkon5k2#

将配置单元查询更改为“select count(*)from pageviews\u v1;”

camsedfj

camsedfj3#

问题似乎在于按顺序进行聚类。从配置单元中删除可解决计数误报的。

相关问题