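I am currently processing a large amount of input (Twitter) and trying to run some basic sentiment analysis with Apache Hive. However, I can't figure out how to compare my table of tweetids and body strings against a dictionary. I'll do my best to explain below.
I have two externally stored tables named twitterLoc and twitterNo, followed by: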
-- Dictionary table (Hive has no text type, so word is declared as string)
CREATE EXTERNAL TABLE dict (word string, score int)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES("cassandra.ks.name"="myKeyspace", "cassandra.port"="9160");
-- Target table
DROP TABLE IF EXISTS results;
CREATE EXTERNAL TABLE results(tweetid string, score int)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES("cassandra.ks.name"="twitterverse");
-- Combine the two tables and their relevant columns into one table
DROP TABLE IF EXISTS twitter;
CREATE TABLE twitter(tweetid string, body string)
STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE twitter
SELECT tweetid, body
FROM twitterLoc;
INSERT INTO TABLE twitter
select tweetid, body
from twitterNo;
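From here, I want to accomplish the following:
Split each tweet (called body in the twitter table) into individual words for comparison.
Compare those words against my dictionary to get a "score".
Group the scores by tweetid again.
This is how I have tried it: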
--Compare to dictionary
DROP TABLE IF EXISTS twitterSplit;
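CREATE TABLE twitterSplit(tweetid string, word string)
STORED AS SEQUENCEFILE;
-- split() returns array<string>; LATERAL VIEW explode() emits one row per word
INSERT OVERWRITE TABLE twitterSplit
SELECT t.tweetid, w.word
FROM twitter t
LATERAL VIEW explode(split(t.body, ' ')) w AS word;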
DROP TABLE IF EXISTS scoreTable;
CREATE TABLE scoreTable(tweetid string, word string, score int)
STORED AS SEQUENCEFILE;
-- Join each word to its dictionary entry; the join condition belongs in ON
INSERT OVERWRITE TABLE scoreTable
SELECT twitterSplit.tweetid, twitterSplit.word, dict.score
FROM twitterSplit JOIN dict ON (twitterSplit.word = dict.word);
--Report Scores
INSERT OVERWRITE TABLE results
SELECT tweetid, SUM(score)
FROM ScoreTable
GROUP BY tweetid;
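
The three steps listed earlier (split, look up, sum per tweet) could probably also be collapsed into one query without the intermediate tables. The following is only an untested sketch. It assumes the twitter, dict, and results tables declared in this post, that the words in dict match the tweet tokens exactly (same case, no punctuation attached), and that a single space is an adequate delimiter. The aliases t, w, s, and d are placeholders chosen for illustration.

```sql
-- Sketch only: relies on the twitter, dict, and results tables from this post.
-- split() turns body into array<string>; LATERAL VIEW explode() then
-- produces one (tweetid, word) row per array element.
INSERT OVERWRITE TABLE results
SELECT s.tweetid, SUM(d.score) AS score
FROM (
  SELECT t.tweetid, w.word
  FROM twitter t
  LATERAL VIEW explode(split(t.body, ' ')) w AS word
) s
JOIN dict d ON (s.word = d.word)   -- inner join: only words found in dict count
GROUP BY s.tweetid;
```

Because this is an inner join, a tweet with no words in the dictionary produces no row in results at all rather than a score of 0. If every tweet should appear, a LEFT OUTER JOIN combined with COALESCE(d.score, 0) would be one way to keep them.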