I have been trying to build a Twitter hashtag count Hadoop program. I can successfully extract the text, get the hashtags, and start counting them. One of the first problems I ran into is that many hashtags are very similar (test, tests, tests!, and so on).
I started by stripping all special characters from the string and removing all whitespace from the hashtags, but the problem remained when near-identical variants of the same word showed up (e.g. "eagle", "eagles", "eagle!"). I implemented Dice's coefficient algorithm in a separate class, like this:
// Using Dice's Coefficient algorithm
public class WordSimilarity {

    public static boolean isStringSimilar(String str1, String str2) {
        return doComparison(str1, str2) >= Analyzer.getSimilarity();
    }

    /** @return lexical similarity value in the range [0,1] */
    private static double doComparison(String str1, String str2) {
        // If the strings are too small, do not compare them at all.
        try {
            if (str1.length() > 3 && str2.length() > 3) {
                ArrayList<String> pairs1 = wordLetterPairs(str1.toUpperCase());
                ArrayList<String> pairs2 = wordLetterPairs(str2.toUpperCase());
                int intersection = 0;
                int union = pairs1.size() + pairs2.size();
                for (int i = 0; i < pairs1.size(); i++) {
                    String pair1 = pairs1.get(i);
                    for (int j = 0; j < pairs2.size(); j++) {
                        String pair2 = pairs2.get(j);
                        if (pair1.equals(pair2)) {
                            intersection++;
                            pairs2.remove(j);
                            break;
                        }
                    }
                }
                return (2.0 * intersection) / union;
            } else {
                return 0;
            }
        } catch (NegativeArraySizeException ex) {
            return 0;
        }
    }

    /** @return an ArrayList of 2-character Strings. */
    private static ArrayList<String> wordLetterPairs(String str) {
        ArrayList<String> allPairs = new ArrayList<>();
        // Tokenize the string and put the tokens/words into an array
        String[] words = str.split("\\s");
        // For each word, find the pairs of adjacent characters
        for (int w = 0; w < words.length; w++) {
            String[] pairsInWord = letterPairs(words[w]);
            for (int p = 0; p < pairsInWord.length; p++) {
                allPairs.add(pairsInWord[p]);
            }
        }
        return allPairs;
    }

    /** @return an array of adjacent letter pairs contained in the input string */
    private static String[] letterPairs(String str) {
        int numPairs = str.length() - 1;
        String[] pairs = new String[numPairs];
        for (int i = 0; i < numPairs; i++) {
            pairs[i] = str.substring(i, i + 2);
        }
        return pairs;
    }
}
tl;dr: it compares two words and returns a number between 0 and 1 indicating how similar the two strings are.
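For example, here is a rough sketch of how I expect it to behave (WordSimilarityDemo is only an illustration; the actual threshold comes from Analyzer.getSimilarity(), which is not shown in this post, so assume something around 0.7):

// Hypothetical usage sketch -- not part of the actual project code.
// Assumes Analyzer.getSimilarity() returns a threshold of roughly 0.7.
public class WordSimilarityDemo {
    public static void main(String[] args) {
        // "EAGLE" -> {EA, AG, GL, LE}, "EAGLES" -> {EA, AG, GL, LE, ES}
        // Dice = 2 * 4 / (4 + 5) ≈ 0.89, above the threshold -> "similar"
        System.out.println(WordSimilarity.isStringSimilar("eagle", "eagles")); // true
        // Unrelated words share no letter pairs -> 0.0, below the threshold
        System.out.println(WordSimilarity.isStringSimilar("eagle", "tests")); // false
    }
}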
Then I created a custom WritableComparable (I intend to use it as a value in the project as well, although for now it only serves as a key):
public class Hashtag implements WritableComparable<Hashtag> {

    private Text hashtag;

    public Hashtag() {
        this.hashtag = new Text();
    }

    public Hashtag(String hashtag) {
        this.hashtag = new Text(hashtag);
    }

    public Text getHashtag() {
        return hashtag;
    }

    public void setHashtag(String hashtag) {
        // Remove characters that add no information to the analysis,
        // but cause problems in the result
        this.hashtag = new Text(hashtag);
    }

    public void setHashtag(Text hashtag) {
        this.hashtag = hashtag;
    }

    // compareTo uses the WordSimilarity algorithm to determine whether the hashtags
    // are similar. If they are, they are considered equal.
    @Override
    public int compareTo(Hashtag o) {
        if (o.getHashtag().toString().equalsIgnoreCase(this.getHashtag().toString())) {
            return 0;
        } else if (WordSimilarity.isStringSimilar(this.hashtag.toString(), o.hashtag.toString())) {
            return 0;
        } else {
            return this.hashtag.toString().compareTo(o.getHashtag().toString());
        }
    }

    @Override
    public String toString() {
        return this.hashtag.toString();
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        this.hashtag.write(dataOutput);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.hashtag.readFields(dataInput);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Hashtag)) return false;
        Hashtag hashtag1 = (Hashtag) o;
        return WordSimilarity.isStringSimilar(this.getHashtag().toString(), hashtag1.getHashtag().toString());
    }

    @Override
    public int hashCode() {
        return Objects.hash(getHashtag());
    }
}
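The intent, as the comment above compareTo says, is that two keys wrapping similar strings compare as equal, so the framework should group them together during the sort/shuffle. A tiny illustration (the literal hashtag values here are made up):

// Illustration only -- the hashtag values are invented for this example.
Hashtag a = new Hashtag("#eagle");
Hashtag b = new Hashtag("#eagles");
// Expected to print 0 (i.e. "equal") whenever WordSimilarity considers the
// two strings similar, which depends on the Analyzer threshold.
System.out.println(a.compareTo(b));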
Finally, I wrote the MapReduce code:
public class HashTagCounter {

    private final static IntWritable one = new IntWritable(1);

    public static class HashtagCountMapper extends Mapper<Object, Text, Hashtag, IntWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // If the line does not start with '{', it is not valid JSON. Ignore it.
            if (value.toString().startsWith("{")) {
                Status tweet = null;
                try {
                    // Create a Status object from the raw JSON
                    tweet = TwitterObjectFactory.createStatus(value.toString());
                    if (tweet != null && tweet.getText() != null) {
                        StringTokenizer itr = new StringTokenizer(tweet.getText());
                        while (itr.hasMoreTokens()) {
                            String temp = itr.nextToken();
                            // Check only hashtags
                            if (temp.startsWith("#") && temp.length() >= 3
                                    && LanguageChecker.checkIfStringIsInLatin(temp)) {
                                temp = purifyString(temp);
                                context.write(new Hashtag('#' + temp), one);
                            }
                        }
                    }
                } catch (TwitterException tex) {
                    System.err.println("Twitter Exception thrown: " + tex.getErrorMessage());
                }
            }
        }
    }

    public static class HashtagCountCombiner extends Reducer<Hashtag, IntWritable, Hashtag, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Hashtag key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static class HashtagCountReducer extends Reducer<Hashtag, IntWritable, Hashtag, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Hashtag key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    private static String purifyString(String s) {
        s = s.replaceAll(Analyzer.PURE_TEXT.pattern(), "").toLowerCase();
        s = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
        return s.trim();
    }
}
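The driver class is not included in the post; a minimal sketch of how these pieces could be wired together (the class name and the argument-based paths are placeholders, not the actual project code) would look roughly like this:

// Hypothetical driver sketch -- not part of the original post; names and paths are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HashTagCounterDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hashtag count");
        job.setJarByClass(HashTagCounter.class);
        job.setMapperClass(HashTagCounter.HashtagCountMapper.class);
        job.setCombinerClass(HashTagCounter.HashtagCountCombiner.class);
        job.setReducerClass(HashTagCounter.HashtagCountReducer.class);
        // The custom WritableComparable is the key both out of the map and in the final output
        job.setOutputKeyClass(Hashtag.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}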
Note that all the imports are present in the actual code; I only left them out here to keep an already heavy post from getting longer.
The code runs without errors and mostly works. I say "mostly" because in the part-r-00000 file I get several entries like these:
milwaukee    2
xx    <---- other strings
milwaukee    1
xx    <---- other strings
milwaukee    7
and so on. I tested those strings in Notepad and they look exactly identical (I initially thought it might be an encoding issue, but it is not: all such hashtags appear as UTF-8 in the original file).
Not every hashtag behaves like this, but a fair number of them do. In theory I could run a second MapReduce job on the output and merge them properly without much trouble (we are talking about a ~100 KB file produced from a 10 GB input), but that seems like a waste of computing power.
This leads me to believe that I am missing something about how MapReduce works, and it is driving me crazy. Can someone explain what I am doing wrong, and where my logic error is?
1 Answer
My guess is that the Hashtag implementation is causing the problem. Text and String behave differently when they hit multi-byte characters in a UTF-8 sequence. Also, Text is mutable while String is not, and string-style operations on Text may not behave the way the equivalent String operations do.
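A small sketch of the kind of difference I mean (this example is mine, not taken from the book): Text measures and indexes by UTF-8 byte offsets, while String works in Java chars, so the two diverge as soon as a multi-byte character appears:

// Illustration only: Text counts/indexes UTF-8 bytes, String counts Java chars.
import org.apache.hadoop.io.Text;

public class TextVsString {
    public static void main(String[] args) {
        String s = "caf\u00e9";            // 4 chars; the 'é' takes 2 bytes in UTF-8
        Text t = new Text(s);

        System.out.println(s.length());    // 4 (chars)
        System.out.println(t.getLength()); // 5 (UTF-8 bytes)

        System.out.println(s.charAt(3));   // 'é' (char index 3)
        System.out.println(t.charAt(3));   // 233 -- code point at BYTE offset 3
    }
}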
So just read pages 115 to 118 (both inclusive) from the link below; it opens a PDF of Hadoop: The Definitive Guide:
http://javaarm.com/file/apache/hadoop/books/hadoop-the.definitive.guide_4.edition_a_tom.white_april-2015.pdf
Hope this helps you solve the problem.
Thanks.