I'm running a map-only job in Hadoop. The dataset is a set of HTML pages (returned by a crawler) stored in a single file.
The mapper code is written in Java, and I use jsoup for parsing. The output I want is a key that holds both the contents of the title tag and the contents of the meta tag. Ideally, I should get 1592 map output records; instead I got 3184.
The concatenation I am attempting with this line of code is not happening:
String MN_Job = (jobT + "\t" + jobsDetail);
Instead, each part is written out on its own, which doubles the number of outputs. What am I doing wrong?
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JobsDataMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text keytext = new Text();
    private Text valuetext = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Document doc = Jsoup.parse(line);
        // Selects both the title element and the meta element into one collection.
        Elements desc = doc.select("head title, meta[name=twitter:description]");
        for (Element jobhtml : desc) {
            Elements title = jobhtml.select("title");
            String jobT = "";
            for (Element titlehtml : title) {
                jobT = titlehtml.text();
            }
            Elements meta = jobhtml.select("meta[name=twitter:description]");
            String jobsDetail = "";
            for (Element metahtml : meta) {
                String content = metahtml.attr("content");
                String content1 = content.replaceAll("\\p{Punct}+", " ");
                // Strip common English stop words (case-insensitive):
jobsDetail = content1.replaceAll(" (?i)a | (?i)able | (?i)about | (?i)across | (?i)after | (?i)all | (?i)almost | (?i)also | (?i)am | (?i)among | (?i)an | (?i)and | (?i)any | (?i)are | (?i)as | (?i)at | (?i)be | (?i)because | (?i)been | (?i)but | (?i)by | (?i)can | (?i)cannot | (?i)could | (?i)dear | (?i)did | (?i)do | (?i)does | (?i)either | (?i)else | (?i)ever | (?i)every | (?i)for | (?i)from | (?i)get | (?i)got | (?i)had | (?i)has | (?i)have | (?i)he | (?i)her | (?i)hers | (?i)him | (?i)his | (?i)how | (?i)however | (?i)i | (?i)if | (?i)in | (?i)into | (?i)is | (?i)it | (?i)its | (?i)just | (?i)least | (?i)let | (?i)like | (?i)likely | (?i)may | (?i)me | (?i)might | (?i)most | (?i)must | (?i)my | (?i)neither | (?i)no | (?i)nor | (?i)not | (?i)nbsp | (?i)of | (?i)off | (?i)often | (?i)on | (?i)only | (?i)or | (?i)other | (?i)our | (?i)own | (?i)rather | (?i)said | (?i)say | (?i)says | (?i)she | (?i)should | (?i)since | (?i)so | (?i)some | (?i)than | (?i)that | (?i)the | (?i)their | (?i)them | (?i)then | (?i)there | (?i)these | (?i)they | (?i)this | (?i)tis | (?i)to | (?i)too | (?i)twas | (?i)us | (?i)wants | (?i)was | (?i)we | (?i)were | (?i)what | (?i)when | (?i)where | (?i)which | (?i)while | (?i)who | (?i)whom | (?i)why | (?i)will | (?i)with | (?i)would | (?i)yet | (?i)you | (?i)your "," ");
            }
            String IT_Job = (jobT + "\t" + jobsDetail);
            keytext.set(IT_Job);
            valuetext.set("JobDetail");
            context.write(keytext, valuetext);
        }
    }
}
3 Answers
Answer 1 (jum4pzuy1)
I made a change to the original code and removed the unnecessary loops. The previous code behaved like this: when there was a title in the record it was written out, and when there was meta content it was written out as well, so there were two writes per HTML file. A sketch of the kind of change involved follows.
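The answer's modified code was not preserved here, so the following is only a reconstruction of the change it describes, not the answerer's exact code: extract the title and the meta content once per parsed document and perform a single write, so that both fields always land in the same record. It is written as a drop-in replacement for the map() method of JobsDataMapper above, reusing its keytext and valuetext fields.

    // Sketch: one write per document instead of one write per matched element.
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Document doc = Jsoup.parse(value.toString());

        // On an Elements collection, text() returns the combined text of all
        // matches, and attr() returns the attribute of the first element that
        // has it; both return "" when nothing matched.
        String jobT = doc.select("head title").text();
        String jobsDetail = doc.select("meta[name=twitter:description]").attr("content")
                .replaceAll("\\p{Punct}+", " ");  // stop-word removal omitted for brevity

        // A single write, emitted only when the page yielded something.
        if (!jobT.isEmpty() || !jobsDetail.isEmpty()) {
            keytext.set(jobT + "\t" + jobsDetail);
            valuetext.set("JobDetail");
            context.write(keytext, valuetext);
        }
    }

With one write per parsed document, each HTML page contributes exactly one output record, which matches the expected count of 1592.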
Answer 2 (wbgh16ku2)
Look at the numbers:
3184 / 2 = 1592
I think your file is simply duplicated in the input folder. I can't be sure, because you haven't shown the code you use to submit the job, but perhaps you can check with a simple listing of the input directory. When you submit, make sure the folder contains only one file, or reference just that single file in your submission logic.
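As an illustration only (the paths and job name below are assumptions, not taken from the question), a driver could list what the job is about to read, and point FileInputFormat at the single file rather than at the whole folder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Inside the driver, before submitting the job:
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(new Path("/user/me/input"))) {  // assumed path
        System.out.println(status.getPath() + "  (" + status.getLen() + " bytes)");
    }

    Job job = Job.getInstance(conf, "jobs-data");  // assumed job name
    // Name the single file explicitly instead of the directory:
    FileInputFormat.addInputPath(job, new Path("/user/me/input/pages.html"));  // assumed file

If the listing shows two copies of the input file, that alone would explain the doubled record count.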
Answer 3 (j91ykkif3)
Edit: I know what the problem is. But the solution may not be obvious in MapReduce: you may have to write your own custom RecordReader. Let me explain the issue. In your code you read line by line, and you apply the combined selector (doc.select("head title, meta[name=twitter:description]")) to each line you read. But obviously a given line may contain only a title tag or only a <meta name=twitter:description> tag. So you read one of them and store it, while the other stays blank. Each time, only one of the two variables, jobT and jobsDetail, holds any data: in one record the first is empty, in the next the other is empty. So if you expect n records, you get 2n records. Similarly, if you tried to extract three fields, you would get 3n records, so you can test this theory by extracting one more field and checking whether you get three times the expected record count.

If the theory proves correct, you may want to separate the extracted web pages with a specific delimiter string. Then you would write a custom RecordReader that reads one HTML file at a time (up to the delimiter) and processes the whole file at once. That way you get the title and the meta tags together. A sketch of the idea follows.
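Assuming the crawler can write a known marker between pages (the marker string below is hypothetical), one way to get whole-page records without hand-writing a RecordReader is the custom record delimiter that Hadoop's stock TextInputFormat already supports:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // TextInputFormat's LineRecordReader honors this property, so each map()
    // call receives one complete HTML page instead of one line of HTML.
    Configuration conf = new Configuration();
    conf.set("textinputformat.record.delimiter", "<!--END-OF-PAGE-->");  // hypothetical marker
    Job job = Job.getInstance(conf, "jobs-data");
    // ... the rest of the job setup is unchanged; JobsDataMapper then parses
    // value.toString() as a full page, so title and meta arrive together.

Note that the property should be set before creating the Job (or set via job.getConfiguration() afterwards), since Job copies the Configuration it is given.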