Java code in Hadoop

ewm0tg9j · Posted 2021-06-04 in Hadoop

I am running a map-only job in Hadoop. The dataset is a set of HTML pages (returned by a crawler) in a single file.

The mapper code is written in Java, and I use jsoup for parsing. The output I want is a key containing both the contents of the title tag and the contents of the meta tag. Ideally I should get 1592 map output records; instead I got 3184.

The concatenation I am trying to perform with this line of code is not happening:

String MN_Job = (jobT + "\t" + jobsDetail);

Instead, each piece is output separately, which is why the number of outputs doubles. What am I doing wrong?

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JobsDataMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Text keytext = new Text();
    private Text valuetext = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {

        String line = value.toString();

        Document doc = Jsoup.parse(line);
        Elements desc = doc.select("head title, meta[name=twitter:description]");

        for (Element jobhtml : desc) {
            Elements title = jobhtml.select("title");
            String jobT = "";
            for (Element titlehtml : title) {
                jobT = titlehtml.text();
            }

            Elements meta  = jobhtml.select("meta[name=twitter:description]"); 
            String jobsDetail ="";

            for (Element metahtml : meta) {
                String content = metahtml.attr("content");
                String content1 = content.replaceAll("\\p{Punct}+", " ");
                jobsDetail = content1.replaceAll(" (?i)a | (?i)able | (?i)about | (?i)across | (?i)after | (?i)all | (?i)almost | (?i)also | (?i)am | (?i)among | (?i)an | (?i)and | (?i)any | (?i)are | (?i)as | (?i)at | (?i)be | (?i)because | (?i)been | (?i)but | (?i)by | (?i)can | (?i)cannot | (?i)could | (?i)dear | (?i)did | (?i)do | (?i)does | (?i)either | (?i)else | (?i)ever | (?i)every | (?i)for | (?i)from | (?i)get | (?i)got | (?i)had | (?i)has | (?i)have | (?i)he | (?i)her | (?i)hers | (?i)him | (?i)his | (?i)how | (?i)however | (?i)i | (?i)if | (?i)in | (?i)into | (?i)is | (?i)it | (?i)its | (?i)just | (?i)least | (?i)let | (?i)like | (?i)likely | (?i)may | (?i)me | (?i)might | (?i)most | (?i)must | (?i)my | (?i)neither | (?i)no | (?i)nor | (?i)not | (?i)nbsp | (?i)of | (?i)off | (?i)often | (?i)on | (?i)only | (?i)or | (?i)other | (?i)our | (?i)own | (?i)rather | (?i)said | (?i)say | (?i)says | (?i)she | (?i)should | (?i)since | (?i)so | (?i)some | (?i)than | (?i)that | (?i)the | (?i)their | (?i)them | (?i)then | (?i)there | (?i)these | (?i)they | (?i)this | (?i)tis | (?i)to | (?i)too | (?i)twas | (?i)us | (?i)wants | (?i)was | (?i)we | (?i)were | (?i)what | (?i)when | (?i)where | (?i)which | (?i)while | (?i)who | (?i)whom | (?i)why | (?i)will | (?i)with | (?i)would | (?i)yet | (?i)you | (?i)your "," ");
            }

            String IT_Job = (jobT + "\t" + jobsDetail);

            keytext.set(IT_Job) ;
            valuetext.set("JobDetail");
            context.write( keytext, valuetext );        
        }
    }
}

jum4pzuy1#

I made changes to the original code, removing the unnecessary loops. The previous code worked like this: when a record had a title, it was written out, and when it had content, it was written out again. So there were two writes per HTML file.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JobsDataMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Text keytext = new Text();
    private Text valuetext = new Text();
    private String jobT = "";
    private String jobName = "";

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String line = value.toString();

        Document doc = Jsoup.parse(line);
        Elements desc = doc.select("head title, meta[name=twitter:description]");

        for (Element jobhtml : desc) {
            // Remember the most recent non-empty title.
            Elements title = jobhtml.select("title");
            jobT = title.text();
            if (jobT.length() > 0) {
                jobName = jobT;
            }

            // Extract the meta description, strip punctuation, lower-case
            // it, and drop English stop words.
            Elements meta = jobhtml.select("meta[name=twitter:description]");
            String content = meta.attr("content");
            String content1 = content.replaceAll("\\p{Punct}+", " ");
            String jobsDetail = content1.toLowerCase();
            jobsDetail = jobsDetail.replaceAll(" a | able | about | across | after | all | almost | also | am | among | an | and | any | are | as | at | be | because | been | but | by | can | cannot | could | dear | did | do | does | either | else | ever | every | for | from | get | got | had | has | have | he | her | hers | him | his | how | however | i | if | in | into | is | it | its | just | least | let | like | likely | may | me | might | most | must | my | neither | no | nor | not | nbsp | of | off | often | on | only | or | other | our | own | rather | said | say | says | she | should | since | so | some | than | that | the | their | them | then | there | these | they | this | tis | to | too | twas | us | wants | was | we | were | what | when | where | which | while | who | whom | why | will | with | would | yet | you | your ", " ");

            // Write once per page, and only when a description was found, so
            // the title and the description land in the same output record.
            if (jobsDetail.length() > 0) {
                String MN_Job = (jobName + "\t" + jobsDetail);
                keytext.set(MN_Job);
                valuetext.set("JobInIT");
                context.write(keytext, valuetext);
            }
        }
    }
}

wbgh16ku2#

Look at the numbers: 3184 / 2 = 1592.
I suspect your file is simply duplicated in the input folder. I can't be sure, because you haven't shown the code you use to submit the job, but perhaps you can check with a simple:

bin/hadoop fs -ls /your/input_path

Before you submit, make sure there is only one file in there, or reference only that single file in your submission logic.
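
For reference, "referencing only the single file" can be done in the driver by adding the file itself as the input path. A minimal driver sketch, assuming the standard new-API Job setup; the class name and both paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobsDataDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "jobs-data");
        job.setJarByClass(JobsDataDriver.class);
        job.setMapperClass(JobsDataMapper.class);
        job.setNumReduceTasks(0); // map-only: mapper output is written directly
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Add the one file itself, not its parent folder, so a stray
        // duplicate in the folder cannot double the record count.
        FileInputFormat.addInputPath(job, new Path("/your/input_path/pages.html"));
        FileOutputFormat.setOutputPath(job, new Path("/your/output_path"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}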


j91ykkif3#

Edit: I see what the problem is. But the fix may not be obvious in MapReduce: you might have to write your own custom RecordReader. Let me explain the problem.
Your code reads the input line by line, and then applies this to each line it reads:

Elements desc = doc.select("head title, meta[name=twitter:description]");

But obviously a single line can contain only a title tag or only a <meta name=twitter:description> tag, not both. So you read one of them and store it, while the other stays blank. Each time, only one of the two variables, jobT and jobsDetail, holds any data. For the snippet:

String IT_Job = (jobT + "\t" + jobsDetail);

one time the first part is blank, and the next time the other part is blank. So if you expect n records, you get 2n records. Likewise, if you tried to extract three fields, you would get 3n records; you can test this theory by extracting one more field and checking whether you get three times the expected number of records.
If the theory proves correct, you could separate the extracted web pages with a specific delimiter string, and then write a custom RecordReader that reads one HTML file at a time according to that delimiter and processes the whole HTML file at once. That way you get the title and meta tags together.
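
If you can put such a delimiter between the pages, a handwritten RecordReader may not even be necessary: the stock TextInputFormat honors the textinputformat.record.delimiter property, so each call to map() receives everything up to the next delimiter, i.e. one whole page. A minimal sketch, assuming a made-up <!--PAGE--> marker line separates the pages:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WholePageDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // LineRecordReader splits records on this marker instead of '\n',
        // so the mapper sees the title and meta tags of one page together.
        conf.set("textinputformat.record.delimiter", "<!--PAGE-->");

        Job job = Job.getInstance(conf, "jobs-data-whole-pages");
        job.setJarByClass(WholePageDriver.class);
        job.setMapperClass(JobsDataMapper.class);
        job.setNumReduceTasks(0); // map-only
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Jsoup.parse() then gets the complete page, so both tags are present in the same map() call and the concatenation produces one record per page.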
