Removing entire sentences that contain a specific word using MapReduce

Posted by xwbd5t1u on 2021-05-29 in Hadoop

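I am learning MapReduce, and I want to read an input file sentence by sentence and write each sentence to the output file only if it does not contain the word "snake".
E.g., input file: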

This is my first sentence. This is my first sentence.
This is my first sentence.

The snake is an animal. This is the second sentence. This is my third sentence.

Another sentence. Another sentence with snake.

Then the output file should be:

This is my first sentence. This is my first sentence.
This is my first sentence.

This is the second sentence. This is my third sentence.

Another sentence.

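To do this, I check in the map method whether the sentence (value) contains the word snake. If it does not, I write the sentence to the context.
In addition, I set the number of reducer tasks to 0; otherwise the sentences in the output file come out in random order (e.g., the first sentence, then the third, then the second, and so on).
My code does filter out the sentences containing the word snake correctly, but the problem is that it writes every sentence on a new line, like this: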

This is my first sentence. 
 This is my first sentence. 

This is my first sentence. 
 This is the second sentence. 
 This is my third sentence. 

Another sentence. 

.

How can I write a sentence on a new line only if it starts on a new line in the input text? Here is my code:

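import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoveSentence {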

    public static class SentenceMapper extends Mapper<Object, Text, Text, NullWritable>{

        private Text removeWord = new Text ("snake");

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            if (!value.toString().contains(removeWord.toString())) {
                Text currentSentence = new Text(value.toString()+". ");
                context.write(currentSentence, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("textinputformat.record.delimiter", ".");

        Job job = Job.getInstance(conf, "remove sentence");
        job.setJarByClass(RemoveSentence.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setMapperClass(SentenceMapper.class);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

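This and this other solution say that it should be enough to call context.write(word, null); but that did not work in my case.
There is one more problem, with conf.set("textinputformat.record.delimiter", "."); . This is how I define the delimiter between sentences, and because of it some sentences in the output file start with a space (e.g., the second " This is my first sentence."). As an alternative, I tried setting it to conf.set("textinputformat.record.delimiter", ". "); (with a space after the period), but then the Java application does not write all the sentences to the output file.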
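The delimiter behaviour described above can be reproduced outside Hadoop. The sketch below is only an approximation: it uses String.split as a stand-in for the record reader, on the assumption that both cut the text at every literal occurrence of the delimiter, and the class name DelimiterDemo, the method records and the shortened sample text are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class DelimiterDemo {
    // Splits text on a literal delimiter. This stands in for the record
    // reader (an assumption of this sketch, not Hadoop's actual code).
    public static List<String> records(String text, String delimiter) {
        return Arrays.asList(text.split(Pattern.quote(delimiter)));
    }

    public static void main(String[] args) {
        // Shortened sample: two sentences on one line, one on the next,
        // a blank line, then a line that starts with a "snake" sentence.
        String input = "First. Second.\nThird.\n\nThe snake is an animal. Fourth.\n";
        for (String delimiter : new String[] {".", ". "}) {
            System.out.println("delimiter \"" + delimiter + "\":");
            for (String record : records(input, delimiter)) {
                // Show newlines as \n so the record boundaries are visible.
                System.out.println("  [" + record.replace("\n", "\\n") + "]");
            }
        }
    }
}
```

With "." every sentence is its own record, but the space or newline that followed the previous period stays at the front of the next record, which would explain the leading spaces and blank lines. With ". " a sentence that ends a line is followed by a newline rather than a space, so it is not cut there; "Second" and "Third" end up in the same record as the snake sentence and would be dropped together with it by a contains check.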

zzwlnbp8 (answer #1)

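You are close to solving the problem. Think about how your MapReduce program works. The map method receives every sentence separated by "." as a new value (the default delimiter is a newline) and then writes it to the file, and every write ends with a newline. You would need a property that disables writing a newline after every map() call. I am not sure, but I do not think such a property exists.
One workaround is to let it process the input normally, line by line. An example record would be: This is first sentence. This is second snake. This is last. Find the word "snake" and, if it is found, remove everything from just after the previous "." up to the next "." Then pack up the new string and write it to the context.
Of course, if you can find a way to disable the newline after the map() call, that would be the easiest way.
Hope this helps.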
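The workaround above can be sketched in plain Java. This is a minimal illustration rather than the answerer's code: the class name SentenceFilter and the method filterLine are hypothetical, and it assumes that every sentence ends with "." and that sentences should be joined back with a single space.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceFilter {
    // Returns the line with every sentence that contains `word` removed.
    // Assumes "." ends each sentence (an assumption of this sketch).
    public static String filterLine(String line, String word) {
        List<String> kept = new ArrayList<>();
        for (String sentence : line.split("\\.")) {
            String trimmed = sentence.trim();
            // Skip empty fragments and sentences containing the word.
            if (!trimmed.isEmpty() && !trimmed.contains(word)) {
                kept.add(trimmed + ".");
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(filterLine(
            "The snake is an animal. This is the second sentence. This is my third sentence.",
            "snake"));
        System.out.println(filterLine("Another sentence. Another sentence with snake.", "snake"));
    }
}
```

In the mapper, this would presumably be called as filterLine(value.toString(), "snake") with the default newline record delimiter (that is, without setting textinputformat.record.delimiter), writing the result with context.write(new Text(filtered), NullWritable.get()). Each input line then produces at most one output line, so the original line breaks are preserved. Like the contains check in the question, this also matches words such as "snakes".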
