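How can I use just one mapper for multiple input files? Hadoop creates one mapper per file, but I need a single mapper for all of the files.
I tried using CombineFileInputFormat. It gives me one mapper, but the map input contains only one file. I need the map input value to contain the data from all files (text format), like this:
Map input value:
data from file1.txt
data from file2.txt
data from file3.txt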
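import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class WholeFileInputFormat extends CombineFileInputFormat<NullWritable, Text> {
    public WholeFileInputFormat() {
        super();
        // Cap combined splits at 64 MB; files beyond the cap can spill into additional splits.
        setMaxSplitSize(67108864);
    }

    // Never split an individual file: each file is read as a whole.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        if (!(split instanceof CombineFileSplit)) {
            throw new IllegalArgumentException("split must be a CombineFileSplit");
        }
        // CombineFileRecordReader creates one WholeFileRecordReader per file in the split.
        return new CombineFileRecordReader<NullWritable, Text>(
                (CombineFileSplit) split, context, WholeFileRecordReader.class);
    }
}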
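import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class WholeFileRecordReader extends RecordReader<NullWritable, Text> {
    private final Path mFileToRead;      // the one file this reader is responsible for
    private final long mFileLength;      // length of that file in bytes
    private final Configuration mConf;
    private final Text mFileText;        // holds the whole file once it has been read
    private boolean mProcessed;          // true after the single record has been returned

    // CombineFileRecordReader requires exactly this constructor signature.
    public WholeFileRecordReader(CombineFileSplit fileSplit, TaskAttemptContext context,
            Integer pathToProcess) throws IOException {
        mProcessed = false;
        mFileToRead = fileSplit.getPath(pathToProcess);
        mFileLength = fileSplit.getLength(pathToProcess);
        mConf = context.getConfiguration();
        assert 0 == fileSplit.getOffset(pathToProcess);
        // Resolve the file system from the path so non-default file systems also work.
        FileSystem fs = mFileToRead.getFileSystem(mConf);
        assert fs.getFileStatus(mFileToRead).getLen() == mFileLength;
        mFileText = new Text();
    }

    @Override
    public void close() throws IOException {
        mFileText.clear();
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return mFileText;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return mProcessed ? 1.0f : 0.0f;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // no-op: everything is set up in the constructor.
    }

    // Returns exactly one record per file: the complete file contents.
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!mProcessed) {
            if (mFileLength > (long) Integer.MAX_VALUE) {
                throw new IOException("File is longer than Integer.MAX_VALUE.");
            }
            byte[] contents = new byte[(int) mFileLength];
            FileSystem fs = mFileToRead.getFileSystem(mConf);
            FSDataInputStream in = null;
            try {
                // Set the contents of this file.
                in = fs.open(mFileToRead);
                IOUtils.readFully(in, contents, 0, contents.length);
                mFileText.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            mProcessed = true;
            return true;
        }
        return false;
    }
}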
Can you help me?
1 answer
ajsxfq5m #1
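The number of mappers is not driven by the number of files but by the number of blocks that make up those files: Hadoop splits each file into blocks and creates one mapper per block (strictly, per input split, which by default corresponds to one block). Take a look at this SO link to learn more about how Hadoop chooses the number of mappers and reducers.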
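To make the block-based count concrete, here is a minimal sketch of the arithmetic in plain Java. It assumes the default behaviour where each file is cut on block boundaries, and it ignores refinements such as configured minimum or maximum split sizes. The class name SplitCountSketch, the 128 MB block size and the three file sizes are made up for illustration and are not taken from the question.

```java
public class SplitCountSketch {

    /** Number of block-sized splits needed for one file (ceiling division). */
    static long splitsForFile(long fileLength, long blockSize) {
        return (fileLength + blockSize - 1) / blockSize;
    }

    /** Total splits, and therefore map tasks, when every file is split on its own. */
    static long totalSplits(long[] fileLengths, long blockSize) {
        long total = 0;
        for (long length : fileLengths) {
            total += splitsForFile(length, blockSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;                      // assumed block size
        long[] files = {10 * mb, 200 * mb, 300 * mb};   // hypothetical file sizes
        System.out.println("map tasks: " + totalSplits(files, blockSize));
    }
}
```

With these assumed sizes the 10 MB file needs one split, the 200 MB file two and the 300 MB file three, so six map tasks would be started even though there are only three files.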
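If you really do need a single mapper, I have to say that setting the mapred.map.tasks parameter will not work, because it is only a hint to Hadoop, not a mandatory setting. You could try increasing the block size to a very large number. In any case, using a single mapper in Hadoop does not make much sense: you would lose the distributed processing of the data, which is one of the main advantages of such a system.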
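If a single map task is still what you want, one detail of the code in the question is worth spelling out. As far as I can tell, CombineFileRecordReader hands the mapper one record per file, so map() is called once per file inside the same task rather than once with everything. A common way around that is to accumulate in map() and emit once in cleanup(). The sketch below models that pattern in plain Java without any Hadoop classes; SingleMapperSketch, runTask and the emitted list are hypothetical stand-ins for a Mapper subclass, the framework's record loop and context.write, and the pattern only helps when all files end up in one combined split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Plain-Java model of one map task; no Hadoop classes are used.
 * map() stands in for Mapper.map(), cleanup() for Mapper.cleanup().
 */
public class SingleMapperSketch {
    // Buffer that lives for the whole task, like a field on a Mapper subclass.
    private final StringBuilder allData = new StringBuilder();
    // Hypothetical stand-in for context.write(): collects what the task emits.
    private final List<String> emitted = new ArrayList<String>();

    /** Called once per record, i.e. once per file with a whole-file reader. */
    void map(String fileContents) {
        allData.append(fileContents);
        if (!fileContents.endsWith("\n")) {
            allData.append('\n'); // keep consecutive files separated by a newline
        }
    }

    /** Called once, after the last record of the split has been mapped. */
    void cleanup() {
        emitted.add(allData.toString());
    }

    /** Mimics the framework loop: map() for every record, then cleanup(). */
    static List<String> runTask(List<String> files) {
        SingleMapperSketch task = new SingleMapperSketch();
        for (String contents : files) {
            task.map(contents);
        }
        task.cleanup();
        return task.emitted;
    }

    public static void main(String[] args) {
        List<String> out = runTask(Arrays.asList(
                "data from file1.txt", "data from file2.txt", "data from file3.txt"));
        // One emitted value holding the three files, one per line.
        System.out.print(out.get(0));
    }
}
```

Everything is buffered in memory here, so the same approach in a real mapper only suits inputs that fit in the task's heap.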