I have a log file like the following:
Begin ... 12-07-2008 02:00:05 ----> record1
incidentID: inc001
description: blah blah blah
owner: abc
status: resolved
end .... 13-07-2008 02:00:05
Begin ... 12-07-2008 03:00:05 ----> record2
incidentID: inc002
description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
owner: abc
status: resolved
end .... 13-07-2008 03:00:05
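I want to process this with MapReduce, extracting the incident ID, the status, and the time the incident took.
How do I handle these two records, given that records have variable length, and what happens if an input split falls before the end of a record?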
2 Answers
#1
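In your example every record has the same number of lines. If that is always the case, you can use NLineInputFormat. If the number of lines cannot be known in advance, it will likely be harder. (More on NLineInputFormat: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/nlineinputformat.html )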
#2
You need to write your own input format and record reader to ensure that files are split correctly around your record delimiters.
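Basically, your record reader needs to seek to the byte offset of its split and scan forward (reading lines) until it finds a Begin ... line. It then reads lines up to the next end ... line and provides the lines between Begin and end as the input for the next record. It repeats this until it has scanned past the end of the split or hits EOF.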
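This is algorithmically similar to how Mahout's XmlInputFormat handles multi-line XML as input; in fact, you may be able to adapt that source code directly to your case.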
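The scanning rule described above can be sketched outside Hadoop to show how a record that straddles a split boundary is handled. This is a plain-Python simulation, not a Hadoop RecordReader: the function name read_records, the byte-offset arguments, and the rule that a record belongs to the split in which its Begin line starts are assumptions made for illustration.

```python
import io

BEGIN = b"Begin "   # assumed record-start marker, taken from the sample log
END = b"end "       # assumed record-end marker, taken from the sample log


def read_records(stream, start, end):
    """Yield complete records whose Begin line starts inside [start, end).

    `stream` is a seekable binary file; `start` and `end` are the byte
    offsets of one split (hypothetical stand-ins for a Hadoop FileSplit).
    """
    if start > 0:
        # Step back one byte and discard one line, so that we land on the
        # first line that begins at or after `start`.
        stream.seek(start - 1)
        stream.readline()
    else:
        stream.seek(0)

    while True:
        if stream.tell() >= end:
            return          # the next line belongs to the following split
        line = stream.readline()
        if not line:
            return          # EOF
        if not line.startswith(BEGIN):
            continue        # tail of a record owned by the previous split

        record = [line.rstrip(b"\r\n").decode("utf-8")]
        complete = False
        while True:
            # Keep reading even past `end`: this split owns the record.
            line = stream.readline()
            if not line:
                break       # EOF inside a record: drop the partial record
            record.append(line.rstrip(b"\r\n").decode("utf-8"))
            if line.startswith(END):
                complete = True
                break
        if complete:
            yield record


if __name__ == "__main__":
    sample = (
        b"Begin ... 12-07-2008 02:00:05 ----> record1\n"
        b"incidentID: inc001\n"
        b"status: resolved\n"
        b"end .... 13-07-2008 02:00:05\n"
        b"Begin ... 12-07-2008 03:00:05 ----> record2\n"
        b"incidentID: inc002\n"
        b"status: resolved\n"
        b"end .... 13-07-2008 03:00:05\n"
    )
    cut = 60  # an arbitrary split point in the middle of record1
    for lo, hi in ((0, cut), (cut, len(sample))):
        for rec in read_records(io.BytesIO(sample), lo, hi):
            print((lo, hi), rec[0])
```

Because each reader skips forward to the first Begin line at or after its start offset and finishes any record it has begun, every record is emitted by exactly one split wherever the boundary falls.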
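As mentioned in @irw's answer, NLineInputFormat is another option if your records have a fixed number of lines each, but it is inefficient for larger files, because it has to open and read the entire file to discover the line offsets in the input format's getSplits() method.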
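Whichever input format delivers the records, the map step then has to pull out the incident ID, the status, and the elapsed time. A minimal sketch of that parsing in plain Python follows. The function name parse_record is hypothetical, the record is assumed to arrive as a list of lines, and the timestamps are assumed to be day-month-year (the sample contains "13-07-2008", which cannot be month-first).

```python
import re
from datetime import datetime

# Assumed timestamp layout, inferred from the sample log: dd-mm-yyyy HH:MM:SS
TIMESTAMP = re.compile(r"\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}")
FORMAT = "%d-%m-%Y %H:%M:%S"


def _timestamp(line):
    """Extract and parse the first timestamp found on a Begin/end line."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    return datetime.strptime(match.group(0), FORMAT)


def parse_record(lines):
    """Return (incidentID, status, elapsed seconds) for one record."""
    began = ended = None
    fields = {}
    for line in lines:
        if line.startswith("Begin "):
            began = _timestamp(line)
        elif line.startswith("end "):
            ended = _timestamp(line)
        elif ":" in line:
            # "key: value" lines such as incidentID, description, owner, status
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    if began is None or ended is None:
        raise ValueError("record is missing its Begin or end line")
    elapsed = (ended - began).total_seconds()
    return fields["incidentID"], fields["status"], elapsed


if __name__ == "__main__":
    record = [
        "Begin ... 12-07-2008 02:00:05 ----> record1",
        "incidentID: inc001",
        "description: blah blah blah",
        "owner: abc",
        "status: resolved",
        "end .... 13-07-2008 02:00:05",
    ]
    print(parse_record(record))
```

Splitting on the first colon only keeps any colons inside the description text intact, and the length of the description line does not matter to the parser.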