Structuring unstructured data with MapReduce

0mkxixxg  posted on 2021-06-03 in Hadoop

I have a log file; a snapshot is shown below:

<Dec 12, 2013 2:46:24 AM CST> <Error> <java.rmi.RemoteException>
<Dec 13, 2013 2:46:24 AM CST> <Error> <Io exception>
<Dec 14, 2013 2:46:24 AM CST> <Error> <garbage data
garbage data
garbade data
Io exception
>
<jan 01, 2014 2:46:24 AM CST> <Error> <garbage data
garbage data java.rmi.RemoteException
>

I am trying to run some analysis on it.
What I want is the number of exceptions per year.

For example, from the sample data above my output should be:

    java.rmi.RemoteException 2013 1
    Io exception             2013 2
    java.rmi.RemoteException 2014 1

My problems are:

1. Hadoop processes a text file line by line, so it treats Io exception as
 part of line 6, whereas it should be part of line 3 (which continues through line 7).

2. I can't use NLineInputFormat because the records do not span a fixed number of lines.

What the pattern is, and what I want:

The only pattern I see is that a record starts with a "<" and ends with a ">". In the
above example line 3 doesn't end with ">", so I want Hadoop to treat all the data
that follows as part of the same record until it reaches a closing ">".

The sample data as I want Hadoop to see it:

<Dec 12, 2013 2:46:24 AM CST> <Error> <java.rmi.RemoteException>
<Dec 13, 2013 2:46:24 AM CST> <Error> <Io exception>
<Dec 14, 2013 2:46:24 AM CST> <Error> <garbage data garbage data garbade data Io exception>
<jan 01, 2014 2:46:24 AM CST> <Error> <garbage data garbage data java.rmi.RemoteException>
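Once records have this one-line shape, the per-year count is a small amount of string matching. The following is a minimal plain-Java sketch of that step, not a Hadoop job: the class name `ExceptionYearCounter` and the `KNOWN` list of exception names are assumptions made for illustration, and the timestamp is assumed to always look like the ones in the sample. In an actual MapReduce job the mapper would emit the (exception, year) key with a value of 1 and the reducer would sum the values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Hadoop.
class ExceptionYearCounter {

    // Assumption: the exception names of interest are known in advance.
    static final List<String> KNOWN =
            Arrays.asList("java.rmi.RemoteException", "Io exception");

    // Captures the year from a leading timestamp such as "<Dec 12, 2013 2:46:24 AM CST>".
    static final Pattern YEAR = Pattern.compile("^<[A-Za-z]{3} \\d{1,2}, (\\d{4}) ");

    // Counts occurrences per "exception<TAB>year" key over already-joined records.
    static Map<String, Integer> count(List<String> records) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String record : records) {
            Matcher m = YEAR.matcher(record);
            if (!m.find()) {
                continue; // skip records without a recognizable timestamp
            }
            String year = m.group(1);
            for (String name : KNOWN) {
                if (record.contains(name)) {
                    // Equivalent to a mapper emitting (key, 1) and a reducer summing.
                    counts.merge(name + "\t" + year, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "<Dec 12, 2013 2:46:24 AM CST> <Error> <java.rmi.RemoteException>",
                "<Dec 13, 2013 2:46:24 AM CST> <Error> <Io exception>");
        count(records).forEach((key, n) -> System.out.println(key + "\t" + n));
    }
}
```

A record is counted once per exception name it contains, so a record mentioning both names would contribute to both keys.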

I would be glad if someone could share a piece of code or an idea to overcome this problem.
Thanks in advance :)


rxztt3cl1#

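You need to implement your own InputFormat and RecordReader. What you really need is a modification of StreamInputFormat, which exists in the Hadoop Streaming project.
For multi-line XML, we use Hadoop Streaming to read from a start tag to an end tag. You can check the source code and adapt it to your requirements.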
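The accumulation loop that such a custom RecordReader would run can be sketched in plain Java without any Hadoop dependency. This is only an illustration under stated assumptions: `LogRecordJoiner` is a hypothetical name, a record is taken to be complete as soon as a physical line ends with ">", and input-split boundaries (which a real RecordReader has to handle) are ignored.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop or Hadoop Streaming.
class LogRecordJoiner {

    // Joins physical lines into logical records; a record ends at a line ending with '>'.
    static List<String> join(BufferedReader in) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines
            }
            // Separate continuation lines with a space, except a lone closing '>'.
            if (current.length() > 0 && !trimmed.equals(">")) {
                current.append(' ');
            }
            current.append(trimmed);
            if (trimmed.endsWith(">")) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // unterminated trailing record, kept as-is
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<Dec 14, 2013 2:46:24 AM CST> <Error> <garbage data\n"
                + "garbage data\n"
                + "Io exception\n"
                + ">\n";
        for (String record : join(new BufferedReader(new StringReader(sample)))) {
            System.out.println(record);
        }
    }
}
```

One limitation of this rule: a continuation line that happens to end with ">" would close the record early, so a stricter check (for example, requiring the next line to start with "<") may be needed for real logs.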
