Reading an Excel file in Hadoop MapReduce

Posted by ldioqlga on 2021-06-03 in Hadoop


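I am trying to read an Excel file containing some data so that I can aggregate it in Hadoop. The MapReduce program seems to run fine, but the output is in an unreadable format. Do I need a special InputFormat reader for Excel files in Hadoop MapReduce? My configuration is shown below.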

Configuration conf = getConf();
// Job.getInstance replaces the deprecated Job constructor
Job job = Job.getInstance(conf, "LatestWordCount");
job.setJarByClass(FlightDetailsCount.class);
Path input = new Path(args[0]);
Path output = new Path(args[1]);
FileInputFormat.setInputPaths(job, input);
FileOutputFormat.setOutputPath(job, output);
job.setMapperClass(MapClass.class);
job.setReducerClass(ReduceClass.class);
//job.setCombinerClass(ReduceClass.class);
// TextInputFormat treats the input as lines of plain text
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
//job.setOutputKeyClass(Text.class);
//job.setOutputValueClass(Text.class);
// Return the status from run() instead of calling System.exit before an unreachable return
return job.waitForCompletion(true) ? 0 : 1;

The output looks like this (unreadable, excerpt as posted):
�千瓦��o�一��]n��ε��r3级�\n“��p�饚6瓦�jj公司��9瓦�f级=��9毫升��博士�是的/ք��7�^�我��米ք�^新西兰��我��^�)��妗j�(��博士ͱ/7�ts公司��米/7�ts公司��&�jz公司��o��tsr公司�7�@�)�o��t型ӻ��5{%��+��ۆ�第六周-��=�e�_}米�)~��ʅ��ژ��: #�j�]��u��>

hadoop mapreduce bigdata

Source: https://stackoverflow.com/questions/15868631/reading-a-excel-file-in-hadoop-map-reduce

3 Answers


dz6r00yl1#

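I don't know whether anyone has actually developed a custom InputFormat for MS Excel files (I doubt it, and a quick search turned up nothing), but you certainly cannot read Excel files with TextInputFormat. .xls files are binary.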
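To check whether an input file really is a binary workbook rather than text, one option is to look at its first few bytes. The sketch below is only an illustration: the class `ExcelSniffer` and its method names are hypothetical, and it relies on the fact that legacy .xls files start with the OLE2 compound-document signature while .xlsx files are ZIP containers (so a ZIP match may also be some other ZIP-based file). It uses `InputStream.readNBytes`, which assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: guesses from the leading bytes whether a file is a binary workbook. */
final class ExcelSniffer {
    // OLE2 compound-document signature, used by legacy .xls workbooks.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local-file-header signature; .xlsx is a ZIP container.
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    /** Classifies the leading bytes as "xls", "zip" (possibly .xlsx), or "other". */
    static String classify(byte[] head) {
        if (startsWith(head, OLE2)) return "xls";
        if (startsWith(head, ZIP)) return "zip";
        return "other";
    }

    /** Reads the first 8 bytes of a local file and classifies them. */
    static String classify(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return classify(in.readNBytes(8));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

If `classify` returns "xls" or "zip" for the job's input, TextInputFormat will split raw binary data into "lines", which would explain unreadable output like that shown in the question.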
Solution: export the Excel files to CSV or TSV, then load them with TextInputFormat.
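Once the workbook is exported to CSV, each record that TextInputFormat hands to the mapper is one line of text. The following is a minimal sketch of the per-line logic in plain Java, with no Hadoop dependency so it can be run standalone. Everything specific in it is an assumption for illustration: the class name `CsvFlightCount`, the choice of the first column as the grouping key, and the header value `carrier`. The naive comma split also assumes no quoted fields containing commas.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical stand-in for the map and reduce steps, without Hadoop types. */
final class CsvFlightCount {
    // Assumed layout: the first column of the exported CSV is the grouping key.
    private static final int KEY_COLUMN = 0;

    /** "Map" step: extract the key from one CSV line, skipping blank lines and the header row. */
    static Optional<String> keyOf(String line) {
        if (line == null || line.trim().isEmpty()) return Optional.empty();
        String[] fields = line.split(",", -1); // naive split: does not handle quoted commas
        String key = fields[KEY_COLUMN].trim();
        // "carrier" is an assumed header name, not taken from the original data.
        if (key.isEmpty() || key.equalsIgnoreCase("carrier")) return Optional.empty();
        return Optional.of(key);
    }

    /** "Reduce" step: count how many lines share each key. */
    static Map<String, Integer> countByKey(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            keyOf(line).ifPresent(k -> counts.merge(k, 1, Integer::sum));
        }
        return counts;
    }
}
```

In a real job, the body of `keyOf` would sit inside the mapper's `map` method and the counting inside the reducer; the standalone form here only shows that plain text lines, unlike raw workbook bytes, can be split into fields directly.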


ulydmbyx2#

You can also use the HadoopOffice library, which lets you read and write Excel files with Hadoop and Spark. It is available on Maven Central and Spark Packages.
https://github.com/zuinnote/hadoopoffice/wiki


f0brbegy3#

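I know this is a bit late, but someone has since created an Excel InputFormat as a standard solution to this kind of problem. Read this: https://sreejithrpillai.wordpress.com/2014/11/06/excel-inputformat-for-hadoop-mapreduce/
There is a GitHub project with the codebase.
See here: https://github.com/sreejithpillai/excelrecordreadermapreduce/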

