I'm using the CDH5 QuickStart VM, and I have a file like this (excerpt):
{"user_id": "kim95",
"type": "Book",
"title": "Modern Database Systems: The Object Model, Interoperability, and
Beyond.",
"year": "1995",
"publisher": "ACM Press and Addison-Wesley",
"authors": {},
"source": "DBLP"
}
{"user_id": "marshallo79",
"type": "Book",
"title": "Inequalities: Theory of Majorization and Its Application.",
"year": "1979",
"publisher": "Academic Press",
"authors": {("Albert W. Marshall"), ("Ingram Olkin")},
"source": "DBLP"
}
I used this script:
books = load 'data/book-seded.json'
    using JsonLoader('t1:tuple(user_id:chararray,type:chararray,title:chararray,year:chararray,publisher:chararray,source:chararray,authors:bag{T:tuple(author:chararray)})');
STORE books INTO 'book-no-seded.tsv';
The script runs without errors, but the output file is empty. Any idea why?
3 Answers
n6lpvg4x1#
Try storing the relation with JsonStorage:
STORE books INTO 'book-no-seded.tsv' USING org.apache.pig.piggybank.storage.JsonStorage();
62o28rlo2#
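You need to make sure the load schema is correct. You can run
DUMP books;
as a quick check. When we used Pig's JsonLoader in this tutorial, we had to be careful with both the input data and the schema: http://gethue.com/hadoop-tutorials-ii-1-prepare-the-data-for-analysis/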
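To check the input data side before touching the schema, something like the following could help. It is only a sketch in Python, run outside Pig, and it assumes that JsonLoader reads one complete JSON object per line (so a record spread over several lines, as in the question's excerpt, would not load). The names find_bad_lines and sample are made up for this illustration.

```python
import json


def find_bad_lines(lines):
    """Return (line_number, reason) for each non-empty line that is not a standalone JSON object."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        try:
            value = json.loads(text)
        except json.JSONDecodeError as exc:
            # The line is not valid JSON on its own (for example, half of a record).
            problems.append((number, exc.msg))
            continue
        if not isinstance(value, dict):
            problems.append((number, "not a JSON object"))
    return problems


# Hypothetical sample: the first record is split across two lines,
# the second record sits on a single line.
sample = [
    '{"user_id": "kim95",',
    '"type": "Book"}',
    '{"user_id": "marshallo79", "type": "Book", "source": "DBLP"}',
]

for number, reason in find_bad_lines(sample):
    print(f"line {number}: {reason}")
```

For a real file you could pass an open file handle instead of the sample list, e.g. find_bad_lines(open(path, encoding="utf-8")), where path is wherever your copy of the data lives. An empty result only says each line parses as a JSON object; it does not prove the fields match the Pig schema.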
5kgi1eie3#
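In the end, only this schema worked: if I add or remove even a single space compared with this layout, I get an error. (I also added a "name" for the tuples, wrote "null" where the tuple is empty, and swapped the order of authors and source, but even without those changes it was still failing.)
The working script is this: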