Loading a JSON file with a SerDe in Cloudera

Posted by zfciruhq on 2021-06-04 in Hadoop


I'm trying to work with a JSON file that has the following structure:

{
   "user_id": "kim95",
   "type": "Book",
   "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.",
   "year": "1995",
   "publisher": "ACM Press and Addison-Wesley",
   "authors": [
      {
         "name": "null"
      }
   ],
   "source": "DBLP"
}
{
   "user_id": "marshallo79",
   "type": "Book",
   "title": "Inequalities: Theory of Majorization and Its Application.",
   "year": "1979",
   "publisher": "Academic Press",
   "authors": [
      {
         "name": "Albert W. Marshall" 
      },
      {
         "name": "Ingram Olkin"
      }
   ],
   "source": "DBLP"
}

I'm trying to load this JSON data into Hive using a SerDe. I followed both approaches I saw here: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
With this code:

CREATE EXTERNAL TABLE IF NOT EXISTS serd (
           user_id:string, 
           type:string, 
           title:string,
           year:string,
           publisher:string,
           authors:array<struct<name:string>>,
           source:string)       
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION '/user/hdfs/data/book-seded_workings-reduced.json';

I get this error:

Error while compiling statement: FAILED: ParseException line 2:17 cannot recognize input near ':' 'string' ',' in column type

I also tried this version: https://github.com/rcongiu/hive-json-serde
It gives a different error:

Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerde

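Any ideas?
I'd also like to know whether there is an alternative way to work with JSON like this so that I can query the "name" field inside "authors". Should I use Pig or Hive?
I have already converted the data to a "tsv" file. However, since my authors column is a tuple, I don't know how to query 'name' with Hive if I build a table from that file. Should I change my "tsv" conversion script or keep it? Or are there other alternatives with Hive or Pig?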

hadoop Hive hue cloudera-cdh apache-pig

Source: https://stackoverflow.com/questions/25149700/loading-json-file-with-serde-in-cloudera

2 Answers


hsvhsicv1#

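Hive has no built-in JSON support, so to use JSON in Hive we need a third-party JAR such as: https://github.com/rcongiu/hive-json-serde
Your create table statement has a couple of problems. Column names and types are separated by a space, not a colon (the colon is only used inside struct<...>). It should look like this: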

CREATE EXTERNAL TABLE IF NOT EXISTS serd (
  user_id string, type string, title string, year string, publisher string,
  authors array<struct<name:string>>, source string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION...

The JSON file you are using must keep each record on a single line, like this:

{"user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", "year": "1995", "publisher": "ACM Press and Addison-Wesley", "authors": [{"name":"null"}], "source": "DBLP"} 
{"user_id": "marshallo79", "type": "Book", "title": "Inequalities: Theory of Majorization and Its Application.", "year": "1979", "publisher": "Academic Press","authors": [{"name":"Albert W. Marshall"},{"name":"Ingram Olkin"}], "source": "DBLP"}
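If the source file is pretty-printed, as in the question, it needs to be rewritten with one record per line first. Here is a minimal sketch of one way to do that in Python, assuming the file is a sequence of concatenated top-level JSON objects and is small enough to read into memory. The function names `iter_json_objects` and `to_json_lines` are made up for this illustration and are not part of Hive or the SerDe.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so skip it by hand.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def to_json_lines(text):
    """Return the same records serialized with exactly one record per line."""
    return "\n".join(
        json.dumps(obj, ensure_ascii=False) for obj in iter_json_objects(text)
    )


if __name__ == "__main__":
    # Two pretty-printed records, shaped like the ones in the question.
    sample = """
    {
       "user_id": "kim95",
       "authors": [ { "name": "null" } ]
    }
    {
       "user_id": "marshallo79",
       "authors": [ { "name": "Albert W. Marshall" }, { "name": "Ingram Olkin" } ]
    }
    """
    print(to_json_lines(sample))
```

To convert a real file, read it with `open(path, encoding="utf-8").read()`, pass the text to `to_json_lines`, write the result to a new file, and upload that file to the table's HDFS location.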

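After downloading the project from git, you need to compile it, which produces the JAR. You then need to add this JAR to your Hive session before running the create table statement.
Hope this helps!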


o2rvlv0m2#

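ADD JAR only adds the JAR to the current session. It is not available outside that session, and the statement eventually fails with an error. Instead, copy the JAR to every Hive node and place it in both the Hive and MapReduce lib paths below, so that Hive and the MapReduce components pick it up whenever it is called.
/hadoop/cdh_5.2.0_linux_parcels/parcels/cdh-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/json-serde-1.3.6-jar-with-dependencies.jar
/hadoop/cdh_5.2.0_linux_parcels/parcels/cdh-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-mapreduce/lib/json-serde-1.3.6-jar-with-dependencies.jar
Note: this path varies from cluster to cluster.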
