How do I use Python to parse a badly formatted JSON string in Spark 1.6 that contains whitespace, extra double quotes, and backslashes?

ix0qys7i  posted on 2021-05-27 in Hadoop

This question already has answers here:

How to query JSON data column using Spark DataFrames (5 answers)
Closed last year.
Using PySpark 1.6 on a Cloudera Hadoop distribution, I am trying to parse, read, and load a given JSON file from an HDFS file location, but I keep running into trouble. Here is what the JSON string looks like after running a "-cat" on the file in Hadoop:

{"json_data":"{\"table\":\"TEST.FUBAR\",\"op_type\":\"I\",\"op_ts\":\"2019-03-14 15:33:50.031848\",\"current_ts\":\"2019-03-14T15:33:57.479002\",\"pos\":\"1111\",\"after\":{\"COL1\":949494949494949494,\"COL2\":99,\"COL3\":2,\"COL4\":\"            99999\",\"COL5\":9999999,\"COL6\":90,\"COL7\":42478,\"COL8\":\"I\",\"COL9\":null,\"COL10\":\"2019-03-14 15:33:49\",\"COL11\":null,\"COL12\":null,\"COL13\":null,\"COL14\":\"x222263 \",\"COL15\":\"2019-03-14 15:33:49\",\"COL16\":\"x222263 \",\"COL17\":\"2019-03-14 15:33:49\",\"COL18\":\"2020-09-10 00:00:00\",\"COL19\":\"A\",\"COL20\":\"A\",\"COL21\":0,\"COL22\":null,\"COL23\":\"2019-03-14 15:33:47\",\"COL24\":2,\"COL25\":2,\"COL26\":\"R\",\"COL27\":\"2019-03-14 15:33:49\",\"COL28\":\"  \",\"COL29\":\"PBU67H   \",\"COL30\":\"            20000\",\"COL31\":2,\"COL32\":null}}"}

I then tried to create a DataFrame from the JSON file using:

df = sqlContext.read.json("test_data.json")

I then run my PySpark script to insert the DataFrame into a temp Hive table in Parquet format; selecting from that table looks like this (a rough sketch of the insert step follows the output below):

+------------------+
|         json_data|
+------------------+
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
+------------------+
only showing top 20 rows
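The insert step itself isn't shown in the post; a minimal sketch of what it might look like in PySpark 1.6 (the table name tmp_json_stage is made up for illustration) is:

# Hypothetical sketch of the "insert into a temp Hive table as Parquet" step;
# tmp_json_stage is an assumed table name, not from the original post.
df = sqlContext.read.json("test_data.json")

# Write the single-column DataFrame out as a Parquet-backed Hive table.
df.write.format("parquet").mode("overwrite").saveAsTable("tmp_json_stage")

# Selecting it back shows json_data as one long string per row.
sqlContext.sql("SELECT json_data FROM tmp_json_stage").show()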

I need all of the JSON attributes and values laid out in row-and-column format, but it comes through as one long string. Because my work goes through a proxy with strict firewall settings, I currently can't use the hive-json-serde JAR to parse it as JSON (which would really solve a whole set of my problems). I tried posexplode and LATERAL VIEW explode, but those target struct types rather than ArrayBuffer types.
Without installing any third-party software, changing the Spark SQL execution engine, or touching any other admin settings on the cluster (since I'm just an ordinary user with no admin rights), is there any way on Cloudera 5.x to parse this string as JSON, or to use PySpark to insert the values into a denormalized table with proper rows and columns?
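The duplicate linked at the top boils down to Spark's built-in get_json_object / json_tuple functions, which ship with Spark 1.6 and need no extra jars; a rough, untested sketch of that approach against the json_data string column (JSON paths guessed from the sample record above, only a few columns shown) would be:

from pyspark.sql.functions import get_json_object

# Rough sketch only: pull individual fields out of the json_data string column
# using Spark 1.6's built-in get_json_object. The JSON paths are guessed from
# the sample record above.
df = sqlContext.read.json("test_data.json")
parsed = df.select(
    get_json_object(df.json_data, "$.table").alias("src_table"),
    get_json_object(df.json_data, "$.op_type").alias("op_type"),
    get_json_object(df.json_data, "$.after.COL1").alias("COL1"),
    get_json_object(df.json_data, "$.after.COL29").alias("COL29"),
)
parsed.show()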


wlwcrazw1#

The "json_data" content is actually a string, not JSON with built-in schema structures such as arrays, maps, and structs. My problem was the extra double quotes (") inside the actual body of "json_data", which Spark trips over when it tries to read it. Sample:

{"json_data":"{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":"            99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":"  ","COL29":"PBU67H   ","COL30":"            20000","COL31":2,"COL32":null}}"}

Here is the sample I used after stripping out those double quotes:

{"json_data":{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":"            99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":"  ","COL29":"PBU67H   ","COL30":"            20000","COL31":2,"COL32":null}}}

I'll probably need a regex or some kind of utility function to strip the double quotes wrapping the data (a rough sketch of one way to do that follows the session transcript below). But after modifying the file and re-running pyspark, I got:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkContext available as sc, HiveContext available as sqlContext.
>>> filePath = "/user/no_quote_json.json"
>>> df = sqlContext.read.json(filePath)
>>> df.printSchema()
root
 |-- json_data: struct (nullable = true)
 |    |-- after: struct (nullable = true)
 |    |    |-- COL1: long (nullable = true)
 |    |    |-- COL10: string (nullable = true)
 |    |    |-- COL11: string (nullable = true)
 |    |    |-- COL12: string (nullable = true)
 |    |    |-- COL13: string (nullable = true)
 |    |    |-- COL14: string (nullable = true)
 |    |    |-- COL15: string (nullable = true)
 |    |    |-- COL16: string (nullable = true)
 |    |    |-- COL17: string (nullable = true)
 |    |    |-- COL18: string (nullable = true)
 |    |    |-- COL19: string (nullable = true)
 |    |    |-- COL2: long (nullable = true)
 |    |    |-- COL20: string (nullable = true)
 |    |    |-- COL21: long (nullable = true)
 |    |    |-- COL22: string (nullable = true)
 |    |    |-- COL23: string (nullable = true)
 |    |    |-- COL24: long (nullable = true)
 |    |    |-- COL25: long (nullable = true)
 |    |    |-- COL26: string (nullable = true)
 |    |    |-- COL27: string (nullable = true)
 |    |    |-- COL28: string (nullable = true)
 |    |    |-- COL29: string (nullable = true)
 |    |    |-- COL3: long (nullable = true)
 |    |    |-- COL30: string (nullable = true)
 |    |    |-- COL31: long (nullable = true)
 |    |    |-- COL32: string (nullable = true)
 |    |    |-- COL4: string (nullable = true)
 |    |    |-- COL5: long (nullable = true)
 |    |    |-- COL6: long (nullable = true)
 |    |    |-- COL7: long (nullable = true)
 |    |    |-- COL8: string (nullable = true)
 |    |    |-- COL9: string (nullable = true)
 |    |-- current_ts: string (nullable = true)
 |    |-- op_ts: string (nullable = true)
 |    |-- op_type: string (nullable = true)
 |    |-- pos: string (nullable = true)
 |    |-- table: string (nullable = true)

>>> df.select("json_data.after.col29").show()
+---------+
|    col29|
+---------+
|PBU67H   |
+---------+
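As mentioned above, rather than editing the file by hand, the wrapping quotes can be stripped programmatically. A rough sketch (the HDFS path is hypothetical and this is untested on the cluster): read the raw file as text, let Python's json module unwrap the escaped inner string, and hand the clean JSON lines back to sqlContext.read.json, which in Spark 1.6 also accepts an RDD of strings.

import json

# Rough sketch of the quote-stripping utility mentioned earlier; the HDFS path
# is hypothetical. The outer document is valid JSON whose "json_data" value is
# an escaped string, so json.loads unwraps it into a clean inner JSON document.
raw = sc.textFile("/user/test_data.json")
inner_json = raw.map(lambda line: json.loads(line)["json_data"])

# read.json also accepts an RDD of JSON strings in Spark 1.6. Note that this
# drops the json_data wrapper, so fields are selected directly off the root.
df = sqlContext.read.json(inner_json)
df.printSchema()
df.select("after.COL29").show()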

Once the data is in this nested struct format, the rest of my code simply unwraps the struct object and handles all the remaining logic for inserting into the flattened Hive table. Hopefully this helps anyone who runs into a similar problem later on.
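The unwrapping code isn't shown in the answer; a minimal sketch of flattening the struct into plain columns and persisting it (only a few columns listed, and the table name fubar_flat is made up) could look like:

# Rough sketch of flattening the nested struct into ordinary columns.
# df here is the DataFrame read from /user/no_quote_json.json in the session
# above; fubar_flat is an assumed Hive table name, not from the original post.
flat = df.select(
    "json_data.table",
    "json_data.op_type",
    "json_data.op_ts",
    "json_data.after.COL1",
    "json_data.after.COL29",
    "json_data.after.COL30",
)
flat.show()

# Persist the flattened result as a Parquet-backed Hive table.
flat.write.format("parquet").mode("overwrite").saveAsTable("fubar_flat")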
