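I am trying to read multiple JSON files returned by API calls. Every file contains a field that can be either a string or an array.
file1.json (when myfield has a single value, the API returns it without brackets):
{"id":"1","myfield":"AAAA"}
file2.json (when myfield has multiple values, the API returns them in brackets):
{"id":"2","myfield":["BBBB","CCCC"]}
If I use the first JSON file to create the schema, I cannot explode myfield because it is a string: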
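from pyspark.sql.types import StructType

# Infer the schema from the file where myfield is a plain string
sh = spark.read.option("multiline","true").json("./test/1.json")
new_schema = StructType.fromJson(sh.schema.jsonValue())
# Apply that schema to every file in the directory
sh1 = spark.read.option("multiline","true").schema(new_schema).json("./test")
sh1.printSchema()
sh1.createOrReplaceTempView("sh1")
spark.sql("select id, myfield from sh1").show()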
root
 |-- id: string (nullable = true)
 |-- myfield: string (nullable = true)

+---+---------------+
| id|        myfield|
+---+---------------+
|  2|["BBBB","CCCC"]|
|  1|           AAAA|
+---+---------------+
If I use the second JSON file to create the schema, I can explode myfield, but I lose the record where myfield is a string:
# Infer the schema from the file where myfield is an array
sh = spark.read.option("multiline","true").json("./test/2.json")
new_schema = StructType.fromJson(sh.schema.jsonValue())
# Apply that schema to every file in the directory
sh2 = spark.read.option("multiline","true").schema(new_schema).json("./test")
sh2.printSchema()
sh2.createOrReplaceTempView("sh2")
sh2.show()
spark.sql("select id, explode(myfield) as myvalues from sh2").show()
root
 |-- id: string (nullable = true)
 |-- myfield: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+------------+
|  id|     myfield|
+----+------------+
|   2|[BBBB, CCCC]|
|null|        null|
+----+------------+

+---+--------+
| id|myvalues|
+---+--------+
|  2|    BBBB|
|  2|    CCCC|
+---+--------+
Is there a way to read the files and retrieve the values like this?
+---+--------+
| id|myvalues|
+---+--------+
|  1|    AAAA|
|  2|    BBBB|
|  2|    CCCC|
+---+--------+
Thanks for your help.
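One possible workaround, sketched here in plain Python rather than with the Spark API, is to normalize each record before it reaches Spark so that myfield is always a list. The helper names normalize_record and explode_records are hypothetical and exist only for this illustration; the two sample records are the ones from the question.

```python
import json


def normalize_record(record, field="myfield"):
    """Return a copy of the record in which `field` is always a list.

    Hypothetical helper: a scalar is wrapped in a one-element list,
    a list is kept as-is, and a missing or null value becomes [].
    """
    value = record.get(field)
    if value is None:
        values = []
    elif isinstance(value, list):
        values = value
    else:
        values = [value]
    return {**record, field: values}


def explode_records(records, field="myfield"):
    """Return one (id, value) row per element, similar to explode()."""
    rows = []
    for record in records:
        normalized = normalize_record(record, field)
        for value in normalized[field]:
            rows.append((normalized["id"], value))
    return rows


# The two payloads shown in the question, one per file.
raw_files = [
    '{"id":"1","myfield":"AAAA"}',
    '{"id":"2","myfield":["BBBB","CCCC"]}',
]
records = [json.loads(text) for text in raw_files]
rows = explode_records(records)
for row in rows:
    print(row)
```

If the normalized records are written back out as JSON, every file carries an array in myfield, so the array schema from the second attempt should apply to all of them and explode(myfield) should no longer drop id 1. The same idea may also be expressible inside Spark: read myfield as a string, as in the first attempt, then convert it with from_json when the text starts with "[" and with array(myfield) otherwise.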