Reading multiple JSON files with PySpark when a field's type changes

7cwmlq89 posted on 2021-05-27 in Spark

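I am trying to read multiple JSON files returned by API calls. Every file contains a field that can be either a string or an array.
file1.json (when myfield has only one value, the API returns it without brackets):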

{"id":"1","myfield":"AAAA"}

file2.json (when myfield has several values, the API returns them in brackets):

{"id":"2","myfield":["BBBB","CCCC"]}

If I use the first JSON file to create the schema, I cannot explode myfield because it is a string:

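from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
spark = SparkSession.builder.getOrCreate()
# Infer the schema from the first file, where myfield is a plain string.
sh = spark.read.option("multiline", "true").json("./test/1.json")
new_schema = StructType.fromJson(sh.schema.jsonValue())
# Apply that schema to every file in the directory.
sh1 = spark.read.option("multiline", "true").schema(new_schema).json("./test")
sh1.printSchema()
sh1.createOrReplaceTempView("sh1")
spark.sql("select id, myfield from sh1").show()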
root
 |-- id: string (nullable = true)
 |-- myfield: string (nullable = true)

+---+---------------+
| id|        myfield|
+---+---------------+
|  2|["BBBB","CCCC"]|
|  1|           AAAA|
+---+---------------+

If I use the second JSON file to create the schema, I can explode myfield, but I lose the row where it is a string:

# Infer the schema from the second file, where myfield is an array.
sh = spark.read.option("multiline", "true").json("./test/2.json")
new_schema = StructType.fromJson(sh.schema.jsonValue())
sh2 = spark.read.option("multiline", "true").schema(new_schema).json("./test")
sh2.printSchema()
sh2.createOrReplaceTempView("sh2")
sh2.show()
# The record from 1.json does not match the array schema and comes back as nulls.
spark.sql("select id, explode(myfield) as myvalues from sh2").show()
root
 |-- id: string (nullable = true)
 |-- myfield: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+------------+
|  id|     myfield|
+----+------------+
|   2|[BBBB, CCCC]|
|null|        null|
+----+------------+

+---+--------+
| id|myvalues|
+---+--------+
|  2|    BBBB|
|  2|    CCCC|
+---+--------+

Is there a way to read these files and get the following values?

+---+--------+
| id|myvalues|
+---+--------+
|  1|    AAAA|
|  2|    BBBB|
|  2|    CCCC|
+---+--------+

Thanks for your help.
