Flattening nested JSON into a Spark DataFrame

envsm3lx posted on 2021-05-24 in Spark

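I am trying to create a DataFrame from a nested JSON string and split it into multiple DataFrames: the outer element data goes into one DataFrame and the nested child data goes into another. There can be multiple nested elements. I looked at other posts, and none of them provides a working example for the scenario below. In this example the number of states is dynamic, and I want to store the country information and the state information in two separate HDFS folders. So the parent DataFrame contains a row like the one shown below.
val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""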

val countryDf = spark.read.json(Seq(jsonStr).toDS)

countryDf.show(false)

+---+-------+--------------+
|ISD|country|states        |
+---+-------+--------------+
|001|US     |[[NJ, NY, PA]]|
+---+-------+--------------+

countryDf.withColumn("states",explode($"states")).show(false)

val statesDf = countryDf.select(explode(countryDf("states").as("states")))
statesDf.show(false)
+------------+
|col         |
+------------+
|[NJ, NY, PA]|
+------------+

Expected output: two DataFrames

countryDf
+---+-------+
|ISD|country|
+---+-------+
|001|US     |
+---+-------+

statesDf 

+-------+------+------+------+
|country|state1|state2|state3|
+-------+------+------+------+
|US     |NJ    |NY    |PA    |
+-------+------+------+------+

I looked at other Stack Overflow questions about flattening nested JSON. None of them solves this particular problem.

Answer #1 (w46czmvw)

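Here is a piece of code that does the job. Keep performance in mind, especially if the number of columns is large. It collects all the map fields and adds them to the DataFrame as columns.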

import org.apache.spark.sql.functions.{col, explode, lit, map}
import org.apache.spark.sql.types.StructType
import spark.implicits._

val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""

val countryDf = spark.read.json(Seq(jsonStr).toDS)
countryDf.show(false)

// One row per element of the states array, keeping country as the parent key
val statesDf = countryDf.select($"country", explode($"states").as("states"))

// Struct schema of the exploded states column
val stateSchema = statesDf.schema("states").dataType.asInstanceOf[StructType]

// Alternating key/value columns for map(): the field name as a literal, then the field value
val columns = stateSchema.fields.toSeq.flatMap(field =>
  Seq(lit(field.name), col("states." + field.name)))

val s2 = statesDf
  .withColumn("statesMap", map(columns: _*))

// Collect the distinct map keys to the driver
val allMapKeys = s2.select(explode($"statesMap")).select($"key").distinct.collect().map(_.get(0).toString)

// Add one top-level column per key, then drop the helper map column
val s3 = allMapKeys.foldLeft(s2)((a, b) => a.withColumn(b, a("statesMap")(b)))
  .drop("statesMap")
s3.show(false)
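
The performance caveat in this answer comes mostly from building a map and collecting its keys to the driver. As a minimal sketch, assuming every element of `states` shares one struct schema (which is what `spark.read.json` infers for this input), the field names can be read from the schema instead, so nothing is collected. The local `SparkSession` setup and the names `exploded`, `stateFields`, and `flatStatesDf` are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.StructType

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("flatten-sketch").getOrCreate()
import spark.implicits._

val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
val countryDf = spark.read.json(Seq(jsonStr).toDS)

// One row per element of the `states` array, with `country` kept as the parent key.
val exploded = countryDf.select($"country", explode($"states").as("states"))

// Read the struct's field names from the schema; no data is collected to the driver.
val stateFields = exploded.schema("states").dataType.asInstanceOf[StructType].fieldNames

// Promote each struct field to a top-level column next to `country`.
val flatStatesDf = exploded.select(
  (col("country") +: stateFields.map(n => col(s"states.$n")).toSeq): _*)
flatStatesDf.show(false)
```

Because the column list is derived from the schema, this should adapt to however many state fields the input contains, without a hard-coded list.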
Answer #2 (fruv7luv)

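When nested JSON is read and converted into a Dataset, the nested part is stored as a struct type. So what you need to do is flatten the struct type in the DataFrame.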

import org.apache.spark.sql.functions.explode
import spark.implicits._

val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
val countryDf = spark.read.json(Seq(jsonStr).toDS)

countryDf.show(false)
+---+-------+--------------+
|ISD|country|states        |
+---+-------+--------------+
|001|US     |[[NJ, NY, PA]]|
+---+-------+--------------+

val countryDfExploded = countryDf.withColumn("states",explode($"states"))
countryDfExploded.show(false)
+---+-------+------------+
|ISD|country|states      |
+---+-------+------------+
|001|US     |[NJ, NY, PA]|
+---+-------+------------+

val countrySelectDf = countryDfExploded.select($"ISD", $"country")
countrySelectDf.show(false)
+---+-------+
|ISD|country|
+---+-------+
|001|US     |
+---+-------+

val statesDf = countryDfExploded.select( $"country",$"states.*")
statesDf.show(false)
+-------+------+------+------+
|country|state1|state2|state3|
+-------+------+------+------+
|US     |NJ    |NY    |PA    |
+-------+------+------+------+
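
The question also asks for the two results to land in separate HDFS folders. The following is a hedged sketch of that last step, not something from the original answer: the output format (Parquet), the temporary base directory, and the names `countryOut`, `statesOut`, `countryPath`, and `statesPath` are all assumptions made for illustration. On a cluster the two paths would be HDFS locations of your choosing.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("split-write-sketch").getOrCreate()
import spark.implicits._

val jsonStr = """{"country":"US","ISD":"001","states":[{"state1":"NJ","state2":"NY","state3":"PA"}]}"""
val countryDf = spark.read.json(Seq(jsonStr).toDS)

// Parent rows: top-level fields only, taken before exploding so each country appears once.
val countryOut = countryDf.select($"ISD", $"country")

// Child rows: one per array element, struct fields expanded, parent key kept for joining back.
val statesOut = countryDf
  .select($"country", explode($"states").as("states"))
  .select($"country", $"states.*")

// Hypothetical output locations; a temp directory stands in for two HDFS folders.
val baseDir = Files.createTempDirectory("flatten-sketch").toString
val countryPath = s"$baseDir/country"
val statesPath = s"$baseDir/states"

countryOut.write.mode("overwrite").parquet(countryPath)
statesOut.write.mode("overwrite").parquet(statesPath)
```

Selecting the parent columns from the un-exploded DataFrame is a deliberate choice here: if `states` ever holds several elements, exploding first would repeat the country row once per element.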
