Exploding a nested struct in PySpark and using the struct name as a column

Posted by 5gfr0r5j on 2021-05-27 in Spark


I am reading JSON files in PySpark in which the nested structs at the third level have a different name in each row.

The schema looks something like this:

 |-- A: string (nullable = true)
 |-- Plugins: struct (nullable = true)
 |    |-- RfS: struct (nullable = true)
 |    |    |-- A: string (nullable = true)
 |    |    |-- B: string (nullable = true)
 |    |-- RtW: struct (nullable = true)
 |    |    |-- A: string (nullable = true)
 |    |    |-- B: string (nullable = true)

which I want to convert to a DataFrame with the following schema:

 |-- A: string (nullable = true)
 |-- Plugins: string (nullable = true)
 |-- A: string (nullable = true)
 |-- B: string (nullable = true)

Plugins will contain the name of the struct (RfS, RtW, etc.).

I read the data and remove the first level of nesting:

jsonData = """{
  "A" : "some A",
  "Plugins": {
    "RfS": {
      "A" : "RfSA",
      "B" : "RfSB"
    },
    "RtW" : {
      "A" : "RtWA",
      "B" : "RtWA"
    }
  }
}"""

df = spark.read.json(sc.parallelize([jsonData]))
no_plug_cols = ["A"] # cols not in Plugins i.e A
plug_df = df.select("A", "Plugins.*")

# plug_df.printSchema()
# root
# |-- A: string (nullable = true)
# |-- RfS: struct (nullable = true)
# |    |-- A: string (nullable = true)
# |    |-- B: string (nullable = true)
# |-- RtW: struct (nullable = true)
# |    |-- A: string (nullable = true)
# |    |-- B: string (nullable = true)
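Since the struct names differ from row to row, it may help to read them from the schema instead of hard-coding them. The following is a minimal sketch, assuming a DataFrame shaped like plug_df above; struct_columns is a hypothetical helper name, not something from the original post.

```python
def struct_columns(df):
    """Map each top-level struct column of df to the list of its field names."""
    return {
        field.name: field.dataType.fieldNames()
        for field in df.schema.fields
        # typeName() returns "struct" for StructType columns
        if field.dataType.typeName() == "struct"
    }
```

For plug_df this would be called as struct_columns(plug_df); plain string columns such as A are skipped because their type name is not "struct".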

Based on one of the answers here, I found it is simple to get the output below:

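from itertools import chain
from pyspark.sql.functions import col

# one (A, B) pair of columns per struct, aliased "<struct>.A" and "<struct>.B";
# iterating plug_df.columns (rather than a set) keeps the column order stable
icols = [(col(f"{c}.A").alias(f"{c}.A"), col(f"{c}.B").alias(f"{c}.B")) for c in plug_df.columns if c not in no_plug_cols]

# we use chain to flatten icols, which is a list of tuples
plug_df.select(no_plug_cols + list(chain(*icols))).show()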

# +------+-----+-----+-----+-----+
# |     A|RfS.A|RfS.B|RtW.A|RtW.B|
# +------+-----+-----+-----+-----+
# |some A| RfSA| RfSB| RtWA| RtWA|
# +------+-----+-----+-----+-----+

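Instead of the output above, is there a way to get RfS/RtW as column values under the desired column name, so that the output looks like the following? Or is that only possible by first converting to the format above and then pivoting the data?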


# +------+-------+-----+-----+
# |     A|Plugins|    A|    B|
# +------+-------+-----+-----+
# |some A|    RfS| RfSA| RfSB|
# |some A|    RtW| RtWA| RtWA|
# +------+-------+-----+-----+
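One possible approach that avoids a pivot is Spark SQL's stack generator, which emits one row per struct and can carry the struct name along as a string literal. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation: build_stack_expr and unpivot_plugins are hypothetical helper names, plug_df and no_plug_cols are the variables defined earlier, and every struct is assumed to have the same fields with matching types.

```python
def build_stack_expr(struct_cols, fields, name_col="Plugins"):
    """Build a Spark SQL stack() expression that yields one row per struct:
    (struct name, field values...)."""
    parts = []
    for c in struct_cols:
        # the struct name becomes a string literal (assumes no quotes in the name)
        parts.append(f"'{c}'")
        # followed by that struct's field values, backtick-quoted for safety
        parts.extend(f"`{c}`.`{f}`" for f in fields)
    aliases = ", ".join([name_col] + list(fields))
    return f"stack({len(struct_cols)}, {', '.join(parts)}) as ({aliases})"


def unpivot_plugins(plug_df, no_plug_cols, fields):
    """Keep no_plug_cols as-is and turn every other column into rows."""
    # sorted() keeps the row order deterministic
    struct_cols = sorted(c for c in plug_df.columns if c not in no_plug_cols)
    return plug_df.selectExpr(*no_plug_cols, build_stack_expr(struct_cols, fields))
```

With the sample data, unpivot_plugins(plug_df, no_plug_cols, ["A", "B"]).show() should give one row for RfS and one for RtW. A struct that is null in a given row would still produce a row with null field values, so an extra filter may be needed when the struct names differ per row.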

apache-spark pyspark

Source: https://stackoverflow.com/questions/63726711/exploding-nested-struct-in-pyspark-and-use-struct-name-as-column
