I am reading a JSON file in PySpark in which the name of the struct at the third nesting level is different for each row.
The schema looks something like this:
|-- A: string (nullable = true)
|-- Plugins: struct (nullable = true)
| |-- RfS: struct (nullable = true)
| | |-- A: string (nullable = true)
| | |-- B: string (nullable = true)
| |-- RtW: struct (nullable = true)
| | |-- A: string (nullable = true)
| | |-- B: string (nullable = true)
I want to convert it to a dataframe with the following schema:
|-- A: string (nullable = true)
|-- Plugins: struct (nullable = true)
|-- A: string (nullable = true)
|-- B: string (nullable = true)
Plugins will contain the name of the struct (RfS, RtW, etc.).
I read the data and remove the first level of nesting:
jsonData = """{
"A" : "some A",
"Plugins": {
"RfS": {
"A" : "RfSA",
"B" : "RfSB"
},
"RtW" : {
"A" : "RtWA",
"B" : "RtWA"
}
}
}"""
df = spark.read.json(sc.parallelize([jsonData]))
no_plug_cols = ["A"] # cols not in Plugins i.e A
plug_df = df.select("A", "Plugins.*")
# plug_df.printSchema()
# root
# |-- A: string (nullable = true)
# |-- RfS: struct (nullable = true)
# | |-- A: string (nullable = true)
# | |-- B: string (nullable = true)
# |-- RtW: struct (nullable = true)
# | |-- A: string (nullable = true)
# | |-- B: string (nullable = true)
Based on one of the answers here, I found that it is easy to get the output below:
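from itertools import chain
from pyspark.sql.functions import col

# iterate over plug_df.columns (not a set) so the column order stays deterministic
icols = [(col(f"{c}.A").alias(f"{c}.A"), col(f"{c}.B").alias(f"{c}.B")) for c in plug_df.columns if c not in no_plug_cols]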
# we use chain to flatten icols which is a list of tuples
plug_df.select(no_plug_cols + list(chain(*icols))).show()
# +------+-----+-----+-----+-----+
# | A|RfS.A|RfS.B|RtW.A|RtW.B|
# +------+-----+-----+-----+-----+
# |some A| RfSA| RfSB| RtWA| RtWA|
# +------+-----+-----+-----+-----+
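Instead of the output above, is there a way to get RfS/RtW as column values under the desired column name, so that the output looks like the following? Or is this only possible by first converting to the format above and then pivoting the data?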
# +------+-------+-----+-----+
# | A|Plugins| A| B|
# +------+-------+-----+-----+
# |some A| RfS | RfSA| RfSB|
# +------+-------+-----+-----+
# |some A| RtW | RtWA| RtWA|
# +------+-------+-----+-----+
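One possible way to get this shape without a pivot is Spark SQL's stack() generator, which emits one output row per plugin struct. The following is only a sketch under assumptions: the helper names build_stack_expr and unpivot_plugins are hypothetical, every plugin struct is assumed to have the same string fields A and B, plugin names are assumed not to contain quotes or backticks, and the unpacked fields are named Plugin_A and Plugin_B rather than A and B to avoid clashing with the top-level A column.

```python
def build_stack_expr(plugin_cols, fields=("A", "B"), name_col="Plugins", prefix="Plugin_"):
    """Build a Spark SQL stack() expression that yields one row per plugin struct.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if not plugin_cols:
        raise ValueError("plugin_cols must not be empty")
    rows = []
    for c in plugin_cols:
        # Each output row is the struct's name as a string literal, then its fields.
        values = ", ".join(f"`{c}`.`{f}`" for f in fields)
        rows.append(f"'{c}', {values}")
    out_cols = ", ".join([name_col] + [f"{prefix}{f}" for f in fields])
    return f"stack({len(plugin_cols)}, {', '.join(rows)}) as ({out_cols})"


def unpivot_plugins(plug_df, no_plug_cols, fields=("A", "B")):
    """Turn the struct columns of plug_df (RfS, RtW, ...) into rows.

    plug_df is assumed to be the dataframe produced by
    df.select("A", "Plugins.*") in the question.
    """
    plugin_cols = [c for c in plug_df.columns if c not in no_plug_cols]
    keep = [f"`{c}`" for c in no_plug_cols]
    # selectExpr accepts a single generator expression next to ordinary columns.
    return plug_df.selectExpr(*keep, build_stack_expr(plugin_cols, fields))
```

With the dataframes from the question this would be called as unpivot_plugins(plug_df, no_plug_cols).show(). Because the struct names differ per row, a plugin that is absent from a given row should come out with null fields; if those rows are unwanted, a filter such as .where("Plugin_A IS NOT NULL OR Plugin_B IS NOT NULL") could be appended.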