How to convert a 1-row, 4-column DataFrame into a 4-row, 2-column DataFrame in PySpark or SQL

k97glaaz · published 2022-11-25 in Apache

I have a DataFrame that returns output as

+--------+-----+--------+------+
|Machines|Books|Vehicles|Plants|
+--------+-----+--------+------+
|      78|   20|      19|    90|
+--------+-----+--------+------+

and I want to convert it into

+--------+-----+
|  Fields|Count|
+--------+-----+
|Machines|   78|
|   Books|   20|
|Vehicles|   19|
|  Plants|   90|
+--------+-----+

Can anyone help with how to write PySpark code that produces this result dynamically? I have already tried UNPIVOT in SQL, with no luck.


t9eec4r0  1#

from pyspark.sql import functions as F

df = spark.createDataFrame([
    (78, 20, 19, 90),
], ('Machines', 'Books', 'Vehicles', 'Plants'))

Create a new array column of structs that pairs each column name with its value, then explode the struct fields with the `inline` function. Code below:

df.withColumn('tab', F.array(*[
    F.struct(F.lit(x).alias('Fields'), F.col(x).alias('Count'))
    for x in df.columns
])).selectExpr('inline(tab)').show()

+--------+-----+
|  Fields|Count|
+--------+-----+
|Machines|   78|
|   Books|   20|
|Vehicles|   19|
|  Plants|   90|
+--------+-----+
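In plain-Python terms, the array-of-structs plus `inline` trick amounts to pairing each column name with its value and emitting one row per pair. A conceptual sketch (no Spark involved, data hard-coded for illustration):

```python
# Conceptual sketch of what F.array(F.struct(...)) + inline() does.
row = {'Machines': 78, 'Books': 20, 'Vehicles': 19, 'Plants': 90}

# The "array of structs": one (Fields, Count) pair per source column.
tab = [(name, count) for name, count in row.items()]

# inline() then explodes that array into one output row per struct.
for fields, count in tab:
    print(fields, count)
```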

qlckcl4x  2#

As described in the unpivot-dataframe tutorial, use:

df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")

Or, generalised:

cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)
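To sanity-check the generalised version, the generated `stack` expression can be compared against the hard-coded one; this is plain string manipulation, so no Spark session is needed:

```python
# Build the stack() expression exactly as above and inspect it.
columns = ['Machines', 'Books', 'Vehicles', 'Plants']
cols = [f'"{c}", {c}' for c in columns]
exprs = f"stack({len(cols)}, {', '.join(cols)}) as (Fields, Count)"
print(exprs)
# stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)
```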

Full example:

df = spark.createDataFrame(data=[[78, 20, 19, 90]], schema=['Machines', 'Books', 'Vehicles', 'Plants'])

# Hard coded
# df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")

# Generalised
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)

[Out]:
+--------+-----+
|Fields  |Count|
+--------+-----+
|Machines|78   |
|Books   |20   |
|Vehicles|19   |
|Plants  |90   |
+--------+-----+
