Using the following DataFrame:
from pyspark.sql.functions import avg, first
rdd = sc.parallelize(
    [
        (0, "A", 223, "201603", "PORT"),
        (0, "A", 22, "201602", "PORT"),
        (0, "A", 422, "201601", "DOCK"),
        (1, "B", 3213, "201602", "DOCK"),
        (1, "B", 3213, "201601", "PORT"),
        (2, "C", 2321, "201601", "DOCK"),
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id", "type", "cost", "date", "ship"])
df_data.show()
I perform a pivot:
df_data.groupby(df_data.id, df_data.type).pivot("date").agg(avg("cost"), first("ship")).show()
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
| id|type|201601_avg(cost)|201601_first(ship)()|201602_avg(cost)|201602_first(ship)()|201603_avg(cost)|201603_first(ship)()|
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
| 2| C| 2321.0| DOCK| null| null| null| null|
| 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT|
| 1| B| 3213.0| PORT| 3213.0| DOCK| null| null|
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
But I get these very complicated column names. Applying an alias on the aggregations usually works, but because of the pivot the names are even worse in this case:
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
| id|type|201601_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201601_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201602_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201602_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201603_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201603_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
| 2| C| 2321.0| DOCK| null| null| null| null|
| 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT|
| 1| B| 3213.0| PORT| 3213.0| DOCK| null| null|
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
Is there a way to rename the column names dynamically on the pivot and aggregation?
6 Answers

g0czyy6m1#
You can alias the aggregations directly:
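A minimal sketch of this on the df_data above (the names like 201601_cost are what Spark produces once each aggregate carries an alias):

from pyspark.sql.functions import avg, first

pivoted = df_data.groupby("id", "type").pivot("date").agg(
    avg("cost").alias("cost"),    # pivoted columns become 201601_cost, 201602_cost, ...
    first("ship").alias("ship"),  # pivoted columns become 201601_ship, 201602_ship, ...
)
pivoted.show()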
eiee3dmh2#
A simple regular expression should do the trick:
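A sketch of that approach (the pattern assumes names of the form prefix_func(col) with an optional trailing "()", as in the output above):

import re
from pyspark.sql.functions import avg, first

def clean_names(df):
    # "201601_avg(cost)" -> "201601_cost", "201601_first(ship)()" -> "201601_ship"
    p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)\)(?:\(\))?")
    return df.toDF(*[p.sub(r"\1_\3", c) for c in df.columns])

pivoted = df_data.groupby("id", "type").pivot("date").agg(avg("cost"), first("ship"))
clean_names(pivoted).printSchema()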
If you want to keep the function name as well, you can change the substitution pattern to, for example, \1_\2_\3.

dwbf0jvd3#
A simple way is to use an alias after the aggregation function.
The column names will take the form "original_column_name_aliased_column_name". In your case, original_column_name would be 201601, aliased_column_name would be avg_cost, and the resulting column name 201601_avg_cost (joined by an underscore "_").
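For example (a sketch; the alias names avg_cost and first_ship are my own choice):

from pyspark.sql.functions import avg, first

df_data.groupby("id", "type").pivot("date").agg(
    avg("cost").alias("avg_cost"),
    first("ship").alias("first_ship"),
).show()
# Columns: id, type, 201601_avg_cost, 201601_first_ship, 201602_avg_cost, ...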
6jjcrrmo4#
Wrote an easy and fast function to do this. Enjoy! :)
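The original snippet did not survive this copy; a minimal sketch of such a helper (the name rename_pivot_cols and the old-to-new mapping interface are assumptions) might be:

from pyspark.sql.functions import avg, first

def rename_pivot_cols(df, renames):
    # Rename columns one by one from an {old_name: new_name} mapping.
    for old_name, new_name in renames.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df

pivoted = df_data.groupby("id", "type").pivot("date").agg(avg("cost"), first("ship"))
cleaned = rename_pivot_cols(pivoted, {
    "201601_avg(cost)": "201601_cost",
    "201601_first(ship)()": "201601_ship",
})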
wlwcrazw5#
A version modified from zero323's answer, for Spark 2.4, where the column names now look like

0_first(is_flashsale, false): int
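The modified regex itself was not preserved here; a sketch that also swallows the ", false" argument Spark 2.4 prints in such names (the exact pattern is an assumption) could be:

import re

def clean_names(df):
    # "0_first(is_flashsale, false)" -> "0_is_flashsale"
    p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)(?:,\s*\w+)?\)(?:\(\))?")
    return df.toDF(*[p.sub(r"\1_\3", c) for c in df.columns])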
emeijp436#
You can change the column's values before the pivot, so that the prefix ends up in the generated column names:

.withColumn("ship", F.concat(F.lit("ship_"), "ship"))
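A fuller sketch of the idea applied to df_data (here I prefix the pivot column "date" rather than "ship", since "date" is what gets pivoted in the question):

from pyspark.sql import functions as F

df_prefixed = df_data.withColumn("date", F.concat(F.lit("date_"), "date"))
df_prefixed.groupby("id", "type").pivot("date").agg(
    F.avg("cost").alias("cost"),
    F.first("ship").alias("ship"),
).show()
# Columns: id, type, date_201601_cost, date_201601_ship, ...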