Computing column percentages of a pivot table in PySpark

qc6wkl3g asked on 2022-12-22 in Spark

Let's create a PySpark DataFrame from the following CSV:

desc,B1,B2,B3,B4,B5,B6,B7,B8,B9,B10,B11
Other,111957.0,35293.0,225852.0,35110.0,1023680.0,448736.0,256473.0,269856.0,306668.0,8807.0,89551.0
Down,575614.0,203186.0,0.0,125056.0,0.0,766086.0,1157311.0,11127.0,88741.0,31603.0,300733.0
Up,0.0,0.0,1953645.0,0.0,346423.0,0.0,0.0,0.0,0.0,0.0,0.0
Same,2948065.0,730113.0,33121.0,668868.0,5451224.0,4485121.0,30780025.0,1977361.0,5295598.0,217697.0,1790024.0
Old,186596.0,88257.0,0.0,36842.0,2173626.0,240619.0,0.0,2770.0,2212560.0,9865.0,121045.0
New,0.0,0.0,0.0,0.0,3148.0,0.0,97252.0,0.0,0.0,0.0,0.0

It was created by pivoting a DataFrame:

y = x.groupby('desc').pivot('prev_segment').sum('cust_count')

Now I need to convert the values in each B* column into percentages. A cell's percentage is obtained by summing all the values in its column and dividing the cell by that sum, so each column totals 100%.
I would appreciate a simple way to accomplish this, ideally as part of the agg function in the pivot itself. That is, instead of sum('cust_count'), perhaps there is another aggregation that directly yields the percentage DataFrame.
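The per-column computation described above can be sketched in plain Python, using the B1 column of the CSV as sample data:

```python
# Column-percentage math: divide each cell by its column's total,
# multiply by 100, and round to 0 decimal places.
# Values below are the B1 column of the CSV above.
b1 = [111957.0, 575614.0, 0.0, 2948065.0, 186596.0, 0.0]
total = sum(b1)
pct = [round(v / total * 100) for v in b1]
print(pct)  # the unrounded percentages sum to exactly 100
```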


olmpazwi1#

Use a window function: divide each value by the sum of its column over the whole DataFrame, multiply by 100, and round to 0 decimal places. colRegex helps you select only the columns whose names start with B.

from pyspark.sql.functions import col, round, sum
from pyspark.sql.window import Window

df.select(
    "desc",
    *[round(col(x) / sum(x).over(Window.partitionBy()) * 100, 0).astype('integer').alias(x)
      for x in df.select(df.colRegex("`^B.*`")).columns]
).show()

+-----+---+---+---+---+---+---+---+---+---+---+---+
| desc| B1| B2| B3| B4| B5| B6| B7| B8| B9|B10|B11|
+-----+---+---+---+---+---+---+---+---+---+---+---+
|Other|  3|  3| 10|  4| 11|  8|  1| 12|  4|  3|  4|
| Down| 15| 19|  0| 14|  0| 13|  4|  0|  1| 12| 13|
|   Up|  0|  0| 88|  0|  4|  0|  0|  0|  0|  0|  0|
| Same| 77| 69|  1| 77| 61| 75| 95| 87| 67| 81| 78|
|  Old|  5|  8|  0|  4| 24|  4|  0|  0| 28|  4|  5|
|  New|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
+-----+---+---+---+---+---+---+---+---+---+---+---+
