This question already has answers here:
Pandas: filling missing values by mean in each group (12 answers)
Closed 18 hours ago.
I have the following values in a Spark DataFrame:
+---------+------+----+----------+-----------+-----------+--------------+
|store_nbr|metric|goal|time_frame|time_period|fiscal_year| channel|
+---------+------+----+----------+-----------+-----------+--------------+
| 1000| NPS| 53| Half Year| H1| 2023| store|
| 1001| NPS| 81| Half Year| H1| 2023|ecomm_combined|
| 1003| NPS| 65| Half Year| H1| 2023|ecomm_combined|
| 1004| NPS| 85| Half Year| H1| 2023| store|
| 1007| NPS| 53| Half Year| H1| 2023| store|
| 1008| NPS|null| Half Year| H1| 2023| store|
| 1009| NPS| 85| Half Year| H1| 2023| store|
| 1011| NPS| 72| Half Year| H1| 2023| store|
| 1012| NPS| 71| Half Year| H1| 2023|ecomm_combined|
| 1013| NPS| 52| Half Year| H1| 2023|ecomm_combined|
| 1014| NPS|null| Half Year| H1| 2023|ecomm_combined|
| 1016| NPS| 54| Half Year| H1| 2023|ecomm_combined|
| 1017| NPS| 69| Half Year| H1| 2023|ecomm_combined|
| 1018| NPS| 93| Half Year| H1| 2023|ecomm_combined|
| 1020| NPS| 93| Half Year| H1| 2023| store|
| 1022| NPS| 95| Half Year| H1| 2023| store|
| 1023| NPS| 86| Half Year| H1| 2023|ecomm_combined|
| 1025| NPS| 72| Half Year| H1| 2023|ecomm_combined|
| 1026| NPS| 70| Half Year| H1| 2023|ecomm_combined|
| 1027| NPS|null| Half Year| H1| 2023|ecomm_combined|
| 1028| NPS| 63| Half Year| H1| 2023|ecomm_combined|
| 1029| NPS| 66| Half Year| H1| 2023|ecomm_combined|
| 1030| NPS| 86| Half Year| H1| 2023|ecomm_combined|
| 1031| NPS| 61| Half Year| H1| 2023|ecomm_combined|
| 1032| NPS| 96| Half Year| H1| 2023|ecomm_combined|
| 1033| NPS| 91| Half Year| H1| 2023|ecomm_combined|
| 1034| NPS| 79| Half Year| H1| 2023|ecomm_combined|
| 1035| NPS| 53| Half Year| H1| 2023|ecomm_combined|
| 1036| NPS|null| Half Year| H1| 2023| store|
My DataFrame of computed averages looks like this:
goal = raw_df.groupBy('metric','time_frame', 'time_period','fiscal_year','channel').mean('goal')
+------+----------+-----------+-----------+--------------+-----------------+
|metric|time_frame|time_period|fiscal_year| channel| avg(goal)|
+------+----------+-----------+-----------+--------------+-----------------+
| null| null| null| null| null| null|
| NPS| Half Year| H1| 2023|ecomm_combined|75.24033149171271|
| NPS| Half Year| H1| 2023| store| 78.0|
+------+----------+-----------+-----------+--------------+-----------------+
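So I want to fill the null values in the goal column of raw_df with this computed average (the data type doesn't matter), grouped by the columns metric, time_frame, time_period, fiscal_year, and channel. How can I do this with a Spark or pandas DataFrame?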
1 Answer
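With pandas, use groupby.transform to compute the per-group mean and assign it to the null rows with boolean indexing, or alternatively pass the per-group mean to fillna.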
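The original code for this answer is not preserved here, so the following is only a minimal sketch of the two approaches it names, not the answerer's exact code. The sample rows are a hypothetical subset of the question's table, and the names group_cols, group_mean, filled_a, and filled_b are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: a few rows modeled on the question's table.
raw_df = pd.DataFrame({
    "store_nbr": [1000, 1001, 1003, 1004, 1008, 1014],
    "metric": ["NPS"] * 6,
    "goal": [53, 81, 65, 85, None, None],
    "time_frame": ["Half Year"] * 6,
    "time_period": ["H1"] * 6,
    "fiscal_year": [2023] * 6,
    "channel": ["store", "ecomm_combined", "ecomm_combined",
                "store", "store", "ecomm_combined"],
})

group_cols = ["metric", "time_frame", "time_period", "fiscal_year", "channel"]

# transform("mean") returns one value per row, aligned with raw_df's index;
# missing goals are ignored when the group mean is computed.
group_mean = raw_df.groupby(group_cols)["goal"].transform("mean")

# Option 1: boolean indexing, overwriting only the rows where goal is missing.
filled_a = raw_df.copy()
mask = filled_a["goal"].isna()
filled_a.loc[mask, "goal"] = group_mean[mask]

# Option 2: fillna with the aligned per-group mean.
filled_b = raw_df.assign(goal=raw_df["goal"].fillna(group_mean))

print(filled_b[["store_nbr", "channel", "goal"]])
```

Both options give the same result. In this sample the missing store goal becomes the mean of the store rows and the missing ecomm_combined goal becomes the mean of the ecomm_combined rows. If the data lives in Spark, it would first have to be brought into pandas (for example with toPandas(), assuming it fits in memory).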