pyspark: How can I fill a specific column with the mean computed over a 4-column grouping, using a Spark or Pandas DataFrame? [duplicate]

bttbmeg0, posted on 2023-02-03 in Spark
This question already has answers here:

Pandas: filling missing values by mean in each group (12 answers)
Closed 18 hours ago.
I have the following values in a Spark DataFrame:

+---------+------+----+----------+-----------+-----------+--------------+
|store_nbr|metric|goal|time_frame|time_period|fiscal_year|       channel|
+---------+------+----+----------+-----------+-----------+--------------+
|     1000|   NPS|  53| Half Year|         H1|       2023|         store|
|     1001|   NPS|  81| Half Year|         H1|       2023|ecomm_combined|
|     1003|   NPS|  65| Half Year|         H1|       2023|ecomm_combined|
|     1004|   NPS|  85| Half Year|         H1|       2023|         store|
|     1007|   NPS|  53| Half Year|         H1|       2023|         store|
|     1008|   NPS|null| Half Year|         H1|       2023|         store|
|     1009|   NPS|  85| Half Year|         H1|       2023|         store|
|     1011|   NPS|  72| Half Year|         H1|       2023|         store|
|     1012|   NPS|  71| Half Year|         H1|       2023|ecomm_combined|
|     1013|   NPS|  52| Half Year|         H1|       2023|ecomm_combined|
|     1014|   NPS|null| Half Year|         H1|       2023|ecomm_combined|
|     1016|   NPS|  54| Half Year|         H1|       2023|ecomm_combined|
|     1017|   NPS|  69| Half Year|         H1|       2023|ecomm_combined|
|     1018|   NPS|  93| Half Year|         H1|       2023|ecomm_combined|
|     1020|   NPS|  93| Half Year|         H1|       2023|         store|
|     1022|   NPS|  95| Half Year|         H1|       2023|         store|
|     1023|   NPS|  86| Half Year|         H1|       2023|ecomm_combined|
|     1025|   NPS|  72| Half Year|         H1|       2023|ecomm_combined|
|     1026|   NPS|  70| Half Year|         H1|       2023|ecomm_combined|
|     1027|   NPS|null| Half Year|         H1|       2023|ecomm_combined|
|     1028|   NPS|  63| Half Year|         H1|       2023|ecomm_combined|
|     1029|   NPS|  66| Half Year|         H1|       2023|ecomm_combined|
|     1030|   NPS|  86| Half Year|         H1|       2023|ecomm_combined|
|     1031|   NPS|  61| Half Year|         H1|       2023|ecomm_combined|
|     1032|   NPS|  96| Half Year|         H1|       2023|ecomm_combined|
|     1033|   NPS|  91| Half Year|         H1|       2023|ecomm_combined|
|     1034|   NPS|  79| Half Year|         H1|       2023|ecomm_combined|
|     1035|   NPS|  53| Half Year|         H1|       2023|ecomm_combined|
|     1036|   NPS|null| Half Year|         H1|       2023|         store|

My mean-calculation DataFrame looks like this:

goal = raw_df.groupBy('metric','time_frame', 'time_period','fiscal_year','channel').mean('goal')

+------+----------+-----------+-----------+--------------+-----------------+
|metric|time_frame|time_period|fiscal_year|       channel|        avg(goal)|
+------+----------+-----------+-----------+--------------+-----------------+
|  null|      null|       null|       null|          null|             null|
|   NPS| Half Year|         H1|       2023|ecomm_combined|75.24033149171271|
|   NPS| Half Year|         H1|       2023|         store|             78.0|
+------+----------+-----------+-----------+--------------+-----------------+

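So I want to insert this computed mean wherever the goal column in raw_df is null (the data type does not matter), grouping by the columns metric, time_frame, time_period, fiscal_year, and channel. How can I do this with a Spark or Pandas DataFrame?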

3npbholx #1

With pandas, use groupby.transform and boolean indexing:

cols = ['metric','time_frame', 'time_period','fiscal_year','channel']

raw_df.loc[raw_df['goal'].isna(), 'goal'] = (raw_df.groupby(cols)
                                             ['goal'].transform('mean')
                                             )
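
To see why this assignment works, here is a minimal sketch on a made-up six-row miniature of raw_df (the rows and the names group_mean and mask are illustrative assumptions, not the asker's full data). transform('mean') returns one value per input row, carrying the same index as raw_df, so the group mean can be written straight into the rows where goal is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of raw_df: two channels, one missing goal in each.
raw_df = pd.DataFrame({
    "store_nbr": [1000, 1004, 1008, 1001, 1003, 1014],
    "metric": ["NPS"] * 6,
    "goal": [53.0, 85.0, np.nan, 81.0, 65.0, np.nan],
    "time_frame": ["Half Year"] * 6,
    "time_period": ["H1"] * 6,
    "fiscal_year": [2023] * 6,
    "channel": ["store", "store", "store",
                "ecomm_combined", "ecomm_combined", "ecomm_combined"],
})

cols = ["metric", "time_frame", "time_period", "fiscal_year", "channel"]

# One mean per row, aligned to raw_df's index; missing goals are skipped
# when the mean of each group is computed.
group_mean = raw_df.groupby(cols)["goal"].transform("mean")

# Overwrite only the rows whose goal is missing.
mask = raw_df["goal"].isna()
raw_df.loc[mask, "goal"] = group_mean[mask]

print(raw_df[["store_nbr", "channel", "goal"]])
```

In this sketch store 1008 receives 69.0, the mean of 53 and 85, and store 1014 receives 73.0, the mean of 81 and 65. The answer's one-liner assigns the full transformed Series and relies on pandas index alignment to reach the same result.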

Or use fillna:

cols = ['metric','time_frame', 'time_period','fiscal_year','channel']

raw_df['goal'] = raw_df['goal'].fillna(raw_df.groupby(cols)
                                       ['goal'].transform('mean')
                                       )

Output:

    store_nbr metric       goal time_frame time_period  fiscal_year         channel
0        1000    NPS  53.000000  Half Year          H1         2023           store
1        1001    NPS  81.000000  Half Year          H1         2023  ecomm_combined
2        1003    NPS  65.000000  Half Year          H1         2023  ecomm_combined
3        1004    NPS  85.000000  Half Year          H1         2023           store
4        1007    NPS  53.000000  Half Year          H1         2023           store
5        1008    NPS  76.571429  Half Year          H1         2023           store
6        1009    NPS  85.000000  Half Year          H1         2023           store
7        1011    NPS  72.000000  Half Year          H1         2023           store
8        1012    NPS  71.000000  Half Year          H1         2023  ecomm_combined
9        1013    NPS  52.000000  Half Year          H1         2023  ecomm_combined
10       1014    NPS  72.666667  Half Year          H1         2023  ecomm_combined
11       1016    NPS  54.000000  Half Year          H1         2023  ecomm_combined
12       1017    NPS  69.000000  Half Year          H1         2023  ecomm_combined
13       1018    NPS  93.000000  Half Year          H1         2023  ecomm_combined
14       1020    NPS  93.000000  Half Year          H1         2023           store
15       1022    NPS  95.000000  Half Year          H1         2023           store
16       1023    NPS  86.000000  Half Year          H1         2023  ecomm_combined
17       1025    NPS  72.000000  Half Year          H1         2023  ecomm_combined
18       1026    NPS  70.000000  Half Year          H1         2023  ecomm_combined
19       1027    NPS  72.666667  Half Year          H1         2023  ecomm_combined
20       1028    NPS  63.000000  Half Year          H1         2023  ecomm_combined
21       1029    NPS  66.000000  Half Year          H1         2023  ecomm_combined
22       1030    NPS  86.000000  Half Year          H1         2023  ecomm_combined
23       1031    NPS  61.000000  Half Year          H1         2023  ecomm_combined
24       1032    NPS  96.000000  Half Year          H1         2023  ecomm_combined
25       1033    NPS  91.000000  Half Year          H1         2023  ecomm_combined
26       1034    NPS  79.000000  Half Year          H1         2023  ecomm_combined
27       1035    NPS  53.000000  Half Year          H1         2023  ecomm_combined
28       1036    NPS  76.571429  Half Year          H1         2023           store
