Mann-Whitney U test in PySpark

0yg35tkg · posted 2021-07-13 in Spark

I have two DataFrames, shown below:

+----------+------------+-----------+
| ddos_date|count_before|ColumnIndex|
+----------+------------+-----------+
|2017-12-30|          88|          0|
|2017-12-30|         129|          1|
|2017-12-30|          98|          2|
|2017-12-30|          90|          3|
|2017-12-30|          80|          4|
|2017-12-30|          84|          5|
|2017-12-30|         158|          6|
|2018-01-01|          98|          7|
|2018-01-01|          90|          8|
|2018-01-01|          80|          9|
|2018-01-01|          84|         10|
|2018-01-01|         158|         11|
|2018-01-01|          74|         12|
|2018-01-01|         162|         13|
+----------+------------+-----------+

+----------+-----------+-----------+
| ddos_date|count_after|ColumnIndex|
+----------+-----------+-----------+
|2017-12-30|         60|          0|
|2017-12-30|        117|          1|
|2017-12-30|        167|          2|
|2017-12-30|         88|          3|
|2017-12-30|        158|          4|
|2017-12-30|         74|          5|
|2017-12-30|        162|          6|
|2017-12-30|        144|          7|
|2018-01-01|        167|          8|
|2018-01-01|         88|          9|
|2018-01-01|        129|         10|
|2018-01-01|         98|         11|
|2018-01-01|        162|         12|
|2018-01-01|        144|         13|
|2018-01-01|        116|         14|
|2018-01-01|         82|         15|
+----------+-----------+-----------+

I joined these DataFrames, and the result looks like this:

+------------+----------+-----------+----------+
|count_before|      date|count_after|      date|
+------------+----------+-----------+----------+
|          88|2017-12-30|         60|2017-12-30|
|          98|2018-01-01|        144|2017-12-30|
|         158|2017-12-30|        162|2017-12-30|
|          80|2018-01-01|         88|2018-01-01|
|          84|2017-12-30|         74|2017-12-30|
|         129|2017-12-30|        117|2017-12-30|
|          84|2018-01-01|        129|2018-01-01|
|          90|2017-12-30|         88|2017-12-30|
|          74|2018-01-01|        162|2018-01-01|
|          90|2018-01-01|        167|2018-01-01|
|         158|2018-01-01|         98|2018-01-01|
|          98|2017-12-30|        167|2017-12-30|
|          80|2017-12-30|        158|2017-12-30|
|         162|2018-01-01|        144|2018-01-01|
|        null|      null|        116|2018-01-01|
|        null|      null|         82|2018-01-01|
+------------+----------+-----------+----------+

The goal is to group by date and run a Mann-Whitney U test on two samples of different sizes: count_before and count_after.
Please help.
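
For context, SciPy's mannwhitneyu accepts two samples of unequal size directly. Outside of Spark, on the 2017-12-30 rows alone, the computation would look like this (a plain-Python sketch using the values from the tables above):

from scipy.stats import mannwhitneyu

# count_before and count_after for 2017-12-30, taken from the tables above
before = [88, 129, 98, 90, 80, 84, 158]
after = [60, 117, 167, 88, 158, 74, 162, 144]

# Returns the U statistic and the p-value
stat, p = mannwhitneyu(before, after)
print(stat, p)

The difficulty is running this computation per date group inside PySpark.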
For further reference, this is how the DataFrames are built: the per-date pieces are appended to lists and then unioned into single DataFrames:

from functools import reduce
from pyspark.sql import DataFrame

df1_list = []
df2_list = []

for d in date_list:
    # hdf() returns the before/after DataFrames for date d
    df1, df2 = hdf(df, d)

    df1_list.append(df1)
    df2_list.append(df2)

# Merge the per-date pieces into single DataFrames
df1 = reduce(DataFrame.unionAll, df1_list)
df2 = reduce(DataFrame.unionAll, df2_list)

dgsult0t · answer 1#

I'm not familiar with the Mann-Whitney U test, but below is my attempt. Let me know if this is what you want to compute. I used SciPy's function.

from scipy.stats import mannwhitneyu
import pyspark.sql.functions as F

# Pair the before/after counts by date and row index, then collect each
# date's pairs into a single array so the test can run once per group.
result = df1.join(
    df2, ['ddos_date', 'ColumnIndex']
).drop('ColumnIndex').groupBy('ddos_date').agg(
    F.collect_list(F.array('count_before', 'count_after')).alias('arr')
).withColumn(
    'u_value',
    F.udf(
        # mannwhitneyu returns (U statistic, p-value)
        lambda arr:
            [float(j) for j in mannwhitneyu([i[0] for i in arr], [i[1] for i in arr])],
        'array<float>'
    )('arr')
)

result.show(truncate=False)
+----------+----------------------------------------------------------------------------+------------------+
|ddos_date |arr                                                                         |u_value           |
+----------+----------------------------------------------------------------------------+------------------+
|2018-01-01|[[90, 167], [80, 88], [84, 129], [158, 98], [74, 162], [162, 144]]          |[9.5, 0.09969863] |
|2017-12-30|[[88, 60], [129, 117], [98, 167], [90, 88], [80, 158], [84, 74], [158, 162]]|[21.0, 0.35042578]|
+----------+----------------------------------------------------------------------------+------------------+
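
Note that the join on ColumnIndex keeps only row indices present in both DataFrames, so the extra count_after rows (the null rows in the join shown in the question) are silently dropped. If the two samples per date should keep their different sizes, one option is to collect each side separately and join on the date alone. A minimal sketch, assuming df1 and df2 are the two DataFrames shown in the question:

import pyspark.sql.functions as F
from scipy.stats import mannwhitneyu

# Collect all values for each date separately, so the two
# samples can have different lengths.
before = df1.groupBy('ddos_date').agg(
    F.collect_list('count_before').alias('before'))
after = df2.groupBy('ddos_date').agg(
    F.collect_list('count_after').alias('after'))

# mannwhitneyu returns (U statistic, p-value)
mwu = F.udf(lambda b, a: [float(v) for v in mannwhitneyu(b, a)],
            'array<float>')

result = before.join(after, 'ddos_date') \
    .withColumn('u_value', mwu('before', 'after'))
result.show(truncate=False)

Since mannwhitneyu itself accepts unequal sample lengths, no row pairing is needed; each date simply carries two independent lists.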
