Median of an array column in pandas or Spark, computed for all rows at once

8mmmxcuj · posted 2022-12-02 in Spark

Strangely enough, I can't find anywhere on the internet whether this is possible.
I have a dataframe with an array column.

arr_col
[1,3,4]
[4,3,5]

I want result

Result
3
4

I want the median for each row.
I managed to do it with a pandas udf, but it iterates over the column and applies np.median to each row.
I don't want that because it's slow and works row by row. I want it to act on all rows at the same time.
Either in pandas or pyspark.


6jygbczu1#

Use numpy:

import numpy as np
df['Result'] = np.median(np.vstack(df['arr_col']), axis=1)

Or explode and groupby.median:

df['Result'] = (df['arr_col'].explode()
                 .groupby(level=0).median()
                )

Output:

arr_col  Result
0  [1, 3, 4]     3.0
1  [4, 3, 5]     4.0

Input used:

import pandas as pd
df = pd.DataFrame({'arr_col': [[1,3,4], [4,3,5]]})
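Putting the answer's snippets together, here is a minimal runnable version. Note that the np.vstack approach only works when every row's list has the same length; for ragged lists, the explode route works (casting to float first, since exploded values have object dtype):

```python
import numpy as np
import pandas as pd

# Equal-length lists: stack into a 2-D array, take the median across each row.
df = pd.DataFrame({'arr_col': [[1, 3, 4], [4, 3, 5]]})
df['Result'] = np.median(np.vstack(df['arr_col']), axis=1)
print(df)

# Ragged (unequal-length) lists: explode, then group by the original index.
ragged = pd.DataFrame({'arr_col': [[1, 3, 4], [4, 3, 5, 9]]})
ragged['Result'] = (ragged['arr_col'].explode().astype(float)
                    .groupby(level=0).median())
print(ragged)
```

Both assignments align back to the original rows by index, so the result column lines up with `arr_col` even after exploding.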

bmp9r5qi2#

You can use a udf in pyspark.

import numpy as np
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

m = udf(lambda x: float(np.median(x)), DoubleType())
df.withColumn('Result', m(col('arr_col'))).show()

+---+---------+------+
| Id|  arr_col|Result|
+---+---------+------+
|  1|[1, 3, 4]|   3.0|
|  2|[4, 3, 5]|   4.0|
+---+---------+------+
