Median of an array column in pandas or Spark, computed for all rows at once

8mmmxcuj · posted 2022-12-02 in Spark

Strangely enough, I can't find anywhere on the internet whether this is possible.
I have a dataframe with an array column.

arr_col
[1,3,4]
[4,3,5]

I want result

Result
3
4

I want the median for each row.
I managed to do it with a pandas udf, but it iterates over the column and applies np.median to each row.
I don't want that because it's slow and works row by row. I want it to act on all rows at the same time.
Either in pandas or pyspark.


6jygbczu1#

Use numpy:

import numpy as np
df['Result'] = np.median(np.vstack(df['arr_col']), axis=1)

Or explode and groupby.median:

df['Result'] = (df['arr_col'].explode()
                 .groupby(level=0).median()
                )

Output:

arr_col  Result
0  [1, 3, 4]     3.0
1  [4, 3, 5]     4.0

Input used:

import pandas as pd
df = pd.DataFrame({'arr_col': [[1,3,4], [4,3,5]]})
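Putting the answer's snippets together, here is a minimal runnable version. Note that the np.vstack approach only works when every row's list has the same length; for ragged lists, the explode route works (casting to float first, since exploded values have object dtype):

```python
import numpy as np
import pandas as pd

# Equal-length lists: stack into a 2-D array, take the median across each row.
df = pd.DataFrame({'arr_col': [[1, 3, 4], [4, 3, 5]]})
df['Result'] = np.median(np.vstack(df['arr_col']), axis=1)
print(df)

# Ragged (unequal-length) lists: explode, then group by the original index.
ragged = pd.DataFrame({'arr_col': [[1, 3, 4], [4, 3, 5, 9]]})
ragged['Result'] = (ragged['arr_col'].explode().astype(float)
                    .groupby(level=0).median())
print(ragged)
```

Both assignments align back to the original rows by index, so the result column lines up with `arr_col` even after exploding.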

bmp9r5qi2#

You can use a udf in pyspark.

import numpy as np
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

m = udf(lambda x: float(np.median(x)), DoubleType())
df.withColumn('Result', m(col('arr_col'))).show()

+---+---------+------+
| Id|  arr_col|Result|
+---+---------+------+
|  1|[1, 3, 4]|   3.0|
|  2|[4, 3, 5]|   4.0|
+---+---------+------+
