Apply a function to a column in an RDD (Python, Spark)

Posted by svgewumm on 2021-07-09 in Spark

This question already has answers here:

pyspark - Best way to sum values in a column of type Array(Integer()) (5 answers)
Closed 20 days ago.
This is my RDD:

+--+-----------------+
|id|              arr|
+--+-----------------+
| 1|[8,5,1,11,10,8,2]|
| 2|  [3,6,3,1,0,1,2]|
| 3|  [4,2,2,0,1,1,3]|
| 4|  [0,0,0,0,0,2,0]|
| 5|  [3,4,7,3,2,1,2]|
| 6|  [1,0,1,0,6,0,0]|
| 7|  [2,1,2,2,9,3,0]|
| 8|  [3,2,2,3,1,0,3]|
| 9|[1,1,7,12,11,5,5]|
+--+-----------------+

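I am trying to work out how to apply a function that sums all the numbers in each list and returns the sum in a separate column. This is my function (I am using Python). It works on a single array, but I don't know how to apply it to a column in the RDD.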

def sum_func(x):
  t = 0
  for i in range(0, len(x)):
    t = t + x[i]
  return t

rdd python apache-spark pyspark List

Source: https://stackoverflow.com/questions/66961856/apply-function-to-a-column-in-rdd-python-spark

1 Answer

5vf7fwbs1#

To apply the function to a column of a DataFrame, you can create and apply a user-defined function (UDF).

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def sum_func(x):
  t = 0
  for i in range(0, len(x)):
    t = t + x[i]
  return t

# Creating the UDF with return type Integer

sum_func_udf = udf(sum_func,IntegerType())

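Then, on the DataFrame (assuming it is stored in df), use withColumn to add another column. withColumn takes the new column's name as its first argument and the column expression as its second: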

df = df.withColumn(
   "sum",
   sum_func_udf(df.arr)
)
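
The question mentions an RDD rather than a DataFrame. As a rough sketch, assuming the RDD holds (id, arr) tuples, the same sum could be attached with a plain map over the records. The name append_sum is hypothetical, and Python's built-in map stands in for rdd.map here so that the snippet runs without a Spark session.

```python
# Hypothetical helper: takes one (id, arr) record and appends the sum of arr.
def append_sum(record):
    row_id, arr = record
    return (row_id, arr, sum(arr))

# A few rows from the question, written as plain Python tuples.
rows = [
    (1, [8, 5, 1, 11, 10, 8, 2]),
    (2, [3, 6, 3, 1, 0, 1, 2]),
    (4, [0, 0, 0, 0, 0, 2, 0]),
]

# Built-in map stands in for Spark; on a real RDD of (id, arr) tuples
# the equivalent call would be rdd.map(append_sum).
result = list(map(append_sum, rows))
for row in result:
    print(row)
```

If the data really is a DataFrame, as the printed table suggests, the UDF approach above is the one that applies; the map version only fits when the records are available as tuples.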

Answered on 2021-07-09
