pandas UDF引发了异常错误:'属性错误:'numpy.ndarray'对象没有属性'array'

wnavrhmk  于 2022-12-09  发布在  其他
关注(0)|答案(1)|浏览(175)

我正在使用panda_udf()在PySpark Dataframe 上运行python函数。
Python异常错误:UDF引发了异常错误:'属性错误:'numpy.ndarray'对象没有属性'array "

from pyspark.sql import functions as F
import pyspark.sql.types as T
import pandas as pd
import numpy as np
from scipy import stats

df = sqlContext.createDataFrame( 
    [(25, 20, .25), 
    (20, 20, .22), 
    (35, 20, .67)], 
    ["control_mean", "control_sd", "pooled_se"]
)

df.show()

def foo(control_mean: pd.Series, control_sd: pd.Series, pooled_se: pd.Series) -> pd.Series:
    
    mu_null = 0

    ##Calculate Likelihood of Null
    pdf = stats.norm.pdf(control_mean, mu_null, pooled_se)
 
    return(pdf)

foo_pudf = F.pandas_udf(foo, returnType=T.FloatType())

df.withColumn(
    "pdf", 
    foo_pudf(
        F.col("control_mean"), 
        F.col("control_sd"), 
        F.col("pooled_se")
    )
).show()

来自stats.norm.pdf的输出似乎触发了错误。这个输出是numpy.float64类型的。但是我可以在其他panda_udf中使用来自np.sqrt()numpy.float64输出而没有问题。所以我不确定是什么导致了这里的错误。

vlju58qv

vlju58qv1#

从pandas_udf返回一个pandas系列:

@F.pandas_udf(T.FloatType())
def foo(control_mean: pd.Series, control_sd: pd.Series, pooled_se: pd.Series) -> pd.Series:
    mu_null = 0
    ##Calculate Likelihood of Null
    pdf = stats.norm.pdf(control_mean, mu_null, pooled_se)
    return pd.Series(pdf)

df.withColumn(
    "pdf", 
    foo(
        F.col("control_mean"), 
        F.col("control_sd"), 
        F.col("pooled_se")
    )
).show()

+------------+----------+---------+---+
|control_mean|control_sd|pooled_se|pdf|
+------------+----------+---------+---+
|          25|        20|     0.25|0.0|
|          20|        20|     0.22|0.0|
|          35|        20|     0.67|0.0|
+------------+----------+---------+---+

调试panda_udf的提示

将pandas_udf的返回类型更改为StringType,并返回长度等于行数的序列。对于返回序列的每个元素,可以使用文本格式的调试条目,也可以将其保留为空字符串。
例如:在下面的输出中,我们将打印

  • pdf的类型
  • pdf的形状
  • pdf的字符串表示
@F.pandas_udf(T.StringType())
def foo(control_mean: pd.Series, control_sd: pd.Series, pooled_se: pd.Series) -> pd.Series:
    mu_null = 0
    ##Calculate Likelihood of Null
    pdf = stats.norm.pdf(control_mean, mu_null, pooled_se)
    return pd.Series([str(type(pdf)), str(pdf.shape), str(pdf)])

df.withColumn(
    "pdf", 
    foo(
        F.col("control_mean"), 
        F.col("control_sd"), 
        F.col("pooled_se")
    )
).show(truncate=False)

+------------+----------+---------+-----------------------+
|control_mean|control_sd|pooled_se|pdf                    |
+------------+----------+---------+-----------------------+
|25          |20        |0.25     |<class 'numpy.ndarray'>|
|20          |20        |0.22     |(3,)                   |
|35          |20        |0.67     |[0. 0. 0.]             |
+------------+----------+---------+-----------------------+

相关问题