Create a PySpark DataFrame column based on a class method

Posted by o4hqfura on 2021-05-17 in Spark

I have a Python class with a method like this:

class Features():
    def __init__(self, json):
        self.json = json

    def get_email(self):
        email = self.json.get('fields', {}).get('email', None)
        return email

I am trying to use the get_email method on a PySpark DataFrame to create a new column based on another column, "raw_json", which holds JSON strings:

df = data.withColumn('email', (F.udf(lambda j: Features.get_email(json.loads(j)), t.StringType()))('raw_json'))

The desired PySpark DataFrame would look like this:

+----------------+----------+
|raw_json        |email     |
+----------------+----------+
|                |          |
+----------------+----------+
|                |          |
+----------------+----------+

But I get this error:

TypeError: unbound method get_email() must be called with Features instance as first argument (got dict instance instead)

How can I make this work?
I have seen similar questions before, but none of them were resolved.

python apache-spark pyspark Function Class

Source: https://stackoverflow.com/questions/64791959/create-pyspark-dataframe-column-based-on-class-method

1 Answer

#1 bxgwgixi

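I think you have misunderstood how classes are used in Python. get_email is an instance method, so it has to be called on a Features object rather than on the class with a dict as its argument. You are probably looking for this: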

udf = F.udf(lambda j: Features(json.loads(j)).get_email())
df = data.withColumn('email', udf('raw_json'))

Here you instantiate a Features object and then call that object's get_email method.
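To check the logic without starting a Spark session, the same idea can be sketched in plain Python. The helper name extract_email, the None guard for null rows, and the sample JSON strings are assumptions added for illustration; only the Features class comes from the question.

```python
import json


class Features:
    def __init__(self, json):
        # The parameter name shadows the json module only inside __init__.
        self.json = json

    def get_email(self):
        return self.json.get('fields', {}).get('email', None)


def extract_email(raw_json):
    """Parse one raw_json string and return its email, or None.

    Hypothetical helper: it builds a Features instance first and then
    calls get_email on that instance, as the lambda in the answer does.
    """
    if raw_json is None:
        # Assumption: null rows should produce a null email.
        return None
    return Features(json.loads(raw_json)).get_email()


# Made-up sample values standing in for the raw_json column.
sample = '{"fields": {"email": "a@example.com"}}'
print(extract_email(sample))
print(extract_email('{"fields": {}}'))
print(extract_email(None))
```

A named function like this can then be wrapped the same way as the lambda, for example F.udf(extract_email, t.StringType()), and applied to the 'raw_json' column.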

Answered on 2021-05-18
