Trying to replicate a SQL statement in PySpark, getting "Column is not iterable"

Posted by np8igboo on 2023-02-05 in Apache


I am using PySpark to transform data in a DataFrame. The old extract used the following SQL line:

case when location_type = 'SUPPLIER' then SUBSTRING(location_id,1,length(location_id)-3)

I pull in the data and load it into a DataFrame, then try to do the transformation with the following code:

df = df.withColumn("location_id", F.when(df.location_type == "SUPPLIER",
                         F.substring(df.location_id, 1, F.length(df.location_id) - 3))
                         .otherwise(df.location_id))

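The substring method takes an int as its third argument, but length() returns a Column object. I had no luck trying to cast it, and I have not found a method that accepts a Column. I also tried the expr() wrapper, but could not get that to work either.
Supplier IDs look like 12345-01. The transformation needs to strip the -01.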
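
For reference, the intended result can be written out in plain Python, which is handy for checking the output of whichever Spark expression ends up being used. This is only a sketch: the function name and the "STORE" type are made up for illustration, and leaving non-supplier rows unchanged mirrors the .otherwise(...) branch in the code above rather than anything stated in the SQL fragment.

```python
def strip_supplier_suffix(location_type: str, location_id: str) -> str:
    """Plain-Python mirror of SUBSTRING(location_id, 1, length(location_id) - 3)."""
    if location_type == "SUPPLIER":
        # Keep everything except the last three characters; clamp at zero
        # so IDs shorter than three characters yield an empty string.
        keep = max(len(location_id) - 3, 0)
        return location_id[:keep]
    # Assumption: other location types pass through unchanged.
    return location_id


print(strip_supplier_suffix("SUPPLIER", "12345-01"))  # 12345
print(strip_supplier_suffix("STORE", "12345-01"))     # 12345-01
```

With "12345-01" the length is 8, so the substring keeps the first 5 characters and the -01 suffix is dropped.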

apache-spark

Source: https://stackoverflow.com/questions/75338240/trying-to-replicate-a-sql-statement-in-pyspark-getting-column-not-iterable

1 Answer


du7egjpx1#

As you mentioned, you can use expr to call substring with an index computed from another column, like this:

df = df.withColumn("location_id",
    F.when(df.location_type == "SUPPLIER",
        F.expr("substring(location_id, 1, length(location_id) - 3)")
    ).otherwise(df.location_id)
)

Posted 2023-02-05
