scala - Converting a column of strings to a column of object lists in an Apache Spark UDF

8i9zcol2  posted on 2021-05-27  in Spark

I have a sample DataFrame:

val A = """[[15,["Printing Calculators"]],[13811,["Office Products"]]]"""
val B = """[[30888,["Paper & Printable Media"]],[223845,["Office Products"]]]"""
val C = """[[64,["Office Calculator Accessories"]]]"""

val df = List(A,B,C).toDF("bestseller_ranks")


I want to create a column like the following:

case class BestSellerRank(
  Ranking: Integer,
  Category: String
)
val A2 = List(new BestSellerRank(15,"Printing Calculators"),new BestSellerRank(13811,"Office Products"))
val B2 = List(new BestSellerRank(30888,"Paper & Printable Media"),new BestSellerRank(223845,"Office Products"))
val C2 =List(new BestSellerRank(64,"Office Calculator Accessories"))

val df2 = List(A2,B2,C2).toDF("bestseller_ranks_transformed")


I tried to create a UDF like this:

val BRUDF: UserDefinedFunction =
  udf(
    (bestseller_ranks: String) => {

      bestseller_ranks.split(",").fold(List.empty[BestSellerRank])(v => new BestSellerRank(v._1, v._2))
    }
  )

But this seems to be complete garbage, and I'm stuck. Thanks for your help!

j2qf4p5b #1

I tried to implement this without a UDF. Maybe this helps.

Load the provided test data

val A = """[[15,["Printing Calculators"]],[13811,["Office Products"]]]"""
val B = """[[30888,["Paper & Printable Media"]],[223845,["Office Products"]]]"""
val C = """[[64,["Office Calculator Accessories"]]]"""

val df = List(A,B,C).toDF("bestseller_ranks")
df.show(false)
df.printSchema()
/**
  * +------------------------------------------------------------------+
  * |bestseller_ranks                                                  |
  * +------------------------------------------------------------------+
  * |[[15,["Printing Calculators"]],[13811,["Office Products"]]]       |
  * |[[30888,["Paper & Printable Media"]],[223845,["Office Products"]]]|
  * |[[64,["Office Calculator Accessories"]]]                          |
  * +------------------------------------------------------------------+
  *
  * root
  * |-- bestseller_ranks: string (nullable = true)
  */

Convert string -> array[struct]

import org.apache.spark.sql.functions.{expr, regexp_replace, split, translate}

// Mark entry boundaries with "##", drop all brackets, then split into one string per entry.
val p = df.withColumn("arr", split(
  translate(
    regexp_replace($"bestseller_ranks", """\]\s*,\s*\[""", "##"), "][", ""
  ), "##"
))

// Turn each "rank,category" string into a struct(Ranking: int, Category: string).
val processed = p.withColumn("bestseller_ranks_transformed", expr("TRANSFORM(arr, x -> " +
  "named_struct('Ranking', cast(split(x, ',')[0] as int), 'Category', split(x, ',')[1]))"))
  .select("bestseller_ranks", "bestseller_ranks_transformed")
processed.show(false)
processed.printSchema()

/**
  * +------------------------------------------------------------------+-----------------------------------------------------------------+
  * |bestseller_ranks                                                  |bestseller_ranks_transformed                                     |
  * +------------------------------------------------------------------+-----------------------------------------------------------------+
  * |[[15,["Printing Calculators"]],[13811,["Office Products"]]]       |[[15, "Printing Calculators"], [13811, "Office Products"]]       |
  * |[[30888,["Paper & Printable Media"]],[223845,["Office Products"]]]|[[30888, "Paper & Printable Media"], [223845, "Office Products"]]|
  * |[[64,["Office Calculator Accessories"]]]                          |[[64, "Office Calculator Accessories"]]                          |
  * +------------------------------------------------------------------+-----------------------------------------------------------------+
  *
  * root
  * |-- bestseller_ranks: string (nullable = true)
  * |-- bestseller_ranks_transformed: array (nullable = true)
  * |    |-- element: struct (containsNull = false)
  * |    |    |-- Ranking: integer (nullable = true)
  * |    |    |-- Category: string (nullable = true)
  */
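
To make the string handling in this answer easier to follow, here is a rough sketch that mirrors the same steps on a single input string using only the Scala standard library. It is not Spark code: `replaceAll`, `filterNot` and `split` merely stand in for `regexp_replace`, `translate` and `split`, and the names `marked`, `stripped`, `entries` and `pairs` are made up for illustration.

```scala
val raw = """[[15,["Printing Calculators"]],[13811,["Office Products"]]]"""

// Step 1 (stands in for regexp_replace): mark the boundary between entries with "##".
val marked = raw.replaceAll("""\]\s*,\s*\[""", "##")

// Step 2 (stands in for translate): delete every '[' and ']' character.
val stripped = marked.filterNot(c => c == '[' || c == ']')

// Step 3 (stands in for split): one string per entry.
val entries = stripped.split("##").toList

// Step 4 (stands in for TRANSFORM): split each entry on ',' into (rank, category).
val pairs = entries.map { e =>
  val parts = e.split(",")
  (parts(0).toInt, parts(1))
}
```

Note that the double quotes around each category survive these steps, which matches the output table above. A category that itself contains a comma would be cut off at that comma, because only the second element of the split is kept.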
jgwigjjp #2

Here is my solution:

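import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

val BRUDF: UserDefinedFunction =
  udf(
    (bestseller_ranks: String) => {
      if (bestseller_ranks != null) {
        bestseller_ranks
          .split("\\]\\],")                          // one chunk per [rank,["category"]] entry
          .map(_.replace("[", "").replace("]", ""))  // drop all remaining brackets
          .map { entry =>
            // split on the first comma only, so commas inside a category are kept
            val Array(rank, category) = entry.split(",", 2)
            new BestSellerRank(rank.trim.toInt, category.trim.stripPrefix("\"").stripSuffix("\""))
          }
      } else {
        null
      }
    }
  )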
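
As a possible alternative to chained `split`/`replace` calls, the parsing can be kept in a small pure function that pulls each entry out with a regular expression. This is only a sketch: `entryPattern` and `parseBestSellerRanks` are hypothetical names, the case class is repeated from the question so the snippet stands on its own, and it assumes category names never contain a double quote.

```scala
// Same shape as the case class in the question.
case class BestSellerRank(Ranking: Integer, Category: String)

// Matches one [rank,["category"]] entry; group 1 is the rank, group 2 the category text.
// Assumption: category names contain no double quotes.
val entryPattern = """\[\s*(\d+)\s*,\s*\[\s*"([^"]*)"\s*\]\s*\]""".r

// Hypothetical helper: returns an empty list for null or for input without entries.
def parseBestSellerRanks(raw: String): List[BestSellerRank] =
  Option(raw).toList.flatMap { s =>
    entryPattern.findAllMatchIn(s).map { m =>
      BestSellerRank(m.group(1).toInt, m.group(2))
    }
  }
```

Because the function has no Spark dependency, it can be checked in a plain Scala REPL first and then wrapped, for example as `udf((s: String) => parseBestSellerRanks(s))`, assuming `org.apache.spark.sql.functions.udf` is in scope. Unlike the comma-based split, the pattern keeps commas that appear inside a category name.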
