UDF to extract a string in Scala

bprjcwpo · asked 2021-05-18 · in Spark
Follow (0) | Answers (2) | Views (675)

I am trying to extract the last set of numbers from this data type:

urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)

In this example, I am trying to extract 10342800535 as a string.
Here is my code in Scala:

def extractNestedUrn(urn: String): String = {
    val arr = urn.split(":").map(_.trim)
    val nested = arr(3)
    val clean = nested.substring(1, nested.length -1)
    val subarr = clean.split(":").map(_.trim)
    val res = subarr(3)
    val out = res.split(",").map(_.trim)
    val fin = out(1)
    fin.toString
  }

This is run as a UDF, and it throws the following error:

org.apache.spark.SparkException: Failed to execute user defined function

What am I doing wrong?

b4lqfgs4 · answer 1

You can simply use the regexp_extract function. Check this out:

import org.apache.spark.sql.functions.{col, regexp_extract}
import spark.implicits._ // assumes an active SparkSession named `spark`

val df = Seq(("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)")).toDF("x")

df.show(false)
+-------------------------------------------------------------------+
|x                                                                  |
+-------------------------------------------------------------------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|
+-------------------------------------------------------------------+

df.withColumn("NestedUrn", regexp_extract(col("x"), """.*,(\d+)""", 1)).show(false)
+-------------------------------------------------------------------+-----------+
|x                                                                  |NestedUrn  |
+-------------------------------------------------------------------+-----------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|10342800535|
+-------------------------------------------------------------------+-----------+
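For a quick check outside Spark, the same pattern can be exercised in plain Scala (a sketch; `regexp_extract` applies the same Java regex semantics per row, and group index 1 refers to the first capturing group):

```scala
val s = "urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)"

// Same pattern passed to regexp_extract: the greedy ".*" backtracks to the
// last comma, so group 1 captures only the trailing digits.
val pattern = """.*,(\d+)""".r

val extracted = pattern.findFirstMatchIn(s).map(_.group(1)).getOrElse("")
println(extracted) // 10342800535
```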
dfty9e19 · answer 2

One of the reasons the exception org.apache.spark.SparkException: Failed to execute user defined function is raised is that an exception was thrown inside the user-defined function itself.

Analysis

If I try to run your user-defined function with the sample input you provided, using the following code:

import org.apache.spark.sql.functions.{col, udf}
import sparkSession.implicits._

val dataframe = Seq("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)").toDF("urn")

def extractNestedUrn(urn: String): String = {
  val arr = urn.split(":").map(_.trim)
  val nested = arr(3)
  val clean = nested.substring(1, nested.length -1)
  val subarr = clean.split(":").map(_.trim)
  val res = subarr(3)
  val out = res.split(",").map(_.trim)
  val fin = out(1)
  fin.toString
}

val extract_urn = udf(extractNestedUrn _)

dataframe.select(extract_urn(col("urn"))).show(false)

I get this complete stack trace:

Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(UdfExtractionError$$$Lambda$1165/1699756582: (string) => string)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
  at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
  ...
  at UdfExtractionError$.main(UdfExtractionError.scala:37)
  at UdfExtractionError.main(UdfExtractionError.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
  at UdfExtractionError$.extractNestedUrn$1(UdfExtractionError.scala:29)
  at UdfExtractionError$.$anonfun$main$4(UdfExtractionError.scala:35)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127)
  ... 86 more

The important part of this stack trace is actually:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 3

This is the exception raised while your user-defined function's code is executed. If we analyze the function's code, you split the input twice on ":". The result of the first split is actually this array:

["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]

and not this array:

["urn", "fb", "candidateHiringState", "(urn:fb:contract:187236028,10342800535)"]

So, if we execute the rest of the function's statements, you get:

val arr = ["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
val nested = "(urn"
val clean = "urn"
val subarr = ["urn"]

The next line then accesses the fourth element of the array subarr, which contains only one element. That is what raises the ArrayIndexOutOfBoundsException, which Spark then wraps in a SparkException.

Solution

Although the best solution to your problem is clearly the previous answer using regexp_extract, you can correct your user-defined function as follows:

def extractNestedUrn(urn: String): String = {
  val arr = urn.split(':') // split using character instead of string regexp
  val nested = arr.last // get last element of array, here "187236028,10342800535)"
  val subarr = nested.split(',')
  val res = subarr.last // get last element, here "10342800535)"
  val out = res.init // take all the string except the last character, to remove ')'
  out // no need to use .toString as out is already a String
}
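As a sanity check, the corrected function can be run on the sample input in plain Scala, with no Spark session needed (the function body is repeated here so the snippet is self-contained):

```scala
// Corrected function from the answer above
def extractNestedUrn(urn: String): String = {
  val arr = urn.split(':')  // split on the character ':'
  val nested = arr.last     // "187236028,10342800535)"
  val subarr = nested.split(',')
  val res = subarr.last     // "10342800535)"
  res.init                  // drop the trailing ')'
}

val sample = "urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)"
println(extractNestedUrn(sample)) // 10342800535
```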

However, as stated previously, the best solution is to use the Spark built-in function regexp_extract, as explained in the first answer. Your code will be easier to understand and more performant.
