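I trained and saved an LDA model on Databricks using pyspark.ml.clustering, and now I need to predict topics for new data. However, when I try to use the prediction results, I get an error.
This is the input data schema (the tokenizedText DataFrame):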
ID:string
Year:integer
TypeComment:string
NewText:string
ExecutionName:string
ExecutionTime:string
Tokens:array
element:string
Here is a summary of the training code:
from pyspark.ml.feature import IDF, HashingTF, Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA, LDAModel, LocalLDAModel
counter = CountVectorizer(inputCol="Tokens", outputCol="term_frequency", minDF=10)
counterModel = counter.fit(tokenizedText)
vectorizedLaw= counterModel.transform(tokenizedText)
lda_tf = LDA(k=6, maxIter=100, featuresCol="term_frequency", seed=135)
model_TF = lda_tf.fit(vectorizedLaw)
Then I run predictions on the training data, and everything works fine:
predictions_TF = model_TF.transform(vectorizedLaw)
predictions_TF.select("topicDistribution").show(5, truncate=False)
+---------------------------------------------------------------------------------------------------------------------------+
|topicDistribution |
+---------------------------------------------------------------------------------------------------------------------------+
|[0.013571512425910889,0.6217200752455205,0.2961210273974943,0.02637133808190742,0.01725914968266371,0.02495689716650296] |
|[0.05687289141286662,0.06042583918761498,0.07003525643520062,0.10832523389472587,0.072220782376337,0.6321199966932549] |
|[0.021911946837097802,0.02328240957204057,0.02699809068833656,0.8610644212787785,0.027841122709084537,0.038902008914662105]|
|[0.004887677638064053,0.0051942680450804005,0.00600826758941306,0.009373274444250117,0.4726839053470049,0.5018526069361875]|
|[0.013570322581255357,0.014437643205896269,0.01669607176655056,0.02612486131653466,0.6240083606668544,0.30516274046290875] |
+---------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows
So I decided to save the model:
model_TF.save('/dbfs/mnt/docs/model6_4_pyspark')
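Finally, I wrote the code to predict topics for new comments. I load the model and repeat the same steps on the new text (I am fairly sure the schema of the new data matches the training DataFrame):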
from pyspark.ml.feature import IDF, HashingTF, Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA, LDAModel, LocalLDAModel
lda = LocalLDAModel.load('/dbfs/mnt/docs/model6_4_pyspark')
counter = CountVectorizer(inputCol="Tokens", outputCol="term_frequency", minDF=10)
counterModel = counter.fit(tokenizedText)
vectorizedLaw= counterModel.transform(tokenizedText)
predictions = lda.transform(vectorizedLaw)
predictions.select("topicDistribution").show(5)
However, I get an error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 215.0 failed 1 times, most recent failure: Lost task 0.0 in stage 215.0 (TID 601, ip-10-172-225-237.us-west-2.compute.internal, executor driver): org.apache.spark.SparkException: Failed to execute user defined function(LDAModel$$Lambda$5764/1348803876: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
This is the full error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-464806927610545> in <module>
----> 1 predictions.select("topicDistribution").show(5)
/databricks/spark/python/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
439 """
440 if isinstance(truncate, bool) and truncate:
--> 441 print(self._jdf.showString(n, 20, vertical))
442 else:
443 print(self._jdf.showString(n, int(truncate), vertical))
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a,**kw)
126 def deco(*a,**kw):
127 try:
--> 128 return f(*a,**kw)
129 except py4j.protocol.Py4JJavaError as e:
130 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o2276.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 215.0 failed 1 times, most recent failure: Lost task 0.0 in stage 215.0 (TID 601, ip-10-172-225-237.us-west-2.compute.internal, executor driver): org.apache.spark.SparkException: Failed to execute user defined function(LDAModel$$Lambda$5764/1348803876: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
at org.apache.spark.scheduler.Task.run(Task.scala:117)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$11(Executor.scala:657)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:660)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IndexOutOfBoundsException: (4731,0) not in [-4157,4157) x [-6,6)
at breeze.linalg.DenseMatrix$mcD$sp.apply$mcD$sp(DenseMatrix.scala:106)
at breeze.linalg.DenseMatrix$mcD$sp.apply(DenseMatrix.scala:103)
at breeze.linalg.DenseMatrix$mcD$sp.apply(DenseMatrix.scala:52)
at breeze.linalg.Matrix.apply(Matrix.scala:44)
at breeze.linalg.Matrix.apply$(Matrix.scala:44)
at breeze.linalg.DenseMatrix.apply(DenseMatrix.scala:52)
at breeze.linalg.DenseMatrix.apply(DenseMatrix.scala:52)
at breeze.linalg.SliceMatrix.apply(SliceMatrix.scala:23)
at breeze.linalg.Matrix.$anonfun$toDenseMatrix$1(Matrix.scala:125)
at breeze.linalg.Matrix.$anonfun$toDenseMatrix$1$adapted(Matrix.scala:124)
at breeze.linalg.MatrixConstructors.$anonfun$tabulate$2(Matrix.scala:230)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at breeze.linalg.MatrixConstructors.$anonfun$tabulate$1(Matrix.scala:229)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at breeze.linalg.MatrixConstructors.tabulate(Matrix.scala:229)
at breeze.linalg.MatrixConstructors.tabulate$(Matrix.scala:227)
at breeze.linalg.DenseMatrix$.tabulate(DenseMatrix.scala:360)
at breeze.linalg.Matrix.toDenseMatrix(Matrix.scala:124)
at breeze.linalg.Matrix.toDenseMatrix$(Matrix.scala:123)
at breeze.linalg.SliceMatrix.toDenseMatrix(SliceMatrix.scala:13)
at breeze.linalg.Matrix.toDenseMatrix$mcD$sp(Matrix.scala:123)
at breeze.linalg.Matrix.toDenseMatrix$mcD$sp$(Matrix.scala:123)
at breeze.linalg.SliceMatrix.toDenseMatrix$mcD$sp(SliceMatrix.scala:13)
at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$.variationalTopicInference(LDAOptimizer.scala:618)
at org.apache.spark.ml.clustering.LDAModel.$anonfun$getTopicDistributionMethod$1(LDA.scala:502)
... 14 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2478)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2427)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2426)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2426)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1131)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1131)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1131)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2678)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2625)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2613)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:917)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2313)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:298)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:308)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:82)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:508)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2986)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3692)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2710)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3684)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3682)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2710)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2917)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:304)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:341)
at sun.reflect.GeneratedMethodAccessor410.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(LDAModel$$Lambda$5764/1348803876: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
at org.apache.spark.scheduler.Task.run(Task.scala:117)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$11(Executor.scala:657)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:660)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.IndexOutOfBoundsException: (4731,0) not in [-4157,4157) x [-6,6)
at breeze.linalg.DenseMatrix$mcD$sp.apply$mcD$sp(DenseMatrix.scala:106)
at breeze.linalg.DenseMatrix$mcD$sp.apply(DenseMatrix.scala:103)
at breeze.linalg.DenseMatrix$mcD$sp.apply(DenseMatrix.scala:52)
at breeze.linalg.Matrix.apply(Matrix.scala:44)
at breeze.linalg.Matrix.apply$(Matrix.scala:44)
at breeze.linalg.DenseMatrix.apply(DenseMatrix.scala:52)
at breeze.linalg.DenseMatrix.apply(DenseMatrix.scala:52)
at breeze.linalg.SliceMatrix.apply(SliceMatrix.scala:23)
at breeze.linalg.Matrix.$anonfun$toDenseMatrix$1(Matrix.scala:125)
at breeze.linalg.Matrix.$anonfun$toDenseMatrix$1$adapted(Matrix.scala:124)
at breeze.linalg.MatrixConstructors.$anonfun$tabulate$2(Matrix.scala:230)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at breeze.linalg.MatrixConstructors.$anonfun$tabulate$1(Matrix.scala:229)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at breeze.linalg.MatrixConstructors.tabulate(Matrix.scala:229)
at breeze.linalg.MatrixConstructors.tabulate$(Matrix.scala:227)
at breeze.linalg.DenseMatrix$.tabulate(DenseMatrix.scala:360)
at breeze.linalg.Matrix.toDenseMatrix(Matrix.scala:124)
at breeze.linalg.Matrix.toDenseMatrix$(Matrix.scala:123)
at breeze.linalg.SliceMatrix.toDenseMatrix(SliceMatrix.scala:13)
at breeze.linalg.Matrix.toDenseMatrix$mcD$sp(Matrix.scala:123)
at breeze.linalg.Matrix.toDenseMatrix$mcD$sp$(Matrix.scala:123)
at breeze.linalg.SliceMatrix.toDenseMatrix$mcD$sp(SliceMatrix.scala:13)
at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$.variationalTopicInference(LDAOptimizer.scala:618)
at org.apache.spark.ml.clustering.LDAModel.$anonfun$getTopicDistributionMethod$1(LDA.scala:502)
... 14 more
Note: the new data to predict on is larger than the training data.
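The size difference in the note above may be what the innermost exception is reporting. `(4731,0) not in [-4157,4157) x [-6,6)` reads as a request for row 4731 of a matrix with 4157 rows and 6 columns, which would match a 4157-term training vocabulary and k=6. Because the prediction code fits a fresh CountVectorizer on the new data, its vocabulary can be larger and indexed differently from the one the LDA model was trained on. The sketch below imitates that effect with a simplified, pure-Python stand-in for CountVectorizer. The helper names `fit_vocabulary` and `to_indices` and the toy documents are made up for illustration, and Spark orders its vocabulary by term frequency rather than alphabetically.

```python
from collections import Counter


def fit_vocabulary(docs, min_df=1):
    """Map each term found in at least min_df documents to a column index."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    kept = sorted(term for term, df in doc_freq.items() if df >= min_df)
    return {term: index for index, term in enumerate(kept)}


def to_indices(doc, vocab):
    """Column indices a document would occupy; unknown terms are dropped."""
    return sorted({vocab[term] for term in doc if term in vocab})


# Toy corpora (assumption): the new data contains terms never seen in training.
train_docs = [["tax", "law"], ["tax", "court"], ["law", "court"]]
new_docs = train_docs + [["appeal", "judge", "zoning"], ["zoning", "appeal"]]

train_vocab = fit_vocabulary(train_docs)   # what the LDA model was trained on
refit_vocab = fit_vocabulary(new_docs)     # what a second fit() produces

# The topic matrix learned at training time has one row per training term.
n_rows = len(train_vocab)

# Largest index produced for the new data under each vocabulary.
refit_max = max(i for doc in new_docs for i in to_indices(doc, refit_vocab))
reuse_max = max(i for doc in new_docs for i in to_indices(doc, train_vocab))

print("topic matrix rows:", n_rows)
print("max index after refitting:", refit_max, "in range:", refit_max < n_rows)
print("max index reusing training vocab:", reuse_max, "in range:", reuse_max < n_rows)
```

Even when a refitted index happens to stay in range, the same term can land on a different column (here "court" moves from 0 to 1), so the topic distribution would be computed against the wrong words.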
I am using Databricks Runtime 7.1 ML (includes Apache Spark 3.0.0 and Scala 2.12).
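I have already reviewed the comments on a related question here, but I still have not solved the problem. Does anyone know what is happening? What is going wrong? Thanks in advance!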
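If the failure does come from refitting CountVectorizer on the new data, one option would be to save the CountVectorizerModel fitted at training time (for example with `counterModel.save(...)`) and load it at prediction time instead of calling `fit` again. A minimal sketch under that assumption follows. `vocab_mismatch`, `predict_topics`, and both path arguments are hypothetical names, not part of the original code, and pyspark is imported inside the function only so that the size check can run without a Spark session.

```python
def vocab_mismatch(vectorizer_vocab_size, lda_vocab_size):
    """Return a message describing the mismatch, or None if the sizes agree."""
    if vectorizer_vocab_size != lda_vocab_size:
        return (
            "CountVectorizer vocabulary has %d terms but the LDA model "
            "expects %d" % (vectorizer_vocab_size, lda_vocab_size)
        )
    return None


def predict_topics(new_df, vectorizer_path, lda_path):
    """Vectorize new_df with the training-time vocabulary, then apply LDA.

    vectorizer_path is assumed to hold a CountVectorizerModel saved during
    training; lda_path is where the LDA model was saved.
    """
    # Imported here so vocab_mismatch stays usable without Spark installed.
    from pyspark.ml.clustering import LocalLDAModel
    from pyspark.ml.feature import CountVectorizerModel

    counter_model = CountVectorizerModel.load(vectorizer_path)
    lda_model = LocalLDAModel.load(lda_path)

    # Fail early with a readable message instead of an IndexOutOfBoundsException.
    problem = vocab_mismatch(len(counter_model.vocabulary), lda_model.vocabSize())
    if problem is not None:
        raise ValueError(problem)

    # transform only: no fit() on the new data, so indices match training.
    vectorized = counter_model.transform(new_df)
    return lda_model.transform(vectorized)
```

With the names from the question, a call might look like `predict_topics(tokenizedText, vectorizer_path, '/dbfs/mnt/docs/model6_4_pyspark')`, where `vectorizer_path` is wherever the fitted `counterModel` was saved.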