I am trying to read a Kudu table with Apache Spark from a Jupyter notebook running the Apache Toree Scala kernel.
Spark version: 2.2.0, Scala version: 2.11, Apache Toree version: 0.3
This is the code I use to read the Kudu table:
val kuduMasterAddresses = KUDU_MASTER_ADDRESSES_HERE
val kuduMasters: String = Seq(kuduMasterAddresses).mkString(",")
val kuduContext = new KuduContext(kuduMasters, spark.sparkContext)
val table = TABLE_NAME_HERE
def readKudu(table: String) = {
  val tableKuduOptions: Map[String, String] = Map(
    "kudu.table" -> table,
    "kudu.master" -> kuduMasters
  )
  spark.sqlContext.read.options(tableKuduOptions).kudu
}
val kuduTableDF = readKudu(table)
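As an aside, the `Seq(...).mkString(",")` on the second line exists so that several master addresses collapse into the single comma-separated string that the `kudu.master` option expects. A minimal illustration with hypothetical host names (not from the original post):

```scala
// Join multiple Kudu master addresses (hypothetical hosts) into the
// comma-separated form expected by the "kudu.master" option.
val masters: String = Seq("host1:7051", "host2:7051", "host3:7051").mkString(",")
println(masters)
```

With a single address, the same expression simply returns that address unchanged.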
Calling kuduContext.tableExists(table) returns true, and kuduTableDF.columns returns a String array with the correct column names.
The problem appears when I try to apply actions such as count or show... The following exception is raised:
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: java.lang.ClassNotFoundException: org.apache.kudu.spark.kudu.KuduContext$TimestampAccumulator
StackTrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1567)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1555)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1554)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1554)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:803)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1782)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1737)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1726)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2031)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2052)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2071)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2865)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2846)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2845)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2154)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2367)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
at org.apache.spark.sql.Dataset.show(Dataset.scala:641)
at org.apache.spark.sql.Dataset.show(Dataset.scala:600)
at org.apache.spark.sql.Dataset.show(Dataset.scala:609)
I have already loaded the dependencies with the %AddDeps magic in the Apache Toree notebook, as follows:
%AddDeps org.apache.kudu kudu-spark2_2.11 1.6.0 --transitive --trace
%AddDeps org.apache.kudu kudu-client 1.6.0 --transitive --trace
The following import also executes without problems:
import org.apache.kudu.spark.kudu._
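Since the import succeeds on the driver while the action-time ClassNotFoundException is raised during task execution, it looks as if the Kudu jars are visible to the driver but not to the executors. A minimal way to probe whether a class is loadable from the current classloader (a sketch, not from the original post; on a cluster you would run the same check inside an RDD map so it executes on an executor JVM):

```scala
// Returns true if the named class can be loaded from the current classloader.
def classVisible(name: String): Boolean =
  try { Class.forName(name); true }
  catch { case _: ClassNotFoundException => false }

// Always present in any JVM:
println(classVisible("java.lang.String"))
// True only if kudu-spark is actually on this JVM's classpath:
println(classVisible("org.apache.kudu.spark.kudu.KuduContext"))
```

Running the probe on the driver and inside a task makes it easy to tell whether the jars added via %AddDeps were ever shipped to the executors.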