当我试图通过pyspark从Cassandra表读取数据时,工作正常。但当我试图将Dataframe写入Cassandra表时,会出现java.lang.NoClassDefFoundError,其中包含相同的Spark-Cassandra连接包。
版本详细信息:
cassandra :
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.18 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
Spark:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.3
/_/
Using Python version 2.7.5 (default, Jun 11 2019 14:33:56)
Spark - Cassandra连接器:
bin/pyspark --packages datastax:spark-cassandra-connector:2.4.0-s_2.11
编码:
>>> from pyspark import SparkContext, SparkConf
>>> from pyspark.sql import SQLContext, SparkSession
>>> from pyspark.sql.types import *
>>> import os
>>> spark = SparkSession.builder \
... .appName('SparkCassandraApp') \
... .config('spark.cassandra.connection.host', '127.0.0.1') \
... .config('spark.cassandra.connection.port', '9042') \
... .config('spark.cassandra.output.consistency.level','ONE') \
... .master('local[2]') \
... .getOrCreate()
>>> df = spark.read.format("org.apache.spark.sql.cassandra").options(table="emp",keyspace="tutorialspoint").load()
>>> df.show()
+------+---------+--------+----------+-------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|
+------+---------+--------+----------+-------+
| 2|Hyderabad| robin|9848022339| 40000|
| 1|Hyderabad| ram|9848022338| 50000|
| 3| Chennai| rahman|9848022330| 45000|
+------+---------+--------+----------+-------+
在同一终端中尝试写入Cassandra表
>>> df.write\
... .format("org.apache.spark.sql.cassandra")\
... .mode('append')\
... .options(table="emp", keyspace="tutorialspoint")\
... .save()
19/09/26 15:34:15 ERROR Executor: Exception in task 6.0 in stage 3.0 (TID 25)
java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsUpdater.<init>(OutputMetricsUpdater.scala:153)
at org.apache.spark.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:75)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:209)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:197)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:183)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/09/26 15:34:15 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 26)
java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsUpdater.<init>(OutputMetricsUpdater.scala:153)
at org.apache.spark.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:75)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:209)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:197)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:183)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2条答案
按热度按时间ubof19bj1#
您需要一个不同的
cassandra-connector
。Datastax连接器可以用于Scala/Java,但您需要一些用于Python对应物的东西。arguenot/pyspark-cassandra是Datastax Spark连接器的Python端口。
示例:
6uxekuva2#
我也遇到过这种情况,可能是连接器的标准--包出了问题。jsr long adder jar丢失了,或者被另一个pyspark不识别的jar替换了。如果你下载丢失的jar,并将其作为另一个--包,它将工作。
由以下答案得出的解
Jupyter Cassandra Save Problem - java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder