count() fails when reading data from Cassandra into a PySpark DataFrame

r3i60tvu  posted on 2023-08-02 in Spark

I am reading data from Cassandra:

df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(**configs)\
    .options(table=tablename, keyspace=keyspace)\
    .option("ssl", True)\
    .option("sslmode", "require")\
    .load()

This df is a PySpark DataFrame. I can call show() and printSchema() on it, but when I run

df.count()


it throws an error:

An error was encountered:
An error occurred while calling o1394.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage 48.0 failed 4 times, most recent failure: Lost task 19.3 in stage 48.0 (TID 2053, js-56258-63801-i-32-w-1.net, executor 9): java.lang.IllegalArgumentException: requirement failed: Column not found in Java driver Row: count


How can I solve this? Thanks in advance.

au9on6nz #1

I assume it does not always fail at the same stage. If that is the case, you can try tuning the read/write parameters:
https://github.com/datastax/spark-cassandra-connector/blob/b2.4/doc/reference.md#read-tuning-parameters
https://github.com/datastax/spark-cassandra-connector/blob/b2.4/doc/reference.md#write-tuning-parameters
When launching pyspark, pass each setting as --conf spark.cassandra.<option>
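As a rough sketch of what that looks like in practice, the helper below builds the `--conf` arguments from a plain dict. The helper name `cassandra_conf_args` is made up for illustration, and the two option names (`input.split.size_in_mb`, `input.fetch.size_in_rows`) are my reading of the 2.4 read-tuning reference linked above; newer connector versions spell them differently, so check the reference for your version. The values are placeholders, not recommendations.

```python
import shlex


def cassandra_conf_args(options):
    """Build pyspark command-line arguments from connector option names.

    `options` maps the part after "spark.cassandra." to a value,
    e.g. {"input.split.size_in_mb": 32}. (Hypothetical helper, not part
    of the connector.)
    """
    args = []
    for name, value in options.items():
        args += ["--conf", f"spark.cassandra.{name}={value}"]
    return args


# Assumed option names from the 2.4 reference; values are illustrative only.
read_tuning = {
    "input.split.size_in_mb": 32,
    "input.fetch.size_in_rows": 500,
}

command = ["pyspark"] + cassandra_conf_args(read_tuning)
print(shlex.join(command))
```

The printed line can be pasted into a shell as-is. Since the failure may be intermittent, it is probably easiest to change one parameter at a time and rerun df.count() to see which setting, if any, makes a difference.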
