Below are the runtime versions in PyCharm:
Java Home /Library/Java/JavaVirtualMachines/jdk-11.0.16.1.jdk/Contents/Home
Java Version 11.0.16.1 (Oracle Corporation)
Scala Version 2.12.15
Spark Version spark-3.3.1
Python 3.9
I am trying to write a PySpark DataFrame to CSV as follows:
df.write.csv("/Users/data/data.csv")
and I get this error:
Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/pyspark/sql/readwriter.py", line 1240, in csv
    self._jwrite.csv(path)
  File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o747.csv.
: java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
The Spark configuration is as follows:
import os

from pyspark import SparkConf

spark_conf = SparkConf()
spark_conf.setAll(parameters.items())
# Dependencies fetched at startup
spark_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4')
# S3 credentials
spark_conf.set('spark.hadoop.fs.s3.aws.credentials.provider',
               'org.apache.hadoop.fs.s3.TemporaryAWSCredentialsProvider')
spark_conf.set('spark.hadoop.fs.s3.access.key', os.environ.get('AWS_ACCESS_KEY_ID'))
spark_conf.set('spark.hadoop.fs.s3.secret.key', os.environ.get('AWS_SECRET_ACCESS_KEY'))
spark_conf.set('spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled', 'true')
spark_conf.set("com.amazonaws.services.s3.enableV4", "true")
# S3A filesystem implementation and credential providers
spark_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark_conf.set("fs.s3a.aws.credentials.provider",
               "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
# Upload behaviour
spark_conf.set("hadoop.fs.s3a.path.style.access", "true")
spark_conf.set("hadoop.fs.s3a.fast.upload", "true")
spark_conf.set("hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
spark_conf.set("fs.s3a.path.style.access", "true")
spark_conf.set("fs.s3a.multipart.size", "128M")
spark_conf.set("fs.s3a.fast.upload.active.blocks", "4")
# Committer settings
spark_conf.set("fs.s3a.committer.name", "partitioned")
spark_conf.set("spark.hadoop.fs.s3a.committer.name", "directory")
spark_conf.set("spark.sql.sources.commitProtocolClass",
               "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
spark_conf.set("spark.sql.parquet.output.committer.class",
               "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
spark_conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
Any help with this problem would be appreciated. Thanks!!
1 Answer
It looks like you haven't added the hadoop-cloud module. That class is not part of core Spark; it ships in a separate artifact: https://search.maven.org/artifact/org.apache.spark/spark-hadoop-cloud_2.12/3.3.1/jar
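A minimal sketch of the fix, assuming you keep pulling dependencies through spark.jars.packages: add the spark-hadoop-cloud coordinates (taken from the Maven link above, matching your Scala 2.12 / Spark 3.3.1) alongside hadoop-aws. Both go in one comma-separated value, since a second set() on the same key would overwrite the first:

# Both packages in one comma-separated value; a later set() on the
# same key would replace the earlier one
spark_conf.set('spark.jars.packages',
               'org.apache.hadoop:hadoop-aws:3.3.4,'
               'org.apache.spark:spark-hadoop-cloud_2.12:3.3.1')

Equivalently, if the job is launched with spark-submit, the same coordinates can be passed on the command line (app.py is a placeholder for your entry script):

spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.4,org.apache.spark:spark-hadoop-cloud_2.12:3.3.1 app.py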