Can someone explain why I have to manually copy com.amazonaws_aws-java-sdk-bundle into my local $SPARK_HOME, even though I'm using the automatic package resolver --packages?
What I do is submit Spark from the spark-shell as follows:
$SPARK_HOME/bin/spark-shell \
--master k8s://https://localhost:6443 \
--deploy-mode client \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=spark:spark-docker \
--packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0 \
--conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.s3a.access.key=$MINIO_ACCESS_KEY \
--conf spark.hadoop.fs.s3a.secret.key=$MINIO_SECRET_KEY \
--conf spark.hadoop.fs.s3a.endpoint=$MINIO_ENDPOINT \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.driver.port=4040 \
--name spark-locally
My setup is the latest Spark 3.0.1 with Hadoop 3.2, running against local Kubernetes on Docker Desktop for Mac.
As I said, the command above successfully downloads the dependencies for --packages org.apache.hadoop:hadoop-aws:3.2.0, including com.amazonaws#aws-java-sdk-bundle;1.11.375 as a transitive dependency:
Ivy Default Cache set to: /Users/sspaeti/.ivy2/cache
The jars for the packages stored in: /Users/sspaeti/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sspaeti/Documents/spark/spark-3.0.1-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hadoop#hadoop-aws added as a dependency
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-91fd31e1-0b2a-448c-9c69-fd9dc430d41c;1.0
confs: [default]
found org.apache.hadoop#hadoop-aws;3.2.0 in central
found com.amazonaws#aws-java-sdk-bundle;1.11.375 in central
found io.delta#delta-core_2.12;0.7.0 in central
found org.antlr#antlr4;4.7 in central
found org.antlr#antlr4-runtime;4.7 in central
found org.antlr#antlr-runtime;3.5.2 in central
found org.antlr#ST4;4.0.8 in central
found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
found org.glassfish#javax.json;1.0.4 in central
found com.ibm.icu#icu4j;58.2 in central
:: resolution report :: resolve 376ms :: artifacts dl 22ms
:: modules in use:
com.amazonaws#aws-java-sdk-bundle;1.11.375 from central in [default]
com.ibm.icu#icu4j;58.2 from central in [default]
io.delta#delta-core_2.12;0.7.0 from central in [default]
org.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]
org.antlr#ST4;4.0.8 from central in [default]
org.antlr#antlr-runtime;3.5.2 from central in [default]
org.antlr#antlr4;4.7 from central in [default]
org.antlr#antlr4-runtime;4.7 from central in [default]
org.apache.hadoop#hadoop-aws;3.2.0 from central in [default]
org.glassfish#javax.json;1.0.4 from central in [default]
But why do I then always get the error java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException? What I don't understand is: since I'm running in deploy-mode client, I thought Maven/Ivy would resolve all dependencies for my local Spark (driver), wouldn't it? Or where is the missing piece of the puzzle?
I also tried --packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0,com.amazonaws:aws-java-sdk-bundle:1.11.375 with no luck either.
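To double-check where things go wrong, this is roughly what I run inside the spark-shell to see whether the bundle is even reachable from the driver (just a quick diagnostic sketch, not part of my original command):
// lists the jars that --packages / --jars registered with this SparkContext
spark.sparkContext.listJars().foreach(println)
// throws ClassNotFoundException if the class cannot be loaded on the driver
Class.forName("com.amazonaws.services.s3.model.MultiObjectDeleteException")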
My workaround, but I don't know why I need to do it
What works is if I manually copy the jar (from Maven, or straight from my downloaded .ivy2 folder) like this:
cp $HOME/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.375.jar $SPARK_HOME/jars
After that, I can successfully read from and write to my local S3 (MinIO).
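The smoke test I use to confirm this is roughly the following in the spark-shell (the s3a bucket/path is only an example from my MinIO setup):
// write a tiny Delta table to MinIO via s3a, then read it back
spark.range(5).write.format("delta").mode("overwrite").save("s3a://my-bucket/delta-smoke-test")
spark.read.format("delta").load("s3a://my-bucket/delta-smoke-test").show()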
Working with Jupyter
Another strange thing is that I also have a Jupyter notebook running on my local Kubernetes, and there --packages works just fine. I'm using PySpark there, so is the difference that PySpark works but the spark-shell doesn't?
If so, how would I run the same test locally with PySpark instead of the spark-shell?
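What I would try (just a sketch, reusing the same flags as the spark-shell command above) is launching PySpark locally like this:
$SPARK_HOME/bin/pyspark \
--master k8s://https://localhost:6443 \
--deploy-mode client \
--packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0 \
--name pyspark-locally
# ...plus the same --conf settings (Kubernetes, s3a credentials/endpoint, Delta catalog and extensions) as above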
Thanks a lot for any explanation; I have already wasted a lot of time on this.