Hadoop 2.9.2, Spark 2.4.0: accessing an AWS S3A bucket

balp4ylt · posted 2021-05-31 · in Hadoop

It has been a few days now and I still cannot download from a public Amazon bucket with Spark :(
Here is the spark-shell command:

spark-shell --master yarn \
            -v \
            --jars file:/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar,file:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar \
            --driver-class-path=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar

The application starts and the shell comes up at the prompt:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val data1 = sc.textFile("s3a://my-bucket-name/README.md")

18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 242.1 KB, free 246.7 MB)
18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.2 KB, free 246.6 MB)
18/12/25 13:06:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop-edge01:3545 (size: 24.2 KB, free: 246.9 MB)
18/12/25 13:06:40 INFO SparkContext: Created broadcast 0 from textFile at <console>:24
data1: org.apache.spark.rdd.RDD[String] = s3a://my-bucket-name/README.md MapPartitionsRDD[1] at textFile at <console>:24

scala> data1.count()

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
... 49 elided
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 77 more

scala>

All AWS access key and secret key settings are in hadoop/core-site.xml, as described in "Hadoop-AWS module: Integration with Amazon Web Services" (see the config sketch after the jar listing below).
The bucket is public - anyone can download from it (tested with curl -O).
As you can see, all the .jars come from Hadoop itself, from the /usr/local/hadoop/share/hadoop/tools/lib/ folder.
There are no extra settings in spark-defaults.conf - only what is passed on the command line.
Neither jar provides the missing class:

jar tf /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)

jar tf /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)
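
For reference, a minimal sketch of the core-site.xml properties mentioned above (the property names are the standard S3A ones; the key values are placeholders, and since the bucket is public the anonymous credentials provider shown at the end could be used instead of keys):

<!-- sketch of the S3A settings in core-site.xml; key values are placeholders -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
<!-- for a public bucket, anonymous access works without any keys -->
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>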

What should I do? Did I forget to add yet another jar? What exactly is the right combination of hadoop-aws and aws-java-sdk-bundle? Which versions?


kxxlusnw1#

I would advise you not to do what you did. You are running Spark pre-built against Hadoop 2.7.x on top of Hadoop 2.9.2, and then adding yet more jars to the classpath to try to patch over the resulting version mismatch.
You should use the "Hadoop free" Spark build and provide the Hadoop classes through configuration instead, as described here: https://spark.apache.org/docs/2.4.0/hadoop-provided.html
The key part goes in conf/spark-env.sh.
If the hadoop binary is on your PATH:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

With an explicit path to the hadoop binary:

export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

Passing a Hadoop configuration directory:

export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
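
Putting that together, a minimal spark-env.sh sketch for this setup could look as follows (the Hadoop path is the one from the question; note that hadoop-aws and the AWS SDK live under share/hadoop/tools/lib, which is not always part of the hadoop classpath output, so it is appended explicitly here):

# conf/spark-env.sh - sketch, paths taken from the setup in the question
export HADOOP_HOME=/usr/local/hadoop
export SPARK_DIST_CLASSPATH=$("${HADOOP_HOME}/bin/hadoop" classpath)
# hadoop-aws-2.9.2.jar and aws-java-sdk-bundle-1.11.199.jar sit under
# tools/lib and may not be on the default classpath, so add them for s3a://
export SPARK_DIST_CLASSPATH="${SPARK_DIST_CLASSPATH}:${HADOOP_HOME}/share/hadoop/tools/lib/*"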

rdlzhqv92#

Hmm... I finally found the problem.
The main issue is that my Spark was pre-built for Hadoop. It is "v2.4.0 pre-built for Apache Hadoop 2.7 and later", which is a somewhat misleading title, as you can see from my struggle above: Spark actually ships with its own set of Hadoop jars. A listing of /usr/local/spark/jars/ shows that it contains:
hadoop-common-2.7.3.jar
hadoop-client-2.7.3.jar
....
It is only missing hadoop-aws and the aws-java-sdk. I found them in the Maven repository: hadoop-aws v2.7.3 and its dependency aws-java-sdk v1.7.4 - and voilà! I downloaded those jars and passed them to Spark as parameters, like this:
spark-shell --master yarn \
            -v \
            --jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar \
            --driver-class-path=/home/aws-java-sdk-1.7.4.jar:/home/hadoop-aws-2.7.3.jar
Works great!!!
I am just wondering why none of the jars from Hadoop worked when I passed them all to --jars and --driver-class-path - Spark somehow picks up its own bundled jars instead of what I send.
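
If it helps anyone reproduce this, the version matching can be checked with something along these lines (the jars directory is the one from my install; the aws-java-sdk version is the one listed as a dependency of hadoop-aws in the Maven repository):

# Sketch: find out which Hadoop version the Spark distribution bundles,
# then pick the matching hadoop-aws artifact and the aws-java-sdk version
# it depends on in Maven
ls /usr/local/spark/jars/ | grep hadoop-common
# hadoop-common-2.7.3.jar  -> use hadoop-aws-2.7.3.jar + aws-java-sdk-1.7.4.jar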


nukf8bse3#

I use Spark 2.4.5, and this is what I did to get it working. I was able to connect to AWS S3 from Spark on my local machine.

(1) Download Spark 2.4.5 from here: https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12.tgz. This build does not include Hadoop.
(2) Download Hadoop: https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
(3) Update .bash_profile:

export SPARK_HOME=<SPARK_PATH>   # example: /home/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12
export PATH=$PATH:$SPARK_HOME/bin

(4) Add Hadoop to the Spark env:
Copy spark-env.sh.template to spark-env.sh and add

export SPARK_DIST_CLASSPATH=$(<hadoop_path> classpath)

where <hadoop_path> is the path to your hadoop binary, e.g. /home/hadoop-3.2.1/bin/hadoop
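
As a rough end-to-end check (bucket name and keys are placeholders; if the s3a classes are still missing, the hadoop-aws and aws-java-sdk-bundle jars shipped under the downloaded Hadoop's share/hadoop/tools/lib directory may additionally have to be passed via --jars):

# conf/spark-env.sh line, hadoop path as in step (4)
export SPARK_DIST_CLASSPATH=$(/home/hadoop-3.2.1/bin/hadoop classpath)

# then start spark-shell, passing the S3A credentials as Hadoop properties,
# and try reading an object from the bucket
spark-shell --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
            --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY
scala> sc.textFile("s3a://my-bucket-name/README.md").count()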
