Connecting to Spark in standalone mode from Python and accessing S3 with S3AFileSystem

Posted by fhity93d on 2021-07-15 in Hadoop

Setup
Spark `spark-2.4.5-bin-without-hadoop`, configured against `hadoop-2.8.5`.
Spark is running in standalone mode.
The PySpark application runs as a separate container and connects to the master from another container.

What needs to happen?
The PySpark application should be able to access S3 using `org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider`.

Attempted solution #1
The master, the workers, and the PySpark application all have the following jars:

```
hadoop-aws-2.8.5.jar
hadoop-common-2.8.5.jar
aws-java-sdk-1.10.6.jar
jackson-core-2.2.3.jar
aws-java-sdk-s3-1.10.6.jar
jackson-databind-2.2.3.jar
jackson-annotations-2.2.3.jar
joda-time-2.9.4.jar
```

The PySpark application is configured as follows:

```python
import os
from pyspark.sql import SparkSession

builder = SparkSession.builder

# Put the jars listed above ahead of Spark's own classpath.
builder = builder.config("spark.executor.userClassPathFirst", "true")
builder = builder.config("spark.driver.userClassPathFirst", "true")

class_path = "/hadoop-aws-2.8.5.jar:/hadoop-common-2.8.5.jar:/aws-java-sdk-1.10.6.jar:/jackson-core-2.2.3.jar:/aws-java-sdk-s3-1.10.6.jar:/jackson-databind-2.2.3.jar:/jackson-annotations-2.2.3.jar:/joda-time-2.9.4.jar"
builder = builder.config("spark.driver.extraClassPath", class_path)
builder = builder.config("spark.executor.extraClassPath", class_path)

# S3A with temporary (session) credentials taken from the environment.
builder = builder.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
builder = builder.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
builder = builder.config("spark.hadoop.fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
builder = builder.config("spark.hadoop.fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
builder = builder.config("spark.hadoop.fs.s3a.session.token", os.environ.get("AWS_SESSION_TOKEN"))
```

Output

```
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.authentication.util.KerberosUtil.hasKerberosTicket(Ljavax/security/auth/Subject;)Z
....
```

This is because `hadoop-auth` is missing. I downloaded `hadoop-auth` and got yet another error, which led me to conclude that I will need to provide all of the dependency jars.
Instead of collecting these by hand, I went through Maven: I imported `hadoop-aws` into IntelliJ IDEA to see which other dependencies it pulls in, and it gave me this list (I don't know whether there is an `mvn` command that would do the same):

```
accessors-smart-1.2.jar
activation-1.1.jar
apacheds-i18n-2.0.0-M15.jar
apacheds-kerberos-codec-2.0.0-M15.jar
api-asn1-api-1.0.0-M20.jar
api-util-1.0.0-M20.jar
asm-3.1.jar
asm-5.0.4.jar
avro-1.7.4.jar
aws-java-sdk-core-1.10.6.jar
aws-java-sdk-kms-1.10.6.jar
aws-java-sdk-s3-1.10.6.jar
commons-beanutils-1.7.0.jar
commons-beanutils-core-1.8.0.jar
commons-cli-1.2.jar
commons-codec-1.4.jar
commons-collections-3.2.2.jar
commons-compress-1.4.1.jar
commons-configuration-1.6.jar
commons-digester-1.8.jar
commons-io-2.4.jar
commons-lang-2.6.jar
commons-logging-1.1.3.jar
commons-math3-3.1.1.jar
commons-net-3.1.jar
curator-client-2.7.1.jar
curator-framework-2.7.1.jar
curator-recipes-2.7.1.jar
gson-2.2.4.jar
guava-11.0.2.jar
hadoop-annotations-2.8.5.jar
hadoop-auth-2.8.5.jar
hadoop-aws-2.8.5.jar
hadoop-common-2.8.5.jar
htrace-core4-4.0.1-incubating.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
jackson-annotations-2.2.3.jar
jackson-core-2.2.3.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.2.3.jar
jackson-jaxrs-1.8.3.jar
jackson-mapper-asl-1.9.13.jar
jackson-xc-1.8.3.jar
java-xmlbuilder-0.4.jar
jaxb-api-2.2.2.jar
jaxb-impl-2.2.3-1.jar
jcip-annotations-1.0-1.jar
jersey-core-1.9.jar
jersey-json-1.9.jar
jersey-server-1.9.jar
jets3t-0.9.0.jar
jettison-1.1.jar
jetty-6.1.26.jar
jetty-sslengine-6.1.26.jar
jetty-util-6.1.26.jar
joda-time-2.9.4.jar
jsch-0.1.54.jar
json-smart-2.3.jar
jsp-api-2.1.jar
jsr305-3.0.0.jar
log4j-1.2.17.jar
netty-3.7.0.Final.jar
nimbus-jose-jwt-4.41.1.jar
paranamer-2.3.jar
protobuf-java-2.5.0.jar
servlet-api-2.5.jar
slf4j-api-1.7.10.jar
slf4j-log4j12-1.7.10.jar
snappy-java-1.0.4.1.jar
stax-api-1.0-2.jar
xmlenc-0.52.jar
xz-1.0.jar
zookeeper-3.4.6.jar
```

I suspect these also need to be included in the PySpark application. Downloading each `.jar` separately would take too long and is probably not the way to go.
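If the jars did all end up in one directory, the `extraClassPath` string from attempt #1 could at least be assembled programmatically instead of typed out by hand; a small sketch with a hypothetical directory:

```python
import glob
import os

# Hypothetical directory holding all of the downloaded dependency jars.
jar_dir = "/opt/spark-extra-jars"

# Join every jar in that directory into one ':'-separated string, suitable for
# spark.driver.extraClassPath / spark.executor.extraClassPath.
class_path = ":".join(sorted(glob.glob(os.path.join(jar_dir, "*.jar"))))
```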
Attempted solution #2

```python
import os
from pyspark.sql import SparkSession

builder = SparkSession.builder

# Resolve hadoop-aws and its companions from Maven coordinates instead of shipping jars by hand.
builder = builder.config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.8.5,org.apache.hadoop:hadoop-common:2.8.5,com.amazonaws:aws-java-sdk-s3:1.10.6")

# Same S3A configuration as before, with temporary credentials from the environment.
builder = builder.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
builder = builder.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
builder = builder.config("spark.hadoop.fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
builder = builder.config("spark.hadoop.fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
builder = builder.config("spark.hadoop.fs.s3a.session.token", os.environ.get("AWS_SESSION_TOKEN"))
```

Instead of providing the `jars` myself, I used `spark.jars.packages` (the `--packages` equivalent), which also resolves the transitive dependencies. However, when trying to read from S3 I get the following error:

```
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
...
```

My research
Spark using S3A: java.lang.NoSuchMethodError
NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities when reading S3 data with Spark
https://hadoop.apache.org/docs/r3.1.0/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#java.lang.nosuchmethoderror_referencing_an_org.apache.hadoop_class

From these I conclude that my problem has to do with the `hadoop-aws` dependencies. However, I'm having trouble verifying this: the Hadoop version that Spark runs against is the same version as the jars I connect with, and I have provided all of the dependencies.
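For reference, one way to check from a running PySpark session which Hadoop build the driver JVM actually loaded, and which jar a given class came from; a sketch, assuming an active `SparkSession` named `spark`:

```python
# Ask the driver JVM which Hadoop build it actually loaded.
jvm = spark.sparkContext._jvm
print(jvm.org.apache.hadoop.util.VersionInfo.getVersion())

# Find out which jar on the driver classpath a specific class was loaded from.
cls = jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
print(cls.getProtectionDomain().getCodeSource().getLocation())
```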
Any tools or commands that could help debug this issue would be much appreciated.

ibps3vxo1#

Try creating an admin user for the S3 bucket and connect with that user's credentials instead of using temporary ones.
https://bartek-blog.github.io/python/spark/2019/04/22/how-to-access-s3-from-pyspark.html
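A minimal sketch of that suggestion, assuming long-lived keys are available in the usual environment variables; `SimpleAWSCredentialsProvider` takes a static key/secret pair, so no session token is configured:

```python
import os
from pyspark.sql import SparkSession

# Sketch of the suggestion above: a long-lived access key/secret pair instead of
# temporary session credentials, so no session token is needed.
builder = SparkSession.builder
builder = builder.config("spark.hadoop.fs.s3a.aws.credentials.provider",
                         "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
builder = builder.config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
builder = builder.config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
```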


b1zrtrql2#

I found this code on a website; it may help you:

```python
import boto3

# Assume a role via STS using a named AWS profile; this returns temporary credentials.
session = boto3.session.Session(profile_name='MyUserProfile')
sts_connection = session.client('sts')
response = sts_connection.assume_role(RoleArn='ARN_OF_THE_ROLE_TO_ASSUME', RoleSessionName='THIS_SESSIONS_NAME', DurationSeconds=3600)
credentials = response['Credentials']

url = str('s3a://bucket/key/data.csv')

# Feed the temporary credentials to S3A on an existing SparkSession named `spark`.
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.access.key', credentials['AccessKeyId'])
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.secret.key', credentials['SecretAccessKey'])
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.session.token', credentials['SessionToken'])
spark.read.csv(url).show(1)
```
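Note that `assume_role` still hands back temporary credentials (an access key, secret key, and session token), which is why this snippet keeps `TemporaryAWSCredentialsProvider` rather than switching providers.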
