Setting an S3 directory for the temporary files of a Delta table Spark session

6yt4nkrj asked on 2023-11-21 in Apache

When I try to create a Spark session and point its local directory at S3 (to avoid executor memory problems), Spark still creates the temporary files on my local disk.
This is the configuration I am using:

spark = SparkSession.builder.master("local[*]").appName("upsert_delta_table") \
    .config("spark.driver.memory", "50g") \
    .config("spark.executor.memory", "50g") \
    .config("spark.hadoop.fs.s3a.access.key", aws_access_key_id) \
    .config("spark.hadoop.fs.s3a.secret.key", aws_secret_access_key) \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.endpoint", "s3.sa-east-1.amazonaws.com") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.hadoop.fs.s3a.path.style.access", True) \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.local.dir", "s3a://datalake-silver.dummy/spark_logs/temp_files_for_calculations") \
    .config("spark.hadoop.fs.s3a.buffer.dir", "s3a://datalake-silver.dummy/spark_logs/temp_files_for_calculations") \
    .config("spark.worker.cleanup.enabled", True) \
    .getOrCreate()

For example, I can mount, read and write Delta tables with this notation (so it is not a connection problem):

spark.sql(f"CREATE TABLE IF NOT EXISTS {database}.{table_name} USING DELTA LOCATION 's3://datalake-silver.dummy/database/table_1'")


I run the script from the terminal and found the following in the logs, but I don't know what to do about it:

23/11/21 00:19:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/21 00:19:40 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/11/21 00:19:41 WARN Utils: The configured local directories are not expected to be URIs; however, got suspicious values [s3a://datalake-silver.dummy/spark_logs/temp_files_for_calculations]. Please check your configured local directories.
23/11/21 00:19:43 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties

ldioqlga1#

  1. spark.local.dir is local to the host.
  2. S3 is not a real filesystem and cannot be used as a drop-in replacement for one.
  3. spark.hadoop.fs.s3a.buffer.dir must be on a local filesystem, because it is where blocks being written are buffered before they are uploaded to the S3 store. Declaring it to be an S3 location makes no sense.

You need more local storage, either temporary or on EBS; see the configuration sketch below.
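
A minimal sketch of a session built along these lines: the table data stays on S3, while the scratch directories point at local disk. The path /mnt/ebs/spark-tmp is a placeholder for whatever local or EBS-backed directory you provision, not something from the original question, and the credential and path-style settings from the question are omitted for brevity.

# Hypothetical local scratch directory (e.g. on an EBS volume); it must exist
# and have enough free space for shuffle/spill files and s3a upload buffers.
import os
from pyspark.sql import SparkSession

local_tmp = "/mnt/ebs/spark-tmp"
os.makedirs(local_tmp, exist_ok=True)

spark = SparkSession.builder.master("local[*]").appName("upsert_delta_table") \
    .config("spark.driver.memory", "50g") \
    .config("spark.local.dir", local_tmp) \
    .config("spark.hadoop.fs.s3a.buffer.dir", os.path.join(local_tmp, "s3a")) \
    .config("spark.hadoop.fs.s3a.endpoint", "s3.sa-east-1.amazonaws.com") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Delta tables are still read from and written to s3a:// locations as before;
# only Spark's temporary files and the s3a upload buffers stay on local disk.

Note that in standalone, Mesos/Kubernetes and YARN deployments the cluster manager overrides spark.local.dir via SPARK_LOCAL_DIRS / LOCAL_DIRS, as the first warning in the question already points out, so in those setups the scratch location has to be changed there instead.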
