When I create a Spark session and try to point its local directory at S3 (to avoid executor memory problems), Spark still creates temporary files on my local disk.
This is the configuration I am using:
spark = SparkSession.builder.master("local[*]").appName("upsert_delta_table") \
    .config("spark.driver.memory", "50g") \
    .config("spark.executor.memory", "50g") \
    .config("spark.hadoop.fs.s3a.access.key", aws_access_key_id) \
    .config("spark.hadoop.fs.s3a.secret.key", aws_secret_access_key) \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.endpoint", "s3.sa-east-1.amazonaws.com") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.local.dir", "s3a://datalake-silver.dummy/spark_logs/temp_files_for_calculations") \
    .config("spark.hadoop.fs.s3a.buffer.dir", "s3a://datalake-silver.dummy/spark_logs/temp_files_for_calculations") \
    .config("spark.worker.cleanup.enabled", "true") \
    .getOrCreate()
For example, I can mount, read, and write Delta tables with this notation (so it is not a connectivity problem):
spark.sql(f"CREATE TABLE IF NOT EXISTS {database}.{table_name} USING DELTA LOCATION 's3://datalake-silver.dummy/database/table_1'")
I run the script from a terminal and found this in the logs, but I don't know what to do about it:
23/11/21 00:19:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/21 00:19:40 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/11/21 00:19:41 WARN Utils: The configured local directories are not expected to be URIs; however, got suspicious values [s3a://datalake-silver.dummy/spark_logs/temp_files_for_calculations]. Please check your configured local directories.
23/11/21 00:19:43 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
1 Answer
spark.local.dir must be a directory on the host's local filesystem; Spark cannot use an S3 URI as its scratch space, which is why you still see temporary files on local disk (and why the log warns about "suspicious values"). If you need more temporary storage, attach a larger local disk or an EBS volume and point spark.local.dir there.
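A minimal sketch of the corrected session, assuming a hypothetical host-local mount point /mnt/spark-tmp (for example an attached EBS volume); both spark.local.dir and fs.s3a.buffer.dir expect local filesystem paths, while the Delta table data itself can stay in S3:

from pyspark.sql import SparkSession

# /mnt/spark-tmp is a hypothetical mount point for a local or EBS disk;
# adjust it to whatever scratch storage exists on your host.
spark = (
    SparkSession.builder.master("local[*]").appName("upsert_delta_table")
    .config("spark.driver.memory", "50g")
    .config("spark.executor.memory", "50g")
    # Shuffle/spill scratch space: must be a local path, not a URI.
    .config("spark.local.dir", "/mnt/spark-tmp")
    # S3A buffers uploads on local disk before sending them to S3.
    .config("spark.hadoop.fs.s3a.buffer.dir", "/mnt/spark-tmp/s3a")
    # ... keep the S3A credential, endpoint, and Delta configs from the question ...
    .getOrCreate()
)

Note that on a cluster this setting is overridden by the cluster manager (SPARK_LOCAL_DIRS in standalone/Mesos/Kubernetes, LOCAL_DIRS in YARN), exactly as the SparkConf warning in your log says.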