PySpark cannot find the Python package path: "No such file or directory" error

Posted by r1zhe5dt on 2023-04-05 in Spark

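I followed this document: https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
I packed the environment, kept it locally and passed it in --archives, and I also put the environment in HDFS and passed that location in --archives. Neither worked. Can someone help me with this?
job.sh calls run_job.sh, which then triggers the job.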
job.sh

BASE_OUTPUT_PATH='/user/svc-edl_ei_dld/dummy/'
DEFAULT_RUN_DATE=`date +"%Y%m%d"`
export PYSPARK_PYTHON=/opt/miniconda3/envs/ei/bin/python3
export PYSPARK_DRIVER_PYTHON=/opt/miniconda3/envs/ei/bin/python3
export RUN_DATE=$1
export LatLonOutput=$BASE_OUTPUT_PATH/stationDistanceOutput/date=$RUN_DATE
export billboards=$BASE_OUTPUT_PATH/allAssets/
export speedpath=$BASE_OUTPUT_PATH/speedFilter/
export selectPointsPath=$BASE_OUTPUT_PATH/selectPointsDF/date=$RUN_DATE
export partition=10

echo "Running for date $RUN_DATE"

sh scripts/run_selectPoints.sh $RUN_DATE $LatLonOutput $billboards $speedpath $selectPointsPath $partition >> /aaa/bbb/ccc/ddd/eee/fff/ggg/hhh.log 2>&1
if [ $? -ne 0 ];then exit 1 ; fi
echo "Run completed for job for date $RUN_DATE"

run_job.sh

date=$1
inputpath=$2
assetParquetPath=$3
speedFilterpath=$4
outputpath=$5
partition=$6

spark3-submit -v \
--master yarn \
--deploy-mode client \
--num-executors 10 \
--driver-memory 8g \
--executor-memory 8g \
--executor-cores 4 \
--conf spark.sql.shuffle.partitions=6000 \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--conf spark.pyspark.python=/opt/miniconda3/envs/ei/bin/python3 \
--conf spark.pyspark.driver.python=/opt/miniconda3/envs/ei/bin/python3 \
--archives hdfs:///user/svc-edl_ei_dld/outfront/pythonLibraries/ei.tar.gz#miniconda3 \
--py-files python/file.py python/main_file.py $inputpath $assetParquetPath $speedFilterpath $outputpath $partition

Error

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5) (tbldjftv0072d-hdp.verizon.com executor 3): java.io.IOException: Cannot run program "/opt/miniconda3/envs/ei/bin/python3": error=2, No such file or directory

Environment details (image not preserved)

Answer #1, by 7nbnzgx9

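We don't need an HDFS location for the conda environment; we can simply use --archives /local/path/conda/env. The job may be failing because, in the driver, you have not set the PySpark Python path to point to the conda env.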

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./miniconda3/bin/python3
or --conf spark.pyspark.python=./miniconda3/bin/python3
--archives ei.tar.gz#miniconda3
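
The relative interpreter path in these settings follows from the `#miniconda3` suffix: YARN unpacks the archive into each container's working directory under that alias, so executors address the interpreter relative to it instead of through the absolute `/opt/miniconda3/...` path that the error reports as missing on the executor host. A minimal sketch of that mapping, assuming the archive was built with conda-pack so that `bin/python3` sits at the archive root (the function name `executor_python_path` is hypothetical, not part of Spark):

```shell
# Derive the executor-side interpreter path from an --archives spec of the
# form <path>#<alias>. Assumes bin/python3 is at the root of the archive.
executor_python_path() {
    spec=$1
    alias_name=${spec##*#}   # keep only the text after the last '#'
    printf './%s/bin/python3\n' "$alias_name"
}

executor_python_path 'ei.tar.gz#miniconda3'

# To confirm the layout of a local copy of the archive before submitting:
#   tar -tzf ei.tar.gz | grep 'bin/python3'
```

If the `tar` listing shows the interpreter under an extra directory level (for example because the environment directory itself was archived), the relative path would need that extra component as well.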

If you still want to use HDFS, you need to add:

--conf spark.yarn.dist.archives=hdfs:///user/svc-edl_ei_dld/outfront/pythonLibraries/ei.tar.gz#miniconda3
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./miniconda3/bin/python3
or --conf spark.pyspark.python=./miniconda3/bin/python3
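
Combining these settings with the original run_job.sh, one possible submit command for client mode is sketched below. It assumes the driver keeps using the interpreter installed on the submitting host while executors use the unpacked archive, and that `spark.pyspark.python` and `spark.pyspark.driver.python` take precedence over the `PYSPARK_PYTHON` exports in job.sh. The `DRY_RUN` switch is a hypothetical addition so the command can be inspected before it is run, and `--archives` is kept from the original script as the command-line form of the archive setting.

```shell
# Sketch of run_job.sh with separate driver and executor interpreters.
ARCHIVE='hdfs:///user/svc-edl_ei_dld/outfront/pythonLibraries/ei.tar.gz#miniconda3'
DRIVER_PYTHON='/opt/miniconda3/envs/ei/bin/python3'   # absolute path on the submitting host
EXECUTOR_PYTHON="./${ARCHIVE##*#}/bin/python3"        # relative to the YARN container working directory

# Same positional arguments as the original script ($1 is the run date).
inputpath=$2
assetParquetPath=$3
speedFilterpath=$4
outputpath=$5
partition=$6

# Build the command as the positional parameters so it can be printed or run.
set -- spark3-submit \
    --master yarn \
    --deploy-mode client \
    --num-executors 10 \
    --driver-memory 8g \
    --executor-memory 8g \
    --executor-cores 4 \
    --conf spark.sql.shuffle.partitions=6000 \
    --conf spark.sql.autoBroadcastJoinThreshold=-1 \
    --conf "spark.pyspark.driver.python=$DRIVER_PYTHON" \
    --conf "spark.pyspark.python=$EXECUTOR_PYTHON" \
    --archives "$ARCHIVE" \
    --py-files python/file.py \
    python/main_file.py "$inputpath" "$assetParquetPath" "$speedFilterpath" "$outputpath" "$partition"

if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"   # print the command instead of running it
else
    "$@"
fi
```

Running the script with `DRY_RUN=0` submits the job; by default it only prints the assembled command.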
