I want to run my Spark job in Airflow. From the link https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/contrib/operators/spark_submit_operator/index.html, I am not sure I correctly understand this sentence: "This hook is a wrapper around the spark-submit binary to kick off a spark-submit job. It requires that the 'spark-submit' binary is in the PATH or the spark-home is set in the extra on the connection." I have the SPARK_HOME environment variable set, but what exactly is the spark-submit binary? Is it the .py file of the Spark job? Another solution seems to be to add something to the connection's Extra field in the Airflow webserver UI. What should go there in that case?
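For context, the task is defined roughly like this (a minimal sketch; the argument values are reconstructed from the log output below, and dag is my DAG object with dag_id market_data):

from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

# Submit the ETL script through the spark_default connection.
spark_submit_job = SparkSubmitOperator(
    task_id="spark_submit_job",
    application="/home/ubuntu/market_risk/utils/etl.py",
    name="airflow-spark",
    conn_id="spark_default",
    dag=dag,
)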
Any help would be greatly appreciated.
Edit: The error I am getting is:
[2020-12-07 01:12:58,817] {base_hook.py:89} INFO - Using connection to: id: spark_default. Host: localhost, Port: None, Schema: None, Login: None, Password: None, extra: XXXXXXXX
[2020-12-07 01:12:58,821] {spark_submit_hook.py:325} INFO - Spark-Submit cmd: spark-submit --master localhost --name airflow-spark --queue root.default /home/ubuntu/market_risk/utils/etl.py
[2020-12-07 01:12:58,874] {spark_submit_hook.py:479} INFO - Could not find valid SPARK_HOME while searching ['/', '/home/ubuntu/.local/bin']
[2020-12-07 01:12:58,874] {spark_submit_hook.py:479} INFO -
[2020-12-07 01:12:58,874] {spark_submit_hook.py:479} INFO - Did you install PySpark via a package manager such as pip or Conda? If so,
[2020-12-07 01:12:58,875] {spark_submit_hook.py:479} INFO - PySpark was not found in your Python environment. It is possible your
[2020-12-07 01:12:58,875] {spark_submit_hook.py:479} INFO - Python environment does not properly bind with your package manager.
[2020-12-07 01:12:58,875] {spark_submit_hook.py:479} INFO -
[2020-12-07 01:12:58,875] {spark_submit_hook.py:479} INFO - Please check your default 'python' and if you set PYSPARK_PYTHON and/or
[2020-12-07 01:12:58,875] {spark_submit_hook.py:479} INFO - PYSPARK_DRIVER_PYTHON environment variables, and see if you can import
[2020-12-07 01:12:58,875] {spark_submit_hook.py:479} INFO - PySpark, for example, 'python -c 'import pyspark'.
[2020-12-07 01:12:58,875] {spark_submit_hook.py:479} INFO -
[2020-12-07 01:12:58,875] {spark_submit_hook.py:479} INFO - If you cannot import, you can install by using the Python executable directly,
[2020-12-07 01:12:58,875] {spark_submit_hook.py:479} INFO - for example, 'python -m pip install pyspark [--user]'. Otherwise, you can also
[2020-12-07 01:12:58,875] {spark_submit_hook.py:479} INFO - explicitly set the Python executable, that has PySpark installed, to
[2020-12-07 01:12:58,876] {spark_submit_hook.py:479} INFO - PYSPARK_PYTHON or PYSPARK_DRIVER_PYTHON environment variables, for example,
[2020-12-07 01:12:58,876] {spark_submit_hook.py:479} INFO - 'PYSPARK_PYTHON=python3 pyspark'.
[2020-12-07 01:12:58,876] {spark_submit_hook.py:479} INFO -
[2020-12-07 01:12:58,878] {spark_submit_hook.py:479} INFO - /home/ubuntu/.local/bin/spark-submit: line 27: /bin/spark-class: No such file or directory
[2020-12-07 01:12:58,909] {taskinstance.py:1150} ERROR - Cannot execute: spark-submit --master localhost --name airflow-spark --queue root.default /home/ubuntu/market_risk/utils/etl.py. Error code is: 127.
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/ubuntu/.local/lib/python3.6/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 187, in execute
self._hook.submit(self._application)
File "/home/ubuntu/.local/lib/python3.6/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 405, in submit
self._mask_cmd(spark_submit_cmd), returncode
airflow.exceptions.AirflowException: Cannot execute: spark-submit --master localhost --name airflow-spark --queue root.default /home/ubuntu/market_risk/utils/etl.py. Error code is: 127.
[2020-12-07 01:12:58,912] {taskinstance.py:1194} INFO - Marking task as FAILED. dag_id=market_data, task_id=spark_submit_job, execution_date=20201207T010725, start_date=20201207T011258, end_date=20201207T011258
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/ubuntu/.local/lib/python3.6/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 187, in execute
self._hook.submit(self._application)
File "/home/ubuntu/.local/lib/python3.6/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 405, in submit
self._mask_cmd(spark_submit_cmd), returncode
airflow.exceptions.AirflowException: Cannot execute: spark-submit --master localhost --name airflow-spark --queue root.default /home/ubuntu/market_risk/utils/etl.py. Error code is: 127.
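Since exit code 127 means "command not found", I checked what the worker environment actually resolves (a quick diagnostic sketch, meant to be run as the same user and with the same environment as the Airflow worker):

import os
import shutil

# Which spark-submit, if any, is on this environment's PATH.
print("spark-submit resolves to:", shutil.which("spark-submit"))
# The variable the pip-installed wrapper complains about in the log above.
print("SPARK_HOME =", os.environ.get("SPARK_HOME"))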
Edit:
I had two Spark installations, one manual (located at /opt/spark) and one installed via pip. I removed the pip install, so the Spark at /home/ubuntu/.local/bin is no longer a problem, but Airflow still cannot find the other, manually installed Spark (I can still run spark-submit file.py directly from the folder where the project lives). The new error is:
[2020-12-07 09:52:06,591] {base_hook.py:89} INFO - Using connection to: id: spark_local. Host: local[*], Port: None, Schema: None, Login: None, Password: None, extra: XXXXXXXX
[2020-12-07 09:52:06,593] {spark_submit_hook.py:325} INFO - Spark-Submit cmd: spark-submit --master local[*] --name airflow-spark --queue root.default /home/ubuntu/market_risk/utils/etl.py
[2020-12-07 09:52:06,599] {taskinstance.py:1150} ERROR - [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/ubuntu/.local/lib/python3.6/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 187, in execute
self._hook.submit(self._application)
File "/home/ubuntu/.local/lib/python3.6/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 395, in submit
**kwargs)
File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'
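One thing I plan to try, based on my reading of the contrib SparkSubmitHook: it only falls back to a bare spark-submit on the PATH when no spark-home is set in the connection's Extra, so pointing the Extra at the manual install should make it build the full path to the binary. A sketch of updating the connection programmatically (my assumption for Airflow 1.10; /opt/spark is my manual install):

import json
from airflow import settings
from airflow.models import Connection

session = settings.Session()
conn = session.query(Connection).filter(Connection.conn_id == "spark_local").one()
# With spark-home set, the hook should invoke /opt/spark/bin/spark-submit
# instead of relying on PATH resolution.
conn.extra = json.dumps({"spark-home": "/opt/spark", "queue": "root.default"})
session.commit()

The same JSON could equally be pasted into the Extra field of the spark_local connection in the web UI.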