I am trying to run the PySpark shell, but when I execute the following:
(test3.8python) [test@JupyterHub ~]$ python3 /home/test/spark3.1.1/bin/pyspark
I get this error:
File "/home/test/spark3.1.1/bin/pyspark", line 20
if [ -z "${SPARK_HOME}" ]; then
^
SyntaxError: invalid syntax
I have set the following in my ~/.bashrc:
export SPARK_HOME=/home/test/spark3.1.1
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=python3
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
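To double-check that these exports are actually visible to the Python interpreter behind the notebook kernel, a minimal check (nothing Spark-specific, it just prints what the kernel's environment contains) looks like this:
import os
# print what the running interpreter actually sees for the relevant variables
for var in ("SPARK_HOME", "PYTHONPATH", "PYSPARK_PYTHON", "JAVA_HOME"):
    print(var, "=", os.environ.get(var))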
If I try to run it from a Jupyter notebook like this:
import pyspark
from pyspark.sql import SparkSession
# starting daemons for standalone
!/home/test/spark3.1.1/sbin/start-master.sh
!/home/test/spark3.1.1/sbin/start-worker.sh spark://JupyterHub:7077
# spark standalone
spark = SparkSession.builder \
.appName("test") \
.master("spark://JupyterHub:7077")\
.config("spark.cores.max","5")\
.config("spark.executor.memory","2g")\
.config("spark.jars.packages",'org.elasticsearch:elasticsearch-spark-30_2.12:7.12-SNAPSHOT')\
.config("spark.executor.cores","5")\
.enableHiveSupport() \
.getOrCreate()
I get this error:
ModuleNotFoundError: No module named 'pyspark'
But I don't understand why, since in my bashrc I pointed PYTHONPATH at the Python files inside the Spark folder and made sure the changes were applied. I was also messing around trying to use the findspark library, and now, if I run this code with the added imports:
import findspark
spark_location='/home/test/spark3.1.1/'
findspark.init(spark_home=spark_location)
import pyspark
from pyspark.sql import SparkSession
# starting daemons for standalone
!/home/test/spark3.1.1/sbin/start-master.sh
!/home/test/spark3.1.1/sbin/start-worker.sh spark://JupyterHub:7077
# spark standalone
spark = SparkSession.builder \
.appName("test") \
.master("spark://JupyterHub:7077")\
.config("spark.cores.max","5")\
.config("spark.executor.memory","2g")\
.config("spark.jars.packages",'org.elasticsearch:elasticsearch-spark-30_2.12:7.12.0-SNAPSHOT')\
.config("spark.executor.cores","5")\
.enableHiveSupport() \
.getOrCreate()
it looks like it is able to find pyspark, which is odd because I already specified everything in my bash file and already set SPARK_HOME, but now I get a different error:
starting org.apache.spark.deploy.master.Master, logging to /home/test/spark3.1.1//logs/spark-test-org.apache.spark.deploy.master.Master-1-JupyterHub.out
starting org.apache.spark.deploy.worker.Worker, logging to /home/test/spark3.1.1//logs/spark-test-org.apache.spark.deploy.worker.Worker-1-JupyterHub.out
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-7-7d402e7d71bf> in <module>
10
11 #spark standalone
---> 12 spark = SparkSession.builder \
13 .appName("test") \
14 .master("spark://JupyterHub:7077")\
~/spark3.1.1/python/pyspark/sql/session.py in getOrCreate(self)
226 sparkConf.set(key, value)
227 # This SparkContext may be an existing one.
--> 228 sc = SparkContext.getOrCreate(sparkConf)
229 # Do not update `SparkConf` for existing `SparkContext`, as it's shared
230 # by all sessions.
~/spark3.1.1/python/pyspark/context.py in getOrCreate(cls, conf)
382 with SparkContext._lock:
383 if SparkContext._active_spark_context is None:
--> 384 SparkContext(conf=conf or SparkConf())
385 return SparkContext._active_spark_context
386
~/spark3.1.1/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
142 " is not allowed as it is a security risk.")
143
--> 144 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
145 try:
146 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
~/spark3.1.1/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
329 with SparkContext._lock:
330 if not SparkContext._gateway:
--> 331 SparkContext._gateway = gateway or launch_gateway(conf)
332 SparkContext._jvm = SparkContext._gateway.jvm
333
~/spark3.1.1/python/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
106
107 if not os.path.isfile(conn_info_file):
--> 108 raise Exception("Java gateway process exited before sending its port number")
109
110 with open(conn_info_file, "rb") as info:
Exception: Java gateway process exited before sending its port number
I have checked JupyterHub:7077 through the web UI on the default port 8080, and everything is alive, so the master and the worker were started successfully.
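For completeness, the "everything is alive" check boils down to something like this (a minimal sketch that just requests the standalone master's web UI on its default port 8080 for the JupyterHub host used above):
import urllib.request
# quick sanity check that the standalone master web UI responds (8080 is the default web UI port)
with urllib.request.urlopen("http://JupyterHub:8080") as resp:
    print(resp.status)  # 200 means the master UI is reachable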
Even when running Spark in local mode with master("local[*]") I get the same error as above.
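By "local mode" I mean a minimal session along these lines (the appName "test-local" is just a placeholder), which still fails with the same Java gateway exception:
# purely local session, no standalone cluster involved
spark = SparkSession.builder \
    .appName("test-local") \
    .master("local[*]") \
    .getOrCreate()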
I am completely lost. Do you have any idea why I cannot run pyspark either from the shell or from a Jupyter notebook?
Thanks