Error when running the pyspark shell or from a Jupyter notebook

vwkv1x7d · posted 2021-07-12 in Spark

I am trying to run the pyspark shell, but when I execute the following:

(test3.8python) [test@JupyterHub ~]$ python3 /home/test/spark3.1.1/bin/pyspark

I get the following error:

File "/home/test/spark3.1.1/bin/pyspark", line 20  
if [ -z "${SPARK_HOME}" ]; then
        ^
SyntaxError: invalid syntax
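
For reference, the line the traceback points at (if [ -z "${SPARK_HOME}" ]; then) is shell syntax, which suggests bin/pyspark is a shell launcher being handed to the Python interpreter. A minimal, hypothetical check (path assumed from the post) that reads the script's first line from Python:

# Hypothetical check, not from the original post: if bin/pyspark is a shell
# script, passing it to python3 will fail on the first bash construct it meets.
with open("/home/test/spark3.1.1/bin/pyspark") as launcher:
    print(launcher.readline().strip())  # a shell shebang here would explain the SyntaxError

If the first line names a shell rather than Python, the launcher is meant to be executed directly by the shell instead of being passed to python3.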

I have set the following in my ~/.bashrc:

export SPARK_HOME=/home/test/spark3.1.1
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=python3
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
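
A quick, hypothetical way to confirm that these variables actually reach the interpreter that will import pyspark (variable names taken from the snippet above) could look like this:

# Hypothetical sanity check: print the variables exported in ~/.bashrc and the
# sys.path entries they should produce, from the same python3 used later on.
import os
import sys

for var in ("SPARK_HOME", "PYTHONPATH", "PYSPARK_PYTHON", "JAVA_HOME"):
    print(var, "=", os.environ.get(var))

# entries from PYTHONPATH end up on sys.path, so the Spark python dir should appear here
print([p for p in sys.path if "spark" in p.lower()])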

If I try to run it from a Jupyter notebook like this:

import pyspark
from pyspark.sql import SparkSession

# starting daemons for standalone

!/home/test/spark3.1.1/sbin/start-master.sh
!/home/test/spark3.1.1/sbin/start-worker.sh spark://JupyterHub:7077

# spark standalone

spark = SparkSession.builder \
        .appName("test") \
        .master("spark://JupyterHub:7077")\
        .config("spark.cores.max","5")\
        .config("spark.executor.memory","2g")\
        .config("spark.jars.packages",'org.elasticsearch:elasticsearch-spark-30_2.12:7.12-SNAPSHOT')\
        .config("spark.executor.cores","5")\
        .enableHiveSupport() \
        .getOrCreate()

I get the following error:

ModuleNotFoundError: No module named 'pyspark'
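
One thing worth noting is that a kernel spawned by JupyterHub does not necessarily source ~/.bashrc, so the PYTHONPATH exported there may never reach the notebook process. A hypothetical check, run in a notebook cell:

# Hypothetical check inside the notebook kernel: if PYTHONPATH from ~/.bashrc
# was not inherited, pyspark will not be importable here even though a login shell sees it.
import os
import sys

print("PYTHONPATH =", os.environ.get("PYTHONPATH"))
print("SPARK_HOME =", os.environ.get("SPARK_HOME"))
print([p for p in sys.path if "spark" in p.lower()])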

But I don't understand why, since in my bashrc I pointed PYTHONPATH at the Python files inside the Spark folder and made sure the changes took effect. I also poked around and tried the findspark library, and now, if I run the code with these imports added:

import findspark
spark_location='/home/test/spark3.1.1/' 
findspark.init(spark_home=spark_location)
import pyspark
from pyspark.sql import SparkSession

# starting daemons for standalone

!/home/test/spark3.1.1/sbin/start-master.sh
!/home/test/spark3.1.1/sbin/start-worker.sh spark://JupyterHub:7077

# spark standalone

spark = SparkSession.builder \
        .appName("test") \
        .master("spark://JupyterHub:7077")\
        .config("spark.cores.max","5")\
        .config("spark.executor.memory","2g")\
        .config("spark.jars.packages",'org.elasticsearch:elasticsearch-spark-30_2.12:7.12.0-SNAPSHOT')\
        .config("spark.executor.cores","5")\
        .enableHiveSupport() \
        .getOrCreate()
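
For context, a rough, hypothetical sketch of what findspark.init() is doing here (a manual equivalent, with the path assumed from the post):

# Hypothetical manual equivalent of findspark.init(): export SPARK_HOME for the
# current process and put the Spark python sources plus the bundled py4j zip on sys.path.
import glob
import os
import sys

spark_home = "/home/test/spark3.1.1"
os.environ["SPARK_HOME"] = spark_home
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])

import pyspark  # resolvable once sys.path is patched in-process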

It now seems able to find pyspark, which is odd because I had already specified everything in the bashrc file and already set SPARK_HOME, but then I get a different error:

starting org.apache.spark.deploy.master.Master, logging to /home/test/spark3.1.1//logs/spark-test-org.apache.spark.deploy.master.Master-1-JupyterHub.out
starting org.apache.spark.deploy.worker.Worker, logging to /home/test/spark3.1.1//logs/spark-test-org.apache.spark.deploy.worker.Worker-1-JupyterHub.out

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-7-7d402e7d71bf> in <module>
     10 
     11 #spark standalone
---> 12 spark = SparkSession.builder \
     13         .appName("test") \
     14         .master("spark://JupyterHub:7077")\

~/spark3.1.1/python/pyspark/sql/session.py in getOrCreate(self)
    226                             sparkConf.set(key, value)
    227                         # This SparkContext may be an existing one.
--> 228                         sc = SparkContext.getOrCreate(sparkConf)
    229                     # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    230                     # by all sessions.

~/spark3.1.1/python/pyspark/context.py in getOrCreate(cls, conf)
    382         with SparkContext._lock:
    383             if SparkContext._active_spark_context is None:
--> 384                 SparkContext(conf=conf or SparkConf())
    385             return SparkContext._active_spark_context
    386 

~/spark3.1.1/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    142                 " is not allowed as it is a security risk.")
    143 
--> 144         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    145         try:
    146             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

~/spark3.1.1/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    329         with SparkContext._lock:
    330             if not SparkContext._gateway:
--> 331                 SparkContext._gateway = gateway or launch_gateway(conf)
    332                 SparkContext._jvm = SparkContext._gateway.jvm
    333 

~/spark3.1.1/python/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
    106 
    107             if not os.path.isfile(conn_info_file):
--> 108                 raise Exception("Java gateway process exited before sending its port number")
    109 
    110             with open(conn_info_file, "rb") as info:

Exception: Java gateway process exited before sending its port number
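
The "Java gateway process exited before sending its port number" message generally means the JVM side died before reporting its port back to Python, which tends to be an environment problem rather than a cluster one. A hypothetical check from a notebook cell to see whether the kernel process can find Java at all:

# Hypothetical check: confirm the kernel process sees JAVA_HOME and can launch java;
# the gateway error above is raised before any connection to the master is attempted.
import os
import shutil
import subprocess

print("JAVA_HOME =", os.environ.get("JAVA_HOME"))
print("java on PATH:", shutil.which("java"))
if shutil.which("java"):
    subprocess.run(["java", "-version"], check=False)  # prints the JVM version to stderr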

I have already checked JupyterHub:7077 on the default 8080 web UI port and everything is alive, so the master and the worker did start successfully.
Even when running Spark in local mode with master("local[*]") I get the same error as above.
I am completely lost. Any idea why I cannot run pyspark either from the shell or from a Jupyter notebook?
Thanks
