I need to set up PySpark on my machine to access and read data on a remote Hadoop cluster, but I am running into some problems.
These are the steps I followed:
1) `brew install apache-spark`

2)

```
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.6.1
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
```
3)

```
export HADOOP_USER_NAME=hdfs
export HADOOP_CONF_DIR=yarnconfig
```
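As a side note, a quick stdlib-only diagnostic (just a sketch, not part of the setup) can confirm that every entry added to `PYTHONPATH` above actually exists on disk; a py4j zip name that does not match the installed Spark version is a common cause of launch failures:

```python
# Diagnostic sketch (stdlib only): report each PYTHONPATH entry and whether
# it exists on disk.
import os

def check_pythonpath(pythonpath):
    """Return a dict mapping each non-empty entry to whether it exists."""
    entries = [e for e in pythonpath.split(os.pathsep) if e]
    return {e: os.path.exists(e) for e in entries}

for entry, exists in check_pythonpath(os.environ.get("PYTHONPATH", "")).items():
    print(entry, "OK" if exists else "MISSING")
```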
In yarnconfig I have this yarn-site.xml:
```
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>{Hadoop_Cluster_IP}</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:8050</value>
  </property>
</configuration>
```
Here `{Hadoop_Cluster_IP}` is a placeholder for the IP address of the Hadoop cluster I am trying to connect to, which I am not showing for security reasons.
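To double-check the config file itself, here is a small stdlib-only sketch that parses a yarn-site.xml like the one above and expands the `${yarn.resourcemanager.hostname}` reference the way Hadoop's `Configuration` class does (`10.0.0.1` stands in for `{Hadoop_Cluster_IP}`):

```python
# Sketch (stdlib only): parse <property><name>/<value> pairs from a
# yarn-site.xml snippet and expand ${...} references against the other
# properties in the same file.
import re
import xml.etree.ElementTree as ET

def yarn_properties(xml_text):
    """Return all properties with ${...} references resolved."""
    props = {
        p.findtext("name"): p.findtext("value")
        for p in ET.fromstring(xml_text).findall("property")
    }
    def expand(value):
        return re.sub(
            r"\$\{([^}]+)\}", lambda m: props.get(m.group(1), m.group(0)), value
        )
    return {name: expand(value) for name, value in props.items()}

sample = """<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>10.0.0.1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:8050</value>
  </property>
</configuration>"""

print(yarn_properties(sample)["yarn.resourcemanager.address"])  # 10.0.0.1:8050
```

If the expanded `yarn.resourcemanager.address` does not come out as `host:8050`, the config file is the first thing to fix.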
Then, in the Python shell:
```
from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local").setAppName("LogParser")
sc = SparkContext(conf=conf)
```
But I get the following error message:
```
/usr/local/Cellar/apache-spark/1.6.1/bin/load-spark-env.sh: line 2: /usr/local/Cellar/apache-spark/1.6.1/libexec/bin/load-spark-env.sh: Permission denied
/usr/local/Cellar/apache-spark/1.6.1/bin/load-spark-env.sh: line 2: exec: /usr/local/Cellar/apache-spark/1.6.1/libexec/bin/load-spark-env.sh: cannot execute: Undefined error: 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/conf.py", line 104, in __init__
    SparkContext._ensure_initialized()
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/context.py", line 245, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/java_gateway.py", line 94, in launch_gateway
    raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
```
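The first two lines of the output say load-spark-env.sh cannot be executed, so the Java gateway never starts. A quick stdlib sketch to check the execute bits (path copied from the error above; adjust for your install):

```python
# Diagnostic sketch (stdlib only): check whether a file has any execute bit
# set; the path below is the script named in the "Permission denied" error.
import os
import stat

def executable_bits(path):
    """True if the file exists and has any execute bit, else False."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))

script = "/usr/local/Cellar/apache-spark/1.6.1/libexec/bin/load-spark-env.sh"
print(executable_bits(script))
```

If this prints False, restoring execute permission on that script (e.g. with `chmod +x`) seems worth trying before anything else, though I'm not sure it's the whole story.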
Do you have any idea what is going wrong?