Summary
I am trying to execute a simple Python code snippet (reproduced in the shell session below) in an all-spark notebook, which is supposed to run against a local Spark cluster that I set up in a docker-compose file. However, I get the error ModuleNotFoundError: No module named 'pyspark'.
This makes no sense to me, because in this Dockerfile (which I took from the Docker repo's documentation), I explicitly install pyspark with pip.
Steps to reproduce the error
# Clone the repository and checkout a specific commit
kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ git clone https://github.com/kevinsuedmersen/hadoop-sandbox.git
kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ git checkout e0a061dd3a60842aa0e93893892c7e0844c2278a
# Install and start all services
kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ docker-compose up -d
# Entering the container running the notebooks
kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ docker exec -it jupyter-spark bash
# Activating the custom python environment installed in the above referenced Dockerfile
(base) jovyan@XXX:~$ conda activate python37
# Start a jupyter notebook server
(python37) jovyan@XXX:~$ jupyter notebook
# After some logging, the following output shows
To access the notebook, open this file in a browser:
file:///home/jovyan/.local/share/jupyter/runtime/nbserver-27913-open.html
Or copy and paste one of these URLs:
http://b8ef36545270:8889/?token=some_token
or http://127.0.0.1:8889/?token=some_token
Then, I click on the URL http://127.0.0.1:8889/?token=some_token to open the Jupyter GUI in my browser, execute the simple Python code snippet, and get the error explained above.
What I tried
To check whether pyspark is actually installed, I basically just tried executing the same simple Python code snippet in a shell inside the jupyter-spark container, and surprisingly, it worked. Specifically, I executed the following commands in a new shell:
# Entering into the jupyter-spark container and activating the custom python environment
kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ docker exec -it jupyter-spark bash
(base) jovyan@XXX:~$ conda activate python37
# Opening a python shell
(python37) jovyan@XXX:~$ python
# Copy pasting the same commands from the notebook into the shell
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master('spark://spark-master:7077').getOrCreate()
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(range(100 + 1))
>>> rdd.sum()
5050
Additionally, I noticed that executing the following in the notebook
! python --version
prints Python 3.8.8, i.e. the notebook kernel is not using the python37 environment in which pyspark is installed.
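A quick way to double-check which interpreter the notebook kernel is actually running is to print sys.executable from a cell; this is plain standard-library Python, nothing specific to this setup:
# Run in a notebook cell to see which Python backs the kernel
import sys
print(sys.executable)  # path of the interpreter the kernel is using
print(sys.version)     # the version string, e.g. 3.8.8 here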
So, my question is: how do I make the notebook use the custom Python environment?
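For context, a common way to make a conda environment selectable as a notebook kernel is to register it via ipykernel; a minimal sketch, assuming the environment is named python37 as above (this uses the standard ipykernel CLI and is not taken from the repository):
# Inside the jupyter-spark container, with the custom environment active
(python37) jovyan@XXX:~$ pip install ipykernel
(python37) jovyan@XXX:~$ python -m ipykernel install --user --name python37 --display-name "Python 3.7 (python37)"
# The new kernel should then appear under Kernel > Change kernel in the notebook UI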
1 Answer
So apparently, the following workaround works:
Change the Dockerfile of the jupyter-spark service to something as simple as:
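The answer's actual Dockerfile is not shown above; purely as an illustration, a simplified Dockerfile in the spirit described, building directly on the jupyter/all-spark-notebook base image so the notebook kernel and pyspark live in the same default environment, might look like this (image name and tag are assumptions):
# Hypothetical sketch, not the answer's actual file
FROM jupyter/all-spark-notebook:latest
# No custom conda environment: the notebook kernel then runs in the image's
# default Python environment, into which pyspark can be pip-installed directly
RUN pip install --no-cache-dir pyspark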
The service definition in the docker-compose.yml file changes accordingly (a sketch follows below). The current working state of the repository with the above changes can be seen here.
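The compose snippet is likewise not reproduced here; a minimal sketch of what the jupyter-spark service definition might look like after the change, where the build path is an assumption and the container name, port, and spark-master reference come from the commands and output above:
# Hypothetical sketch of the service in docker-compose.yml
jupyter-spark:
  build: ./jupyter-spark          # assumed location of the simplified Dockerfile
  container_name: jupyter-spark   # matches the docker exec commands above
  ports:
    - "8889:8889"                 # the notebook port seen in the server output
  depends_on:
    - spark-master                # the master used in the SparkSession URL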