How do I install pyspark.pandas in Apache Spark?

63lcw9qa, posted on 2022-12-02 in Apache

I downloaded the Apache Spark 3.3.0 package, which includes pyspark:

$ pyspark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Python version 3.7.10 (default, Jun  3 2021 00:02:01)
Spark context Web UI available at http://XXX-XXX-XXX-XXXX.compute.internal:4041
Spark context available as 'sc' (master = local[*], app id = local-1669908157343).
SparkSession available as 'spark'.
>>> import pyspark.pandas as ps
Traceback (most recent call last):
  File "/home/ec2-user/bin/spark/latest/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
    import pandas
ModuleNotFoundError: No module named 'pandas'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ec2-user/bin/spark/latest/python/pyspark/pandas/__init__.py", line 31, in <module>
    require_minimum_pandas_version()
  File "/home/ec2-user/bin/spark/latest/python/pyspark/sql/pandas/utils.py", line 36, in require_minimum_pandas_version
    ) from raised_error
ImportError: Pandas >= 1.0.5 must be installed; however, it was not found.

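How can I install Python packages so that the Spark installation in a custom directory, such as /home/ec2-user/bin/spark/latest/python/pyspark, can import them?
I also tried installing pandas with pip, but pip is not available:

$ pip install pandas
-bash: pip: command not found
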
If I install pip, how do I make sure the libraries it installs are compatible with the Python version that Spark uses (3.7.10)?

qyzbxkaa  #1

Have you tried installing pandas like this?

pip install pyspark[pandas_on_spark]

If bash cannot find pip, try activating your Python environment first (virtualenv, conda, or whatever you use).
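
To make the version-compatibility part concrete, here is a minimal sketch, not a definitive recipe. It assumes you run it with the same interpreter that PySpark uses (the one reported in the pyspark banner, or the one named in the PYSPARK_PYTHON environment variable if you set it). The helper names pip_install_command and module_status are made up for illustration. Running pip as "<python> -m pip" rather than a bare "pip" ties the install to that interpreter, so pip selects package builds that match its Python version.

```python
import importlib
import sys


def pip_install_command(packages, python=sys.executable):
    """Build a `<python> -m pip install ...` command for one specific interpreter.

    Hypothetical helper: going through the interpreter instead of a bare `pip`
    makes pip install into, and resolve versions for, that interpreter.
    """
    return [python, "-m", "pip", "install", *packages]


def module_status(name):
    """Return (True, version) if `name` imports, else (False, error message)."""
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        return False, str(exc)
    return True, getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    # Show which interpreter is running and the command that targets it.
    print("Interpreter:", sys.executable, sys.version.split()[0])
    print("Install with:", " ".join(pip_install_command(["pandas"])))
    # Report whether pandas is importable from this interpreter.
    print("pandas:", module_status("pandas"))
```

If the interpreter itself has no pip, "python3 -m ensurepip" may be able to bootstrap it, depending on how that Python was packaged. After installing, rerunning the script (or "import pyspark.pandas as ps" in the pyspark shell) shows whether pandas is now visible to that interpreter.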