How can PySpark running on Docker read data from Cassandra?

bvjxkvbb · published 2021-05-19 in Spark
Answers (0) | Views (350)

I am running the Cassandra and PySpark containers on the same Docker network (network name: test-cs) with the following commands:

docker run --name cassandra -v $HOME/Documents/datastax/cassandra:/var/lib/cassandra --network test-cs -d datastax/cassandra:4.0

docker run --name pyspark -p 8888:8888 -p 4040:4040 -p 4041:4041 -p 4042:4042 -e CHOWN_HOME=yes -e GRANT_SUDO=yes -e NB_UID=1000 -e NB_GID=100 -v $HOME/Documents/spark:/home/jovyan/work --network test-cs jupyter/pyspark-notebook
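Name-based resolution (using `cassandra` as a hostname) only works on a user-defined network, so `test-cs` must be created before both `docker run` commands. A minimal sketch, assuming the network does not exist yet:

```shell
# Create the user-defined bridge network both containers join;
# user-defined networks provide DNS resolution by container name.
docker network create test-cs

# After starting both containers, confirm they are attached:
docker network inspect test-cs --format '{{range .Containers}}{{.Name}} {{end}}'
```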

I want to read data from a table in Cassandra, so I use the following PySpark code in a Jupyter notebook to connect Spark to Cassandra:


# Configurations related to the Cassandra connector & cluster

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=cassandra pyspark-shell'

Note that for the `spark.cassandra.connection.host` setting I use the value `cassandra` (the Cassandra container name) rather than an IP address (127.0.0.1).
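One common cause of failures with this setup is a Scala-version mismatch: the `_2.11` suffix in the `--packages` coordinate must match the Scala build of the Spark inside the `jupyter/pyspark-notebook` image, and recent images ship Spark 3.x built for Scala 2.12. A minimal sketch of how the coordinate is assembled; `connector_coordinate` is a hypothetical helper for illustration, not part of any library:

```python
# Hypothetical helper: build the Maven coordinate passed to --packages.
# The Scala suffix (e.g. "2.12") must match the Scala version your
# Spark distribution was built with -- check spark.version in the notebook.
def connector_coordinate(scala_version: str, connector_version: str) -> str:
    return f"com.datastax.spark:spark-cassandra-connector_{scala_version}:{connector_version}"

# For a Spark 3.x / Scala 2.12 image, a matching coordinate would be:
print(connector_coordinate("2.12", "3.0.0"))
# com.datastax.spark:spark-cassandra-connector_2.12:3.0.0
```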


# Creating PySpark Context

from pyspark import SparkContext
sc = SparkContext("local", "movie lens app")

# Creating PySpark SQL Context

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Load and return a DataFrame for the given keyspace and table

def load_and_get_table_df(keys_space_name, table_name):
    table_df = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .options(table=table_name, keyspace=keys_space_name)\
        .load()
    return table_df

# Loading movies & ratings table data frames

movies = load_and_get_table_df("movie_lens", "movies")
ratings = load_and_get_table_df("movie_lens", "ratings")

After running the code above I get an error: Spark cannot connect to Cassandra or read data from it. Please help me, as I am a complete beginner with PySpark and Cassandra.

No answers yet.
