I am running Cassandra and PySpark containers on the same Docker network (network name: test-cs), started with the following commands:
docker run --name cassandra -v $HOME/Documents/datastax/cassandra:/var/lib/cassandra --network test-cs -d datastax/cassandra:4.0
docker run --name pyspark -p 8888:8888 -p 4040:4040 -p 4041:4041 -p 4042:4042 -e CHOWN_HOME=yes -e GRANT_SUDO=yes -e NB_UID=1000 -e NB_GID=100 -v $HOME/Documents/spark:/home/jovyan/work --network test-cs jupyter/pyspark-notebook
I want to read data from a table in Cassandra, so I use the following PySpark code in a Jupyter notebook to connect Spark to Cassandra:
# Configuration related to the Cassandra connector & cluster
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=cassandra pyspark-shell'
Note that for the "spark.cassandra.connection.host" parameter in this code I use the value cassandra (the name of the Cassandra container) rather than an IP address (127.0.0.1).
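Using the container name as the host relies on Docker's DNS for user-defined networks resolving that name from inside the pyspark container. A minimal sketch for checking this from a notebook cell, independent of Spark, is shown here. The helper name `can_connect` is hypothetical, and port 9042 is an assumption (it is Cassandra's default CQL port and would differ if the container was configured otherwise).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name, then opens a TCP socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures (socket.gaierror), refused connections and timeouts
        return False
```

Calling `can_connect("cassandra", 9042)` in the notebook should return True if the name resolves and Cassandra is accepting connections; False would suggest a network or startup problem rather than a Spark configuration problem.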
# Creating PySpark Context
from pyspark import SparkContext
sc = SparkContext("local", "movie lens app")
# Creating PySpark SQL Context
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
# Load and return a DataFrame for the given keyspace and table
def load_and_get_table_df(keys_space_name, table_name):
    table_df = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .options(table=table_name, keyspace=keys_space_name)\
        .load()
    return table_df
# Loading movies & ratings table data frames
movies = load_and_get_table_df("movie_lens", "movies")
ratings = load_and_get_table_df("movie_lens", "ratings")
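After running the code above, I get an error: I cannot connect to Cassandra or read data from it. Please help me, as I am a complete beginner at getting PySpark and Cassandra to talk to each other.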
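Since the error text is not included, one thing that may be worth ruling out is a mismatch between the connector artifact and the Spark build inside the jupyter/pyspark-notebook image. The `_2.11` suffix in the package coordinate is the Scala version the connector was compiled for, and it needs to match the Scala version of the Spark distribution in use. The sketch below only makes those values explicit. The helper name `build_submit_args` is hypothetical, and the values passed in are the ones from the question, not a recommendation.

```python
import os


def build_submit_args(scala_version: str, connector_version: str, host: str) -> str:
    """Assemble the PYSPARK_SUBMIT_ARGS string for the Cassandra connector."""
    package = (
        "com.datastax.spark:"
        f"spark-cassandra-connector_{scala_version}:{connector_version}"
    )
    return (
        f"--packages {package} "
        f"--conf spark.cassandra.connection.host={host} "
        "pyspark-shell"
    )


# Values taken from the question; replace them with the pair that matches
# the Spark and Scala versions actually installed in the container.
submit_args = build_submit_args("2.11", "2.3.0", "cassandra")
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```

The environment variable has to be set before the SparkContext is created, because it is read when the JVM is launched. Once a context exists, `sc.version` reports the Spark version to compare against the connector's compatibility table.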