How do I read data from AWS DocumentDB in an AWS Jupyter notebook using PySpark?

7lrncoxx posted on 2023-03-22 in Spark

I did some research and tried the code below.
The data in DocumentDB is a collection whose documents have fields such as ID, Name, Year, and Department.
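For concreteness, a document in the test.emp collection might look like the following (the values are made up; only the field names come from the question):

sample_doc = {
    "ID": 101,
    "Name": "John",
    "Year": 2021,
    "Department": "Sales",
}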

# Reading data from the DocumentDB database
from pyspark.sql import SparkSession
from pymongo import MongoClient

uri = "mongodb://AdminDB:Admin123@empcluster.cluster-cazza3k7n7jk.us-east-1.docdb.amazonaws.com:27017/test.emp"

# Spark session pointed at the DocumentDB (MongoDB-compatible) endpoint
spark = SparkSession.builder \
    .appName("MongoDBIntegration") \
    .config("spark.mongodb.input.uri", uri) \
    .config("spark.mongodb.output.uri", uri) \
    .getOrCreate()

# A plain pymongo client against the same cluster
client = MongoClient(uri)
db = client["test"]
collection = db["emp"]

# Reading through the Spark connector is the step that fails
df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", uri).load()
df2.show()

But I got the error below.
Is there any other way to solve this, or another way to read the data from the database?

An error was encountered:
An error occurred while calling o199.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:675)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:213)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:

wvt8vs2t1#

To query an external database through Spark you need the matching connector jar on the classpath. For a MongoDB-compatible store such as DocumentDB that is the MongoDB Spark Connector (not a JDBC driver); pick the artifact that matches your Spark and Scala versions: https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector
When creating the Spark session, set .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.13:10.1.1") in the configuration, or put the jar into Spark's jars directory yourself. Some additional configuration is needed as well; a minimal sketch follows. (For genuinely JDBC-based sources, see https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html.)
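Putting that advice together with the question's code, here is a minimal sketch of a working read. It assumes the 10.x connector line mentioned above, which uses the short format name "mongodb" and the spark.mongodb.read.*/write.* keys (the com.mongodb.spark.sql.DefaultSource name and spark.mongodb.input.uri/output.uri keys in the question belong to the older 3.x series), and it assumes a Spark build on Scala 2.12, hence the _2.12 suffix; check both against your cluster before copying.

from pyspark.sql import SparkSession

# Host and credentials copied from the question; replace with your own.
uri = "mongodb://AdminDB:Admin123@empcluster.cluster-cazza3k7n7jk.us-east-1.docdb.amazonaws.com:27017/test.emp"

spark = (
    SparkSession.builder
    .appName("MongoDBIntegration")
    # Fetch the connector when the session starts; the Scala suffix
    # (_2.12 here) must match the Scala build of your Spark cluster.
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.12:10.1.1")
    # Connector 10.x configuration keys (the 3.x series used
    # spark.mongodb.input.uri / spark.mongodb.output.uri instead).
    .config("spark.mongodb.read.connection.uri", uri)
    .config("spark.mongodb.write.connection.uri", uri)
    .getOrCreate()
)

# With connector 10.x the data source is registered under the short
# name "mongodb", and database/collection are passed as options
# rather than as the test.emp suffix of the URI.
df2 = (
    spark.read.format("mongodb")
    .option("database", "test")
    .option("collection", "emp")
    .load()
)
df2.show()

Two caveats: spark.jars.packages only takes effect when the session is created, so in a Sparkmagic-backed notebook (as on EMR) it has to go into a %%configure -f cell before any Spark code runs; and DocumentDB clusters have TLS enabled by default, in which case the connection string also needs tls=true and the Amazon RDS CA bundle made available to the driver and executors, a setup step not shown here.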
