I did some R&D and tried the code below.
The data in DocumentDB consists of receipt ID, name, year, department, and so on.
# Reading data from a DocumentDB (MongoDB-compatible) database with Spark
from pyspark.sql import SparkSession
from pymongo import MongoClient

uri = "mongodb://AdminDB:Admin123@empcluster.cluster-cazza3k7n7jk.us-east-1.docdb.amazonaws.com:27017/test.emp"

spark = SparkSession.builder \
    .appName("MongoDBIntegration") \
    .config("spark.mongodb.input.uri", uri) \
    .config("spark.mongodb.output.uri", uri) \
    .getOrCreate()

# Direct pymongo connection to the same cluster
client = MongoClient(uri)
db = client["test"]
collection = db["emp"]

# Read the collection into a Spark DataFrame via the MongoDB Spark connector
df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", uri).load()
df2.show()
But I got the error below.
Is there another way to solve this, or some other way to read the data from the database?
An error was encountered:
An error occurred while calling o199.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:675)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:213)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:
1 Answer
When you want to query an external database through Spark, you need the appropriate connector jar on the classpath. For MongoDB/DocumentDB this is the MongoDB Spark Connector (not a JDBC driver); check the jar that matches your Spark and Scala versions here: https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector
When creating the Spark session, set it in the config:
.config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.13:10.1.1")
Alternatively, you can place the jar in Spark's jars directory. Some additional configuration may also be needed; for general guidance on external data sources, see: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
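As a sketch of how this fits together (a minimal example, not a definitive setup): the endpoint and credentials are copied from the question, and the connector coordinates assume a Spark build on Scala 2.12 with connector 3.0.x, which matches the `com.mongodb.spark.sql.DefaultSource` source name used in the question. With connector 10.x the source name is `mongodb` and the config keys are `spark.mongodb.read.connection.uri` / `spark.mongodb.write.connection.uri` instead, so pick versions consistently.

```python
from pyspark.sql import SparkSession

uri = "mongodb://AdminDB:Admin123@empcluster.cluster-cazza3k7n7jk.us-east-1.docdb.amazonaws.com:27017/test.emp"

spark = (
    SparkSession.builder
    .appName("MongoDBIntegration")
    # Download the connector jar at startup instead of relying on it
    # being preinstalled. Coordinates assume Scala 2.12 / connector 3.x;
    # adjust to match your cluster.
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.2")
    .config("spark.mongodb.input.uri", uri)
    .config("spark.mongodb.output.uri", uri)
    .getOrCreate()
)

# 3.x source name; with connector 10.x this would be format("mongodb")
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", uri).load()
df.show()
```

On EMR you can achieve the same thing by passing `--packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.2` to spark-submit instead of setting it in the builder.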