How do I use Spark to load data from an Oracle database into a DataFrame or RDD, and then write that data into a Hive table?
I have the following code:
import java.util.HashMap;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public static void main(String[] args) {
    SparkConf conf = new SparkConf()
            .setAppName("Data transfer test (Oracle -> Hive)")
            .setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    // JDBC connection options for the Oracle source table
    HashMap<String, String> options = new HashMap<>();
    options.put("url", "jdbc:oracle:thin:@<ip>:<port>:orcl");
    options.put("dbtable", "ACCOUNTS");
    options.put("user", "username");
    options.put("password", "12345");
    options.put("driver", "oracle.jdbc.OracleDriver");
    options.put("numPartitions", "4");

    // Read the Oracle table into a DataFrame via the JDBC data source
    DataFrame oracleDataFrame = sqlContext.read()
            .format("jdbc")
            .options(options)
            .load();
}
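The part I cannot get past is the write side. What I am aiming for is roughly the following (a minimal sketch, assuming the HiveContext can actually be constructed; the target table name "accounts_copy" and the overwrite save mode are placeholders of mine, and it additionally needs org.apache.spark.sql.SaveMode and org.apache.spark.sql.hive.HiveContext imports):

    // Sketch: read from Oracle through the HiveContext and save the result as a Hive table.
    // "accounts_copy" is a placeholder table name; SaveMode.Overwrite replaces any existing table.
    HiveContext hiveContext = new HiveContext(sc);
    DataFrame hiveDataFrame = hiveContext.read()
            .format("jdbc")
            .options(options)
            .load();
    hiveDataFrame.write()
            .mode(SaveMode.Overwrite)
            .saveAsTable("accounts_copy");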
However, as soon as I create a HiveContext instance in order to use Hive:
HiveContext hiveContext = new HiveContext(sc);
I get the same error:
ERROR conf.Configuration: Failed to set setXIncludeAware(true) for parser oracle.xml.jaxp.JXDocumentBuilderFactory@51be472e: java.lang.UnsupportedOperationException: setXIncludeAware is not supported on this JAXP implementation or earlier: class oracle.xml.jaxp.JXDocumentBuilderFactory
java.lang.UnsupportedOperationException: setXIncludeAware is not supported on this JAXP implementation or earlier: class oracle.xml.jaxp.JXDocumentBuilderFactory
at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:614)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2534)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2503)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2409)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1144)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1116)
at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:525)
at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:543)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:437)
at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:2750)
at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:2713)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:185)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:329)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:239)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:443)
at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:271)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:103)
at replicator.ImportFromOracleToHive.init(ImportFromOracleToHive.java:52)
at replicator.ImportFromOracleToHive.main(ImportFromOracleToHive.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
1 Answer
The problem appears to be an outdated Xerces dependency, as detailed in this related question. My guess is that you are pulling it in transitively somehow, but it is impossible to say without seeing your pom.xml. Note from the posted stack trace that the error originates in Hadoop Common's Configuration class, not in Spark itself. The fix is to make sure you are using a sufficiently recent version.
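If the offending artifact cannot easily be evicted from the dependency tree, one workaround that is often used for this particular JXDocumentBuilderFactory clash (my suggestion, not something spelled out above) is to pin JAXP to the JDK's built-in parser before the HiveContext is created:

    // Workaround sketch: force the JDK's bundled Xerces JAXP implementation so that
    // Hadoop's Configuration does not pick up oracle.xml.jaxp.JXDocumentBuilderFactory.
    // The class name assumes an Oracle/OpenJDK runtime; set it before creating the HiveContext.
    System.setProperty("javax.xml.parsers.DocumentBuilderFactory",
            "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");

    HiveContext hiveContext = new HiveContext(sc);

The cleaner fix remains the one described above: use mvn dependency:tree to find which dependency drags in the Oracle JAXP/Xerces classes, then exclude or upgrade it in your pom.xml.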