The Spark application loads data from Hive:
SparkSession spark = SparkSession.builder()
        .appName(topics)
        .config("hive.metastore.uris", "thrift://device1:9083")
        .enableHiveSupport()
        .getOrCreate();
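For context, a self-contained sketch of how such a session could then load data from Hive — the class name and the table name below are placeholders of mine, not taken from the original application:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveLoadSketch {
    public static void main(String[] args) {
        // Same builder pattern as above; enableHiveSupport() only succeeds if the
        // spark-hive classes are actually on the driver classpath.
        SparkSession spark = SparkSession.builder()
                .appName("hive-load-sketch")
                .config("hive.metastore.uris", "thrift://device1:9083")
                .enableHiveSupport()
                .getOrCreate();

        // Placeholder table; replace with a real Hive database.table
        Dataset<Row> rows = spark.sql("SELECT * FROM some_db.some_table LIMIT 10");
        rows.show();

        spark.stop();
    }
}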
I launch Spark as follows:
spark-submit --master local[*] --class zhihu.SparkConsumer target/original-kafka-consumer-0.1-SNAPSHOT.jar --jars spark-hive_2.11-2.4.4.jar
Maven pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.zhihu</groupId>
<artifactId>kafka-consumer</artifactId>
<packaging>jar</packaging>
<version>0.1-SNAPSHOT</version>
<name>kafkadev</name>
<url>http://maven.apache.org</url>
<repositories>
<repository>
<!-- Proper URL for Cloudera maven artifactory -->
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-core -->
<!-- https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-api -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>2.8.2</version>
</dependency>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.4</version>
<scope>compile</scope>
</dependency>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.4.4</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.4.4</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.1.0</version>
<exclusions>
<exclusion>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.log4j</groupId>
<artifactId>log4j-core</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
<scope>compile</scope>
</dependency>
<!-- gson -->
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.2</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-metastore</artifactId>
<version>2.1.1-cdh6.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-service</artifactId>
<version>2.1.1-cdh6.2.0</version>
</dependency>
<!-- runtime Hive -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-common</artifactId>
<version>2.1.1-cdh6.2.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-beeline</artifactId>
<version>2.1.1-cdh6.2.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>2.1.1-cdh6.2.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-shims</artifactId>
<version>2.1.1-cdh6.2.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.1.1-cdh6.2.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-serde</artifactId>
<version>2.1.1-cdh6.2.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-contrib</artifactId>
<version>2.1.1-cdh6.2.0</version>
<scope>runtime</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.7.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>**/Log4j2Plugins.dat</exclude>
</excludes>
</filter>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<artifactSet>
<excludes>
<exclude>classworlds:classworlds</exclude>
<exclude>junit:junit</exclude>
<exclude>jmock:*</exclude>
<exclude>*:xml-apis</exclude>
<exclude>org.apache.maven:lib:tests</exclude>
</excludes>
</artifactSet>
<skip>true</skip>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
It looks fine, but it always throws:
20/05/07 12:03:17 INFO spark.SparkContext: Added JAR file:/data/projects/zhihu_scraper/consumers/target/original-kafka-consumer-0.1-SNAPSHOT.jar at spark://device2:42395/jars/original-kafka-consumer-0.1-SNAPSHOT.jar with timestamp 1588824197724
20/05/07 12:03:17 INFO executor.Executor: Starting executor ID driver on host localhost
20/05/07 12:03:17 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33849.
20/05/07 12:03:17 INFO netty.NettyBlockTransferService: Server created on device2:33849
20/05/07 12:03:17 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/05/07 12:03:17 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, device2, 33849, None)
20/05/07 12:03:17 INFO storage.BlockManagerMasterEndpoint: Registering block manager device2:33849 with 366.3 MB RAM, BlockManagerId(driver, device2, 33849, None)
20/05/07 12:03:17 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, device2, 33849, None)
20/05/07 12:03:17 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, device2, 33849, None)
20/05/07 12:03:17 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@63e5e5b4{/metrics/json,null,AVAILABLE,@Spark}
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:869)
at zhihu.SparkConsumer.main(SparkConsumer.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/05/07 12:03:18 INFO spark.SparkContext: Invoking stop() from shutdown hook
I have already tried every answer from the post on how to create a SparkSession with Hive support, but none of them work for me.
1 Answer
I don't know why the `compile` scope is used there; the scope should probably be `runtime`. Since you are using the maven-shade-plugin, you can package an uber jar (target/original-kafka-consumer-0.1-SNAPSHOT.jar) with all dependencies bundled into a single archive, and that jar will be on the classpath so that nothing is missed. Also, hive-site.xml should be on the classpath; then there is no need to configure the metastore URIs separately in code. Further reading
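To illustrate that last point, a minimal sketch (class name and app name are mine) that relies on a hive-site.xml found on the classpath instead of setting hive.metastore.uris programmatically:

import org.apache.spark.sql.SparkSession;

public class HiveSessionFromClasspath {
    public static void main(String[] args) {
        // With hive-site.xml on the classpath (e.g. under src/main/resources,
        // or in $SPARK_CONF_DIR on the submitting machine), Spark reads the
        // metastore URIs from there, so no .config("hive.metastore.uris", ...)
        // call is needed.
        SparkSession spark = SparkSession.builder()
                .appName("kafka-consumer")
                .enableHiveSupport()
                .getOrCreate();

        spark.sql("SHOW DATABASES").show();
        spark.stop();
    }
}

Packaging the application as a shaded (uber) jar and submitting that jar is what puts the spark-hive and Hive classes on the classpath in the first place, which is what the "Hive classes are not found" error is complaining about.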