org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions exception when writing to a Hive partitioned table from a Spark (Scala 2.11) DataFrame

siv3szwd · posted 2021-05-29 · in Hadoop

I am seeing some strange behavior. My use case is to run

sqlContext.sql("INSERT OVERWRITE TABLE <table> PARTITION (<partition column>) SELECT * FROM <temp table from dataframe>")

The strange part is that this works from the pyspark shell on host A, but the same code, connected to the same cluster and using the same Hive table, fails from a Jupyter notebook with:

java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions

It looks to me like this exception comes from a jar mismatch between the host where the pyspark shell is launched and the host where Jupyter runs. My questions: how can I determine, from code, which version of the relevant jar is in use in the pyspark shell and in the Jupyter notebook (I have no access to the Jupyter server)? And if both the pyspark shell and Jupyter connect to the same cluster, why would two different versions be in use at all?
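One way to answer the first question from inside each session is to ask the driver JVM (exposed to Python by py4j) where it loaded the class from. This is a sketch, assuming a live session (`spark.sparkContext._jvm` in Spark 2.x, or `sc._jvm` in an older pyspark shell); the helper names `jar_of` and `jar_version` are mine, and the gateway's classloader must be able to see the class for this to work:

```python
import re

def jar_of(jvm, class_name):
    """Return the jar URL a class was loaded from, or None if unknown."""
    cls = jvm.java.lang.Class.forName(class_name)
    src = cls.getProtectionDomain().getCodeSource()
    return None if src is None else src.getLocation().toString()

def jar_version(location):
    """Pull the version out of a 'hive-exec-<version>.jar' style URL."""
    m = re.search(r"hive-exec-([\w.\-]+)\.jar", location or "")
    return m.group(1) if m else None

# Run this in both the pyspark shell and the Jupyter notebook and compare:
# loc = jar_of(spark.sparkContext._jvm,
#              "org.apache.hadoop.hive.ql.metadata.Hive")
# print(loc, jar_version(loc))
```

If the two sessions print different jar paths, you have direct evidence of the mismatch without needing shell access to the Jupyter host.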
Update: after some digging I found that Jupyter goes through Livy. The Livy host uses hive-exec-2.0.1.jar, while the host where we run the pyspark shell uses hive-exec-1.2.1000.2.5.3.58-3.jar. I downloaded both jars from a Maven repository and decompiled them: the loadDynamicPartitions method exists in both, but the method signatures (parameters) differ; in the Livy version, the boolean holdDDLTime parameter is missing.
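The signature difference can also be inspected without decompiling, using Java reflection over whatever jar the running session actually loaded. Again a sketch: `method_signatures` is a hypothetical helper, and it assumes the py4j gateway JVM can resolve the class:

```python
def method_signatures(jvm, class_name, method_name):
    """List the parameter-type names of every overload of a method."""
    cls = jvm.java.lang.Class.forName(class_name)
    sigs = []
    for m in cls.getDeclaredMethods():
        if m.getName() == method_name:
            sigs.append([p.getName() for p in m.getParameterTypes()])
    return sigs

# In each session, print the overloads Spark can actually call, e.g.:
# for sig in method_signatures(spark.sparkContext._jvm,
#                              "org.apache.hadoop.hive.ql.metadata.Hive",
#                              "loadDynamicPartitions"):
#     print(sig)
```

A NoSuchMethodException at runtime usually means Spark was compiled against one of these signatures and found only the other on the classpath.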

weylhg0b 1#

I ran into a similar problem when trying to pull Maven dependencies from Cloudera:

<dependencies>
    <!-- Scala and Spark dependencies -->

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.0-cdh5.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>1.6.0-cdh5.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.10</artifactId>
        <version>1.6.0-cdh5.9.2</version>
    </dependency>
     <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.1.0-cdh5.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest_2.10</artifactId>
        <version>3.0.0-SNAP4</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.10</artifactId>
        <version>1.4.1</version>
    </dependency>
    <dependency>
        <groupId>commons-dbcp</groupId>
        <artifactId>commons-dbcp</artifactId>
        <version>1.2.2</version>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>1.4.0</version>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-xml_2.10</artifactId>
        <version>0.2.0</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk</artifactId>
        <version>1.0.12</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-s3</artifactId>
        <version>1.11.172</version>
    </dependency>
    <dependency>
        <groupId>com.github.scopt</groupId>
        <artifactId>scopt_2.10</artifactId>
        <version>3.2.0</version>
    </dependency>
    <dependency>
        <groupId>javax.mail</groupId>
        <artifactId>mail</artifactId>
        <version>1.4</version>
    </dependency>
</dependencies>
<repositories>
    <repository>
        <id>maven-hadoop</id>
        <name>Hadoop Releases</name>
        <url>https://repository.cloudera.com/content/repositories/releases/</url>
    </repository>
    <repository>
        <id>cloudera-repos</id>
        <name>Cloudera Repos</name>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
