Scala Spark job on a Dataproc cluster returns java.util.NoSuchElementException: None.get

rvpgvaaj · posted 2021-07-14 in Spark

I am getting the error

ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.util.NoSuchElementException: None.get

when I run my job on a Dataproc cluster, whereas it runs perfectly fine when I run it locally. I reproduced the problem with the toy example below.

package com.deequ_unit_tests

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object reduce_by_key_example {

  def main(args: Array[String]): Unit = {

    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()

    println("Step 1")
    val data = Seq(("Project", 1),
      ("Gutenberg’s", 1),
      ("Alice’s", 1),
      ("Adventures", 1),
      ("in", 1),
      ("Wonderland", 1),
      ("Project", 1),
      ("Gutenberg’s", 1),
      ("Adventures", 1),
      ("in", 1),
      ("Wonderland", 1),
      ("Project", 1),
      ("Gutenberg’s", 1))

    println("Step 2")
    val rdd = spark.sparkContext.parallelize(data)

    println("Step 3")
    val rdd2 = rdd.reduceByKey(_ + _)

    println("Step 4")
    rdd2.foreach(println)
  }
}

When I run this job on Dataproc, the error occurs when the following line is executed:

rdd2.foreach(println)

As additional information, I should mention that I was not getting this error before some changes were applied to my company's Dataproc cluster. For colleagues using PySpark, changing the equivalent PySpark version of the example above from

sc = SparkContext('local')
to
sc = SparkContext()

worked, but I could not find the equivalent solution in Spark Scala. Do you have any idea what might be causing this? Any help is appreciated.

mepcadol #1

Configure your pom.xml (or build.sbt) as follows, adding the provided scope to the dependencies (Spark is already installed on the Dataproc image), and do not hard-code a local master in the code, so that the cluster can supply the master when the job is submitted:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>stackOverFlowGcp</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.2.3</version>
            <scope>provided</scope>

        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.2.3</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>com.typesafe</groupId>
            <artifactId>config</artifactId>
            <version>1.4.0</version>
            <scope>provided</scope>

        </dependency>

    </dependencies>

    <build>
        <plugins>
            <!-- Maven Plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
            <!-- assembly Maven Plugin -->
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>mainPackage.mainObject</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

        </plugins>

    </build>

</project>
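
If you build with sbt instead of Maven, a rough build.sbt equivalent could look like the sketch below (versions copied from the pom above; the Scala 2.11.12 version is an assumption to match the _2.11 artifacts, and producing a fat jar would additionally require the sbt-assembly plugin):

name := "stackOverFlowGcp"
version := "1.0-SNAPSHOT"
scalaVersion := "2.11.12"

// Spark is already installed on the Dataproc nodes, so keep it out of the fat jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.3" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.2.3" % "provided",
  "com.typesafe"      % "config"     % "1.4.0" % "provided"
)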

Build the package: clean => rebuild => compile => package
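
From the command line this is essentially mvn clean package; with the assembly plugin configured above, the runnable jar should end up under target/ as stackOverFlowGcp-1.0-SNAPSHOT-jar-with-dependencies.jar.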

package mainPackage
import org.apache.spark.sql.SparkSession

object mainObject {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      //.master("local[*]")
      .appName("SparkByExamples")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    println("Step 1")
    val data = Seq(("Project", 1),
      ("Gutenberg’s", 1),
      ("Alice’s", 1),
      ("Adventures", 1),
      ("in", 1),
      ("Wonderland", 1),
      ("Project", 1),
      ("Gutenberg’s", 1),
      ("Adventures", 1),
      ("in", 1),
      ("Wonderland", 1),
      ("Project", 1),
      ("Gutenberg’s", 1))

    println("Step 2")
    val rdd = spark.sparkContext.parallelize(data)
    println("Step 3")
    val rdd2 = rdd.reduceByKey(_ + _)

    println("Step 4")
    rdd2.foreach(println)

  }
}

Create a Dataproc cluster
Run the Spark job on Dataproc
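
For illustration only, the submission could look roughly like this (cluster name, region and jar path are placeholders you would adapt to your project):

gcloud dataproc clusters create my-cluster --region=us-central1 --single-node

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=mainPackage.mainObject \
    --jars=target/stackOverFlowGcp-1.0-SNAPSHOT-jar-with-dependencies.jar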
In Dataproc you will not see the println output the way you do locally: rdd2.foreach(println) runs on the executors, so its output ends up in the executor logs rather than in the job's driver output. If you want the result in the job output, you can display it as a DataFrame instead, as sketched below.
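
A minimal sketch of that DataFrame variant, assuming the spark session from the example above (replace Step 4 with something like this):

import spark.implicits._

println("Step 4")
val df = rdd2.toDF("word", "count") // bring the (word, count) pairs into a DataFrame
df.show(false)                      // printed from the driver, so it shows up in the Dataproc job output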

As you can see when you run it on Dataproc, everything works fine. Don't forget to shut down the cluster or delete it when you are done ;)
