I have the following Cloudera cluster specification:
I created a simple Spark SQL application that queries Hive tables. The tables are external: the data for the healtpersonalcare_reviews table is stored as JSON files, and the data for the healtpersonalcare_ratings table is in CSV format (115 MB). Here is my code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val warehouseLocation = "/hive/warehouse"
val args_list = args.toList

val conf = new SparkConf()
  .set("spark.sql.warehouse.dir", warehouseLocation)
  .set("spark.kryoserializer.buffer.max", "1024m")

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()

val table_view_name = args_list(0)
val limit = args_list(1)

// JSON SerDe required for the external table backed by JSON files
val df_addjar = spark.sql("ADD JAR /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar")
val df_use = spark.sql("use testing")

// Left join the reviews table against the ratings table
val df = spark.sql("SELECT hp.asin, hp.helpful, hp.overall, hp.reviewerid, hp.reviewername, hp.reviewtext, hp.reviewtime, hp.summary, hp.unixreviewtime FROM testing.healtpersonalcare_reviews hp LEFT JOIN testing.health_ratings hr ON (hp.reviewerid = hr.reviewerid)")

val df_create_join_table = spark.sql("CREATE TABLE IF NOT EXISTS healtpersonalcare_joins (asin string, helpful array<int>, overall double, reviewerid string, reviewername string, reviewtext string, reviewtime string, summary string, unixreviewtime int)")

df.cache()
// Pulls the entire join result into the driver JVM
df.collect().foreach(println)
System.exit(0)
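As an aside, the healtpersonalcare_joins table above is created but never populated. A minimal sketch of writing the join result straight into it (assuming the column order matches the table definition), which would keep the rows on the executors rather than on the driver:

// Hypothetical alternative: write the join result into the new table
// rather than collecting it to the driver (column order assumed to match).
df.write.mode("append").insertInto("testing.healtpersonalcare_joins")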
I run the application with the following command:
spark-submit --class org.sia.chapter03app.App --master yarn --deploy-mode client --executor-memory 1024m --driver-memory 1024m --conf spark.driver.maxResultSize=2g --verbose /root/sparktest/original-chapter03app-0.0.1-SNAPSHOT.jar name 10
I tried varying the values of --executor-memory and --driver-memory:
With --executor-memory 1024m --driver-memory 1024m I get the error "java.lang.OutOfMemoryError: Java heap space".
With --executor-memory 2048m --driver-memory 2048m I get "Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded".
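I also notice that my limit argument is parsed but never used. A minimal sketch of how it could bound what reaches the driver (hypothetical use of that argument):

// Hypothetical: cap the rows pulled back to the driver with the parsed
// `limit` argument instead of collecting the full join result.
df.limit(limit.toInt).collect().foreach(println)

// Or print a bounded number of rows without a full collect:
df.show(limit.toInt, truncate = false)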
Has anyone run into a problem like this? What is the solution? Thanks.