Spark loading 150 million records into MySQL taking 2-3 hours

Posted by fnatzsnv on 2021-06-20 in MySQL


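I have a Spark process that does some calculations and then inserts the results into a MySQL table. All the calculations finish in 40-50 minutes, but writing to the table takes 2-3 hours (depending on database usage). I tried setting the batch size: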

val db_url_2 = "jdbc:mysql://name.amazonaws.com:port/db_name?rewriteBatchedStatements=true" 

df_trsnss.write
  .format("jdbc")
  .option("url", db_url_2)
  .option("dbtable", output_table_name)
  .option("user", db_user)
  .option("password", db_pwd)
  .option("truncate", "true")
  .option("batchsize", 5000)
  .mode("overwrite")
  .save()

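But it still takes a very long time to load, and I can't spend 2-4 hours every day computing the data and writing it to the table.
Is there any way to speed this up?
I'm starting to think about writing to CSV and then loading the CSV into the database, so that I can cut down the EMR time.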
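
For reference, the CSV route mentioned above might look roughly like the sketch below. It assumes the MySQL server permits LOAD DATA LOCAL INFILE (local_infile enabled) and that the JDBC URL carries allowLoadLocalInfile=true for Connector/J; neither is confirmed in the question. The object name, function names and file layout (comma-separated, header row) are hypothetical. Note that Spark's df.write.option("header", "true").csv(dir) produces a directory of part files, so each part file would need to be loaded in turn.

```scala
import java.sql.DriverManager

object CsvBulkLoadSketch {
  // Builds a MySQL LOAD DATA statement for a comma-separated file with a header row.
  // Assumption: csvPath and table come from trusted configuration, not user input.
  def loadDataSql(csvPath: String, table: String): String =
    s"""LOAD DATA LOCAL INFILE '$csvPath'
       |INTO TABLE $table
       |FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
       |LINES TERMINATED BY '\\n'
       |IGNORE 1 LINES""".stripMargin

  // Runs the statement over a plain JDBC connection.
  // Assumption: jdbcUrl includes allowLoadLocalInfile=true and the MySQL driver is on the classpath.
  def bulkLoad(jdbcUrl: String, user: String, password: String,
               csvPath: String, table: String): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl, user, password)
    try {
      val stmt = conn.createStatement()
      try stmt.execute(loadDataSql(csvPath, table))
      finally stmt.close()
    } finally conn.close()
  }
}
```

Whether this beats the JDBC write depends on where the CSV lands (local disk, HDFS, S3) and on how the database is hosted, so it is worth timing on a sample first.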

mysql apache-spark amazon-emr amazon-web-services

Source: https://stackoverflow.com/questions/52826038/spark-loading-150-million-records-into-mysql-taking-2-3-hours

1 Answer


l7wslrjt1#

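Try this; in fact, it comes straight from the Databricks guide:
JDBC writes
Spark's partitions dictate the number of connections used to push data through the JDBC API. Depending on the existing number of partitions, you can control the parallelism by calling coalesce() or repartition(). Call coalesce when reducing the number of partitions, and repartition when increasing it.
Try it, see how it compares with your current write approach, and let us know.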

import org.apache.spark.sql.SaveMode

val df = spark.table("diamonds")
println(df.rdd.partitions.length)

// Given the number of partitions above, you can reduce the partition value by calling coalesce() or increase it by calling repartition() to manage the number of connections.
df.repartition(10).write.mode(SaveMode.Append).jdbc(jdbcUrl, "diamonds", connectionProperties)
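
Applied to the write in the question, the same idea could be sketched as below. This is only an illustration under assumptions: the object and helper names are hypothetical, and the number of connections is a placeholder to tune against what the MySQL instance can sustain. The batchsize and truncate options are carried over from the question, and the URL is expected to keep rewriteBatchedStatements=true as in the question.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

object JdbcWriteSketch {
  // Hypothetical helper: coalesce when shrinking (avoids a full shuffle),
  // repartition when growing, and leave the DataFrame alone otherwise.
  def withPartitions(df: DataFrame, target: Int): DataFrame = {
    val current = df.rdd.getNumPartitions
    if (target < current) df.coalesce(target)
    else if (target > current) df.repartition(target)
    else df
  }

  // Each partition opens its own JDBC connection, so `connections`
  // is effectively the write parallelism against MySQL.
  def writeToMysql(df: DataFrame, url: String, table: String,
                   user: String, password: String, connections: Int): Unit = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)

    withPartitions(df, connections)
      .write
      .option("batchsize", "5000")   // rows per JDBC batch, as in the question
      .option("truncate", "true")    // truncate rather than drop on overwrite
      .mode(SaveMode.Overwrite)
      .jdbc(url, table, props)
  }
}
```

A call might look like JdbcWriteSketch.writeToMysql(df_trsnss, db_url_2, output_table_name, db_user, db_pwd, 10), with 10 chosen only as a starting point. More connections help only until the database becomes the bottleneck.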

Answered 2021-06-20
