I'm writing Spark code that inserts from a DataFrame into a MySQL table, and I'm getting an error:
df_to_write.write.format("jdbc").option("url", jdbc_url).option("driver", "com.mysql.jdbc.Driver")\
.option("dbtable", tbl_name).option("user", DBUser).option("password", DBPassword)\
.option("numPartitions",32).mode('append').save()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 18.0 failed 4 times, most recent failure: Lost task 1.3 in stage 18.0 (TID 170, 10.151.244.77, executor 0): java.sql.BatchUpdateException: Data truncation: Incorrect datetime value: '1970-01-01 00:00:00' for column
I found that this happens because the column value is 1970-01-01 00:00:00.
However, if I run the equivalent INSERT statement in MySQL Workbench, it only issues a warning and inserts the value as zeros.
Example:
create table test_time (processing_ts timestamp NULL DEFAULT NULL);
insert into test_time values ('1990-01-01T00:00:00.000+0000'),('1990-01-01T00:00:00.000+0000'),('1970-01-01T00:00:00.000+0000');
The table then contains the following values:
1990-01-01 00:00:00
1990-01-01 00:00:00
0000-00-00 00:00:00 --> no error, only a warning, and the value was changed to 0000-00-00 00:00:00
I'd like to know what setting I can apply in Spark to get the same behavior: no error, with the value stored as 0000-00-00 00:00:00. Is there a way to apply something like MySQL's INSERT IGNORE from Spark when writing to the table?
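One Spark-side knob sometimes used for this (an assumption on my part, not verified against this exact setup) is the JDBC source's `sessionInitStatement` option (available since Spark 2.4), which runs a statement when each JDBC connection is opened. Clearing `sql_mode` for the session asks MySQL to downgrade the out-of-range error to a warning, which is what Workbench's behavior above relies on:

```python
# Hypothetical sketch: relax MySQL's strict mode for the JDBC session so an
# out-of-range timestamp is coerced (with a warning) instead of raising an
# error. Whether zero-date values are acceptable depends on your data.
df_to_write.write.format("jdbc") \
    .option("url", jdbc_url) \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", tbl_name) \
    .option("user", DBUser) \
    .option("password", DBPassword) \
    .option("sessionInitStatement", "SET SESSION sql_mode=''") \
    .option("numPartitions", 32) \
    .mode("append").save()
```

This is a connection-level setting, not an INSERT IGNORE; Spark's JDBC writer does not expose INSERT IGNORE directly.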
Edit: as requested in the comments, here is the repro code.
SQL:
mysql> create table test (processing_ts timestamp null);
Query OK, 0 rows affected (0.05 sec)
mysql> select * from test;
+---------------------+
| processing_ts       |
+---------------------+
| 1997-02-28 10:30:00 |
+---------------------+
1 row in set (0.00 sec)
import pyspark.sql.functions as f

df = spark.createDataFrame([('1970-01-01 00:00:00',)], ['processing_ts'])
df2 = df.select(f.to_timestamp(df.processing_ts, 'yyyy-MM-dd HH:mm:ss').alias('processing_ts'))
db_host = '127.0.0.1'
DBName = 'test'
jdbc_url = "jdbc:mysql://{}/{}".format(db_host ,DBName)
DBUser = 'XXXXXX'
DBPassword = 'XXXXXXX123'
tbl_name = 'test'
df2.write.format("jdbc").option("url", jdbc_url).option("driver", "com.mysql.jdbc.Driver")\
.option("dbtable", tbl_name).option("user", DBUser).option("password", DBPassword)\
.option("numPartitions",32).mode('append').save()
1 Answer
The range of TIMESTAMP values is '1970-01-01 00:00:01.000000' to '2038-01-19 03:14:07.999999' (see the MySQL documentation). The JDBC driver may not handle out-of-range values the way Workbench does, so it aborts the insert job. The only real options here are to replace the timestamp 1970-01-01 00:00:00 with an in-range value (or NULL), or to change the column type to DATETIME, whose supported range is '1000-01-01 00:00:00' to '9999-12-31 23:59:59'.