Most efficient way to merge timestamp columns in a Spark DataFrame

krugob8w  posted on 2021-05-27  in Spark

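What is the most efficient way to merge two columns in a Spark DataFrame?
I have two columns that mean the same thing. Null values in timestamp should be filled with the values from toAppendData_timestamp. When both columns have a value, the two values are equal.
I have this: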

+--------------------+----------------------+--------+
|           timestamp|toAppendData_timestamp|   value|
+--------------------+----------------------+--------+
|2016-03-24 22:11:...|                  null|    null|
|                null|  2016-03-24 22:12:...|0.015625|
|                null|  2016-03-19 15:54:...|   5.375|
|2016-03-19 15:55:...|  2016-03-19 15:55:...| 5.78125|
|2016-03-19 15:56:...|                  null|    null|
|2016-03-24 22:11:...|  2016-03-24 22:11:...| 0.15625|
+--------------------+----------------------+--------+

I need this:

+--------------------+----------------------+--------+
|    timestamp_merged|toAppendData_timestamp|   value|
+--------------------+----------------------+--------+
|2016-03-24 22:11:...|                  null|    null|
|2016-03-24 22:12:...|  2016-03-24 22:12:...|0.015625|
|2016-03-19 15:54:...|  2016-03-19 15:54:...|   5.375|
|2016-03-19 15:55:...|  2016-03-19 15:55:...| 5.78125|
|2016-03-19 15:56:...|                  null|    null|
|2016-03-24 22:11:...|  2016-03-24 22:11:...| 0.15625|
+--------------------+----------------------+--------+

I tried this, but it did not work:

appendedData = appendedData['timestamp'].fillna(appendedData['toAppendData_timestamp'])

Answer #1 (2ledvvac)

The function you are looking for is coalesce. You can import it from pyspark.sql.functions:

from pyspark.sql.functions import coalesce, col
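
For intuition, coalesce evaluates its arguments row by row and returns the first one that is not null. The plain-Python sketch below mimics that behaviour on a few tuples. The helper name coalesce_row and the timestamp strings are made up for illustration and are not part of PySpark or of the question's data.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def coalesce_row(*values: Optional[T]) -> Optional[T]:
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# Hypothetical (timestamp, toAppendData_timestamp) pairs, invented for this sketch.
rows = [
    ("2020-01-01 00:00:00", None),
    (None, "2020-01-01 00:01:00"),
    ("2020-01-01 00:02:00", "2020-01-01 00:02:00"),
    (None, None),
]

# Apply the same "first non-null wins" rule to every row.
merged = [coalesce_row(ts, to_append_ts) for ts, to_append_ts in rows]
print(merged)
```

Because the first argument wins whenever it is present, the argument order matters only for rows where both columns are set, and in the question those values are stated to be equal.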

Usage:

appendedData.withColumn(
    'timestamp_merged', 
    coalesce(col('timestamp'), col('toAppendData_timestamp'))
)
