Python: How to convert a PySpark column to date type when it contains null values

bqjvbblv, posted on 2023-03-09 in Apache

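In PySpark, I have a DataFrame whose dates were imported as strings. These string date columns contain null values. I am trying to convert them to date type columns, but I keep getting errors. Here is a small sample of the DataFrame: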

+--------+----------+----------+
|DeviceId|   Created| EventDate|
+--------+----------+----------+
|       1|      null|2017-03-09|
|       1|      null|2017-03-09|
|       1|2017-03-09|2017-03-09|
|       1|2017-03-15|2017-03-15|
|       1|      null|2017-05-06|
|       1|2017-05-06|2017-05-06|
|       1|      null|      null|
+--------+----------+----------+

When there are no null values, I found that the following code converts the data type:

dt_func = udf(lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
df = df.withColumn('Created', dt_func(col('Created')))

As soon as null values are present, it crashes. I tried modifying the udf to account for null values, as follows:

import numpy as np
def convertDatetime(x):
    return sf.when(x.isNull(), 'null').otherwise(datetime.strptime(x, '%Y-%m-%d'))
dt_func =  udf(convertDatetime, DateType())
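A note on this attempt: the body of a udf receives plain Python values (None for a null cell), not Column objects, so sf.when and x.isNull() cannot be used inside it. A minimal sketch of a null-safe variant follows. The names parse_date_or_none and with_parsed_dates are hypothetical helpers introduced only for illustration, and treating blank strings as null is an assumption.

```python
from datetime import date, datetime
from typing import Optional


def parse_date_or_none(value: Optional[str], fmt: str = "%Y-%m-%d") -> Optional[date]:
    """Parse a date string; return None for null or blank input (hypothetical helper)."""
    if value is None or value.strip() == "":
        return None
    return datetime.strptime(value, fmt).date()


def with_parsed_dates(df, column_names):
    """Apply the null-safe parser to each named column of a PySpark DataFrame."""
    # Spark imports are kept local so the pure-Python helper above works without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import DateType

    to_date_udf = udf(parse_date_or_none, DateType())
    for name in column_names:
        df = df.withColumn(name, to_date_udf(col(name)))
    return df
```

With the question's DataFrame this would be called as with_parsed_dates(df, ['Created', 'EventDate']). A Python udf is generally slower than Spark's built-in date functions, so this is mainly useful when the parsing logic cannot be expressed with them.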

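I also tried filling the nulls with an arbitrary date string, converting the columns to dates, and then replacing the arbitrary fill date with nulls, as follows: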

def dt_conv(df, cols, form = '%Y-%m-%d', temp_plug = '1900-01-01'):
    df = df.na.fill(temp_plug)
    dt_func =  udf (lambda x: datetime.strptime(x, form), DateType())

    for col_ in cols:
        df = df.withColumn(col_, dt_func(col(col_)))
    df = df.replace(datetime.strptime(temp_plug, form), 'null')
    return df

However, this approach produces the following error:

ValueError: to_replace should be a float, int, long, string, list, tuple, or dict

Can anyone help me figure this out?


mqxuamgl #1

Try this:

# Some data, I added empty strings and nulls both
data = [(1,'','2017-03-09'),(1,None,'2017-03-09'),(1,'2017-03-09','2017-03-09')]

df = spark.createDataFrame(data).toDF('id','Created','EventDate')
df.show()

+---+----------+----------+
| id|   Created| EventDate|
+---+----------+----------+
|  1|          |2017-03-09|
|  1|      null|2017-03-09|
|  1|2017-03-09|2017-03-09|
+---+----------+----------+

from pyspark.sql.functions import unix_timestamp, when

# Null or empty strings map to 0; everything else is parsed to epoch seconds
(df
 .withColumn('Created-formatted',
             when(df.Created.isNull() | (df.Created == ''), 0)
             .otherwise(unix_timestamp(df.Created, 'yyyy-MM-dd')))
 .withColumn('EventDate-formatted',
             when(df.EventDate.isNull() | (df.EventDate == ''), 0)
             .otherwise(unix_timestamp(df.EventDate, 'yyyy-MM-dd')))
 .drop('Created', 'EventDate')
 .show())

+---+-----------------+-------------------+
| id|Created-formatted|EventDate-formatted|
+---+-----------------+-------------------+
|  1|                0|         1489035600|
|  1|                0|         1489035600|
|  1|       1489035600|         1489035600|
+---+-----------------+-------------------+

I used unix_timestamp, which returns a BigInt (seconds since the Unix epoch), but you can format the column however you need.


idfiyjo8 #2

Try this... just cast the column!

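from pyspark.sql.functions import col, to_date

# Reference the column by its own name; to_date returns null for null input
df_new = df.select(
    to_date(col("EventDate"), "yyyy-MM-dd").alias("EventDate-formatted")
)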
