Using regex functions on a date in PySpark

b5buobof  posted on 2021-05-27 in Spark

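I need to validate dates (stored as strings) in a PySpark DataFrame and remove any extra characters and symbols from them, if present. How can I do this?
I came across this code: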

regex_string='\/](19|[2-9][0-9])\d\d$)|(^29[\/]02[\/](19|[2-9][0-9])(00|04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96)$)'
df.select(regexp_extract(col("date"),regex_string,0).alias("cleaned_map"),col('date')).show()

Here is my output:

+-----------+-----------+
|cleaned_map|       date|
+-----------+-----------+
|           |01/06/w2020|
|           |02/06/2!020|
| 02/06/2020| 02/06/2020|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 02/06/2020| 02/06/2020|
+-----------+-----------+

My expected output:

+-----------+-----------+
|cleaned_map|       date|
+-----------+-----------+
| 01/06/2020|01/06/w2020|
| 02/06/2020|02/06/20!20|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 06/06/2020| 06/06/2020|
| 07/06/2020| 07/06/2020|
+-----------+-----------+
egmofgnx1#

Try this:

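import org.apache.spark.sql.functions.{regexp_replace, to_date}
import spark.implicits._ // assumes a SparkSession named `spark`

val df = Seq("01/06/w2020",
  "02/06/2!020",
  "02/06/2020",
  "03/06/2020",
  "04/06/2020",
  "05/06/2020",
  "02/06/2020",
  "//01/0/4/202/0").toDF("date")

// Remove every character except digits and 'T', then parse the result as ddMMyyyy.
df.withColumn("cleaned_map", regexp_replace($"date", "[^0-9T]", ""))
  .withColumn("date_type", to_date($"cleaned_map", "ddMMyyyy"))
  .show(false)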

    /**
      * +--------------+-----------+----------+
      * |date          |cleaned_map|date_type |
      * +--------------+-----------+----------+
      * |01/06/w2020   |01062020   |2020-06-01|
      * |02/06/2!020   |02062020   |2020-06-02|
      * |02/06/2020    |02062020   |2020-06-02|
      * |03/06/2020    |03062020   |2020-06-03|
      * |04/06/2020    |04062020   |2020-06-04|
      * |05/06/2020    |05062020   |2020-06-05|
      * |02/06/2020    |02062020   |2020-06-02|
      * |//01/0/4/202/0|01042020   |2020-04-01|
      * +--------------+-----------+----------+
      */

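Extend the character class if there are characters you want to exclude from removal; for example, "[^0-9/T]" also keeps the slashes.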
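The difference between a pattern that keeps only digits and one that also keeps slashes can be checked locally before running anything on a cluster. The sketch below uses Python's built-in re module rather than Spark (an assumption, so that the snippet runs anywhere), and the helper name strip_chars is made up for illustration. For these simple ASCII character classes, re.sub should behave the same way as regexp_replace.

```python
import re

def strip_chars(value, pattern):
    """Remove every character matched by `pattern` (local stand-in for regexp_replace)."""
    return re.sub(pattern, "", value)

samples = ["01/06/w2020", "02/06/2!020", "//01/0/4/202/0"]

# Keep digits (and 'T') only: the slashes are removed along with the junk.
digits_only = [strip_chars(s, r"[^0-9T]") for s in samples]

# Keep digits, 'T' and '/': existing slashes survive, including stray ones.
with_slashes = [strip_chars(s, r"[^0-9/T]") for s in samples]

print(digits_only)
print(with_slashes)
```

Note that the slash-preserving pattern leaves a value such as //01/0/4/202/0 unchanged, so that value still needs a separate validity check.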

2ledvvac2#

Try regexp_replace to remove the other characters and symbols.

from pyspark.sql import functions as F

df.show()

    # +-----------+
    # |       date|
    # +-----------+
    # |01/06/w2020|
    # |02/06/2!020|
    # | 02/06/2020|
    # +-----------+

df.withColumn("cleaned_map", F.regexp_replace("date", r"[^\d/]", "")).show()

    # +-----------+-----------+
    # |       date|cleaned_map|
    # +-----------+-----------+
    # |01/06/w2020| 01/06/2020|
    # |02/06/2!020| 02/06/2020|
    # | 02/06/2020| 02/06/2020|
    # +-----------+-----------+
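Removing stray characters does not by itself show that the result is a real date, and the question also asks about validation. As a rough sketch of that step, the plain-Python snippet below (not Spark code; clean_and_validate is a hypothetical helper) strips the unwanted characters and then parses the result as day/month/year. In PySpark, applying F.to_date with the pattern "dd/MM/yyyy" to the cleaned column would play a similar role.

```python
import re
from datetime import datetime

def clean_and_validate(raw):
    """Return the cleaned dd/mm/yyyy string, or None if it is not a real date."""
    # Same idea as the regexp_replace call above: drop anything but digits and '/'.
    cleaned = re.sub(r"[^0-9/]", "", raw)
    # Require an exact dd/mm/yyyy shape; strptime alone also accepts e.g. 1/6/2020.
    if not re.fullmatch(r"[0-9]{2}/[0-9]{2}/[0-9]{4}", cleaned):
        return None
    try:
        datetime.strptime(cleaned, "%d/%m/%Y")  # rejects e.g. 31/02/2020
    except ValueError:
        return None
    return cleaned

for raw in ["01/06/w2020", "02/06/2!020", "31/02/2020", "//01/0/4/202/0"]:
    print(raw, "->", clean_and_validate(raw))
```

Returning None for unparseable values keeps the bad rows easy to filter or inspect afterwards, instead of silently passing them through.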
