PySpark: converting a string to a date type

Posted by t3irkdon on 2023-06-05 in Spark

I am trying to convert a string to a date type. I tried the code below.

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, date_format

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Sample DataFrame with the string column
data = [("09-Aug-96",)]
df = spark.createDataFrame(data, ["date_string"])

# Convert string to date type
df = df.withColumn("date", to_date(df.date_string, "dd-MMM-yy"))

# Format the date as "dd-MM-yyyy"
df = df.withColumn("formatted_date", date_format(df.date, "dd-MM-yyyy"))

# Update the year part to four digits
df = df.withColumn("updated_date", date_format(df.date, "dd-MM-yyyy"))

# Show the result
df.show(truncate=False)

Result:

+-----------+----------+--------------+------------+
|date_string|date      |formatted_date|updated_date|
+-----------+----------+--------------+------------+
|09-Aug-96  |2096-08-09|09-08-2096    |09-08-2096  |
+-----------+----------+--------------+------------+

But I want the year to be 1996, i.e. the output should be 09-08-1996. Likewise, if the input is 05-Sep-23, I want 05-09-2023.

qcuzuvrc #1

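There is a crude way to get the required output using when().otherwise(), but the catch is that the data must not contain dates such as those from 1900 to 1923. For example, the input format gives you no way to tell 01-Jan-1923 from 01-Jan-2023.
Here is an example.
First check the year in the string date: if it is between 0 and (say) 25, insert "20" in front of the year in the string; otherwise insert "19".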

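from pyspark.sql import functions as func

# data_sdf is assumed to be a DataFrame with a string column 'str_dt' in dd-MMM-yy format
# substring('str_dt', 1, 7) is the 'dd-MMM-' prefix, substring('str_dt', 8, 2) is the two-digit year
data_sdf. \
    withColumn('new_str_dt', 
               func.when(func.substring('str_dt', 8, 2).cast('int').between(0, 25), 
                         func.concat(func.substring('str_dt', 1, 7), func.lit('20'), func.substring('str_dt', 8, 2))
                         ).
               otherwise(func.concat(func.substring('str_dt', 1, 7), func.lit('19'), func.substring('str_dt', 8, 2)))
               ). \
    withColumn('dt', func.to_date('new_str_dt', 'dd-MMM-yyyy')). \
    show()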

# +---------+-----------+----------+
# |   str_dt| new_str_dt|        dt|
# +---------+-----------+----------+
# |09-Aug-96|09-Aug-1996|1996-08-09|
# |01-Apr-23|01-Apr-2023|2023-04-01|
# +---------+-----------+----------+
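
As far as I know, the Spark 3 datetime pattern documentation says a two-digit yy is parsed against a base of 2000, which would explain why 96 comes out as 2096 in the question. If you want to sanity-check the pivot rule on a few values before running it in Spark, here is a minimal plain-Python sketch of the same idea. The function name expand_two_digit_year and the default pivot of 25 are my own choices for illustration, not anything from PySpark.

```python
def expand_two_digit_year(date_str: str, pivot: int = 25) -> str:
    """Turn 'dd-MMM-yy' into 'dd-MMM-yyyy' using a fixed pivot year.

    Two-digit years from 0 up to and including `pivot` get the prefix '20',
    everything else gets '19'. This mirrors the when().otherwise() rule above.
    """
    prefix, yy = date_str[:-2], date_str[-2:]   # e.g. '09-Aug-' and '96'
    century = "20" if 0 <= int(yy) <= pivot else "19"
    return f"{prefix}{century}{yy}"


# Hypothetical sample values, only for illustration
for s in ["09-Aug-96", "01-Apr-23", "4-Jun-03"]:
    print(s, "->", expand_two_digit_year(s))
```

The same limitation applies here: with a pivot of 25, a real date from 1900 to 1925 would be moved into the 2000s.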

lymnna71 #2

from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, date_format
import pyspark.sql.functions as F

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Sample DataFrame with the string column
data = [["19-Oct-18"],["23-Oct-18"],["1-Jan-93"],["4-Jun-03"],["1-Jan-97"],["25-Jul-97"]]

# Column names of the DataFrame
columns = ["str_dt"]

# Create the DataFrame
dataframe = spark.createDataFrame(data, columns)

# Split the string into day, month and two-digit year
parts = F.split('str_dt', '-')
df = (dataframe
      .withColumn("dt", parts.getItem(0))
      .withColumn("mon", parts.getItem(1))
      .withColumn("yr", parts.getItem(2)))

# Two-digit current year as an integer: years up to it become 20xx, the rest 19xx
current_year = int(datetime.now().strftime("%y"))
df = (df
      .withColumn('yer', F.when(F.col('yr').cast('int').between(0, current_year),
                                F.concat(F.lit('20'), F.col('yr')))
                          .otherwise(F.concat(F.lit('19'), F.col('yr'))))
      .withColumn('str_mod_dt', F.concat_ws('-', F.col('dt'), F.col('mon'), F.col('yer'))))

# Convert string to date type
df = df.withColumn("date", to_date(F.col('str_mod_dt'), "d-MMM-yyyy"))

# Format the date as "dd-MM-yyyy"
df = df.withColumn("formatted_date", date_format(df.date, "dd-MM-yyyy"))

df.show()
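
To make the pivot logic above easier to check without a Spark session, here is a rough plain-Python sketch of the same rule: the two-digit year of a reference date is the pivot, years up to it land in the current century, and later ones land in the previous century. The names parse_dd_mon_yy, MONTHS and the reference parameter are made up for this illustration. The month table assumes English three-letter abbreviations.

```python
from datetime import date

# English month abbreviations, spelled out so the result does not depend on the locale
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}


def parse_dd_mon_yy(date_str, reference=None):
    """Parse 'd-MMM-yy' using the year of `reference` (default: today) as the pivot."""
    reference = reference or date.today()
    day, mon, yy = date_str.split("-")
    pivot = reference.year % 100          # e.g. 23 for 2023
    century = reference.year - pivot      # e.g. 2000 for 2023
    # Years up to the pivot stay in the current century, later ones go back 100 years
    year = century + int(yy) if int(yy) <= pivot else century - 100 + int(yy)
    return date(year, MONTHS[mon], int(day))


# Fixed reference date so the result does not change over time
ref = date(2023, 6, 5)
for s in ["09-Aug-96", "05-Sep-23", "1-Jan-93", "4-Jun-03"]:
    print(s, "->", parse_dd_mon_yy(s, ref).strftime("%d-%m-%Y"))
```

Note that the comparison is by year only, so 05-Sep-23 is read as 2023 even when the reference date is earlier in 2023, which matches what the question asks for.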
