python - PySpark: create a timestamp column from two numeric columns containing year and month

vs3odd8k  posted 2023-08-02  in Python
>>> from pyspark import SparkContext
>>> sc = SparkContext.getOrCreate()
>>> from pyspark.sql import SparkSession
>>> spark_session = SparkSession(sc)
>>> mydf = spark_session.createDataFrame(data=[(2018, 1), (2019, 4), (2018, 3), (2019, 4), (2018, 2), (2020, 1), (2020, 4)], schema=['myYear', 'myMonth'])
>>> mydf
DataFrame[myYear: bigint, myMonth: bigint]
>>> mydf.show()
+------+-------+
|myYear|myMonth|
+------+-------+
|  2018|      1|
|  2019|      4|
|  2018|      3|
|  2019|      4|
|  2018|      2|
|  2020|      1|
|  2020|      4|
+------+-------+

And the versions in use:

$ pip3 list | grep 'spark'
pyspark             3.4.1
$ java -version
openjdk version "1.8.0_372"
OpenJDK Runtime Environment (build 1.8.0_372-b07)
OpenJDK 64-Bit Server VM (build 25.372-b07, mixed mode)


Now I want to create, from these two columns, a timestamp column representing the year and month.
The pandas equivalent would be:

>>> import pandas as pd
>>> mydf3_pd = pd.DataFrame(data=[(2018, 1), (2019, 4), (2018, 3), (2019, 4), (2018, 2), (2020, 1), (2020, 4)], columns=['myYear', 'myMonth'])
>>> mydf3_pd.loc[:,'year-month'] = pd.to_datetime(mydf3_pd.loc[:,'myYear'].astype(str) + mydf3_pd.loc[:,'myMonth'].astype(str), format='%Y%m')
>>> mydf3_pd
   myYear    myMonth year-month
0    2018          1 2018-01-01
1    2019          4 2019-04-01
2    2018          3 2018-03-01
3    2019          4 2019-04-01
4    2018          2 2018-02-01
5    2020          1 2020-01-01
6    2020          4 2020-04-01
>>> mydf3_pd.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   myYear      7 non-null      int64         
 1   myMonth     7 non-null      int64         
 2   year-month  7 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int64(2)


I tried combinations of pyspark.sql.functions.to_timestamp, pyspark.sql.functions.to_date, format='yyyyMM' and the like, and each time got error messages I could not make sense of.
Besides "year-month", I would also like to do the same for a "year-quarter" column derived from an integer column containing the quarter number. How can I get both of them with PySpark?
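
The pandas logic I have in mind for the quarter case would be something like the sketch below (myQuarter is a placeholder column name, not in the data above): each quarter is mapped to its first month.

>>> # Hypothetical quarter data; 'myQuarter' is a placeholder name.
>>> mydf4_pd = pd.DataFrame(data=[(2018, 1), (2019, 4), (2020, 2)], columns=['myYear', 'myQuarter'])
>>> # First month of each quarter: Q1 -> 1, Q2 -> 4, Q3 -> 7, Q4 -> 10.
>>> months = (mydf4_pd['myQuarter'] - 1) * 3 + 1
>>> mydf4_pd['year-quarter'] = pd.to_datetime(mydf4_pd['myYear'].astype(str) + '-' + months.astype(str), format='%Y-%m')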

flvlnr44  Answer #1

Try using the to_date() function in this case.

  • concat_ws('-', <cols>): the year and month joined with a '-' separator.
    Example:
from pyspark.sql.functions import expr

# LEGACY restores the pre-Spark-3.0 datetime parsing behaviour.
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.createDataFrame(data=[(2018, 1), (2019, 4), (2018, 3), (2019, 4), (2018, 2), (2020, 1), (2020, 4)], schema=['myYear', 'myMonth'])

# Concatenate year and month as 'yyyy-M' strings and let to_date() parse them.
df.withColumn("year-month", expr("to_date(concat_ws('-', myYear, myMonth))")).show(10, False)
#+------+-------+----------+
#|myYear|myMonth|year-month|
#+------+-------+----------+
#|2018  |1      |2018-01-01|
#|2019  |4      |2019-04-01|
#|2018  |3      |2018-03-01|
#|2019  |4      |2019-04-01|
#|2018  |2      |2018-02-01|
#|2020  |1      |2020-01-01|
#|2020  |4      |2020-04-01|
#+------+-------+----------+
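
The question also asks for a "year-quarter" column, which the above does not cover. A minimal sketch, assuming Spark 3.0+ (where the SQL function make_date() is available) and a hypothetical myQuarter column: make_date() assembles the date directly from integer parts, so the legacy parser setting is not needed.

# Year-month without the legacy parser: build the date from integer parts.
df.withColumn("year-month", expr("make_date(myYear, myMonth, 1)")).show(10, False)

# Hypothetical quarter data; 'myQuarter' is not in the original post.
df2 = spark.createDataFrame(data=[(2018, 1), (2019, 4), (2020, 2)], schema=['myYear', 'myQuarter'])
# First month of the quarter: Q1 -> 1, Q2 -> 4, Q3 -> 7, Q4 -> 10.
df2.withColumn("year-quarter", expr("make_date(myYear, (myQuarter - 1) * 3 + 1, 1)")).show(10, False)
#+------+---------+------------+
#|myYear|myQuarter|year-quarter|
#+------+---------+------------+
#|2018  |1        |2018-01-01  |
#|2019  |4        |2019-10-01  |
#|2020  |2        |2020-04-01  |
#+------+---------+------------+

If a TimestampType column is needed rather than a DateType one, wrap the expression in to_timestamp() or add .cast('timestamp').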

