>>> from pyspark import SparkContext
>>> sc = SparkContext.getOrCreate()
>>> from pyspark.sql import SparkSession
>>> spark_session = SparkSession(sc)
>>> mydf = spark_session.createDataFrame(data=[(2018, 1), (2019, 4), (2018, 3), (2019, 4), (2018, 2), (2020, 1), (2020, 4)], schema=['myYear', 'myMonth'])
>>> mydf
DataFrame[myYear: bigint, myMonth: bigint]
>>> mydf.show()
+------+-------+
|myYear|myMonth|
+------+-------+
| 2018| 1|
| 2019| 4|
| 2018| 3|
| 2019| 4|
| 2018| 2|
| 2020| 1|
| 2020| 4|
+------+-------+
And the versions:
$ pip3 list | grep 'spark'
pyspark 3.4.1
$ java -version
openjdk version "1.8.0_372"
OpenJDK Runtime Environment (build 1.8.0_372-b07)
OpenJDK 64-Bit Server VM (build 25.372-b07, mixed mode)
Now I want to create a timestamp column from these two columns, representing the year and month as a timestamp.
The Pandas equivalent would be:
>>> import pandas as pd
>>> mydf3_pd = pd.DataFrame(data=[(2018, 1), (2019, 4), (2018, 3), (2019, 4), (2018, 2), (2020, 1), (2020, 4)], columns=['myYear', 'myMonth'])
>>> mydf3_pd.loc[:,'year-month'] = pd.to_datetime(mydf3_pd.loc[:,'myYear'].astype(str) + mydf3_pd.loc[:,'myMonth'].astype(str), format='%Y%m')
>>> mydf3_pd
myYear myMonth year-month
0 2018 1 2018-01-01
1 2019 4 2019-04-01
2 2018 3 2018-03-01
3 2019 4 2019-04-01
4 2018 2 2018-02-01
5 2020 1 2020-01-01
6 2020 4 2020-04-01
>>> mydf3_pd.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 myYear 7 non-null int64
1 myMonth 7 non-null int64
2 year-month 7 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2)
I tried pyspark.sql.functions.to_timestamp and pyspark.sql.functions.to_date with format='yyyyMM', among other combinations, and each time got error messages I could not make sense of.
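One of the combinations I tried was along these lines (the exact variant may have differed); I suspect the non-zero-padded month is part of the problem, but the errors did not make that obvious to me:

from pyspark.sql import functions as F

# concatenate year and month into a 'yyyyMM'-style string and try to parse it;
# note the month is not zero-padded here, so e.g. (2018, 1) becomes '20181'
mydf.withColumn(
    'year-month',
    F.to_date(
        F.concat(F.col('myYear').cast('string'), F.col('myMonth').cast('string')),
        'yyyyMM'
    )
).show()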
Besides 'year-month', I would also like to do the same thing to get a 'year-quarter' column from an integer column holding the quarter number. How do I get both of them with PySpark?
1 Answer
Try the to_date() function in this case, with concat_ws('-', <cols>) to get the year and month separated by '-'. Example:
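A minimal sketch of that approach on the DataFrame above (the myQuarter column is derived from myMonth here only because the sample data has no quarter column; with a real quarter column q, its first month is (q - 1) * 3 + 1):

from pyspark.sql import functions as F

# year-month: join year and month with '-' and parse it; pattern 'M' accepts a
# one- or two-digit month, and the missing day-of-month defaults to the 1st
mydf2 = mydf.withColumn(
    'year-month',
    F.to_date(F.concat_ws('-', 'myYear', 'myMonth'), 'yyyy-M')
)

# year-quarter: derive a quarter column for the demo, map each quarter to its
# first month, and reuse the same concat_ws + to_date pattern
mydf2 = mydf2.withColumn('myQuarter', ((F.col('myMonth') - 1) / 3 + 1).cast('int'))
mydf2 = mydf2.withColumn(
    'year-quarter',
    F.to_date(
        F.concat_ws('-', F.col('myYear'), ((F.col('myQuarter') - 1) * 3 + 1).cast('string')),
        'yyyy-M'
    )
)

mydf2.show()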