I have a PySpark DataFrame that looks like this:
+--------+----------+---------+----------+-----------+--------------------+
|order_id|product_id|seller_id| date|pieces_sold| bill_raw_text|
+--------+----------+---------+----------+-----------+--------------------+
| 668| 886059| 3205|2015-01-14| 91|pbdbzvpqzqvtzxone...|
| 6608| 541277| 1917|2012-09-02| 44|cjucgejlqnmfpfcmg...|
| 12962| 613131| 2407|2016-08-26| 90|cgqhggsjmrgkrfevc...|
| 14223| 774215| 1196|2010-03-04| 46|btujmkfntccaewurg...|
| 15131| 769255| 1546|2018-11-28| 13|mrfsamfuhpgyfjgki...|
| 15625| 86357| 2455|2008-04-18| 50|wlwsliatrrywqjrih...|
| 18470| 26238| 295|2009-03-06| 86|zrfdpymzkgbgdwFwz...|
| 29883| 995036| 4596|2009-10-25| 86|oxcutwmqgmioaelsj...|
| 38428| 193694| 3826|2014-01-26| 82|yonksvwhrfqkytypr...|
| 41023| 949332| 4158|2014-09-03| 83|hubxhfdtxrqsfotdq...|
+--------+----------+---------+----------+-----------+--------------------+
I want to create two new columns: one for the quarter and one for the week of the year. Here is what I tried, following the documentation for weekofyear and quarter:
from pyspark.sql import functions as F

sales_table = sales_table.withColumn(
    "week_year",
    F.date_format(F.to_date("date", "yyyy-mm-dd"), F.weekofyear("d")),
)
sales_table = sales_table.withColumn(
    "quarter",
    F.date_format(F.to_date("date", "yyyy-mm-dd"), F.quarter("d")),
)
sales_table.show(10)
Here is the error:
Column is not iterable
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 945, in date_format
return Column(sc._jvm.functions.date_format(_to_java_column(date), format))
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1296, in __call__
args_command, temp_args = self._build_args(*args)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1260, in _build_args
(new_args, temp_args) = self._get_args(args)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1247, in _get_args
temp_arg = converter.convert(arg, self.gateway_client)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 510, in convert
for element in object:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/column.py", line 353, in __iter__
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
How can I create and append these two columns? Is there a better or more efficient way to create them, without having to convert the date column to yyyy-mm-dd format each time, and ideally creating both columns in a single command?
2 Answers

von4xj4u1#

You can simply use the functions directly on the string date column.
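A minimal sketch of that, assuming the same sales_table as in the question (column names taken from the DataFrame shown above):

from pyspark.sql import functions as F

# quarter and weekofyear accept the yyyy-MM-dd string column directly,
# so both columns can be added in one chained expression.
sales_table = sales_table \
    .withColumn("quarter", F.quarter(F.col("date"))) \
    .withColumn("week_year", F.weekofyear(F.col("date")))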
iszxjhcz2#

You don't need date_format here at all. Since the date column is already in yyyy-MM-dd format, you can apply weekofyear and quarter directly to it. (date_format expects a format string as its second argument, which is why passing a Column such as F.weekofyear(...) there raises "Column is not iterable".) Example:
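A minimal sketch of this approach (assuming sales_table is the DataFrame from the question; passing the column name as a string is equivalent to passing F.col("date")):

from pyspark.sql import functions as F

# No to_date/date_format conversion is needed: weekofyear and quarter
# interpret the yyyy-MM-dd string column as a date.
sales_table = (
    sales_table
    .withColumn("week_year", F.weekofyear("date"))
    .withColumn("quarter", F.quarter("date"))
)
sales_table.show(10)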