My schema is as follows:
DataFrame[record_id: string, months: array<decimal(2,0)>, max_amount: decimal(12,2)]
The data looks like this:
+----------------+------------------+-------------------+
| record_id| months| max_amount|
+----------------+------------------+-------------------+
|3535345345345343| [4, 5, 6, 7, 9]| 17.33|
|3535345345345344| [7, 8, 9]| 9.57|
|3535345345345345| [4]| 1.00|
|3535345345345346|[4, 5, 6, 7, 8, 9]| 15.08|
|3535345345345347|[4, 5, 6, 7, 8, 9]| 17.11|
|3535345345345348| [4, 5, 7, 9]| 12.99|
|3535345345345349|[4, 5, 6, 7, 8, 9]| 16.95|
|3535345345345340| [4, 5, 6, 7, 8]| 12.99|
|3535345345345311|[4, 5, 6, 7, 8, 9]| 12.99|
|3535345345345542|[4, 5, 6, 7, 8, 9]| 12.99|
+----------------+------------------+-------------------+
I want to filter rows by a value contained in the months array (for example, get all rows whose months list contains the value 6). I tried the following, which works fine for string values:
import pyspark.sql.functions as sf
my_df.filter(sf.array_contains(my_df['months'], 6)).show()
But with this numeric array, I get the following error:
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(`months`, 6)' due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array<decimal(2,0)>, int].
I also tried isin(), but it didn't work. Do I have to modify the integer value passed as the second argument to array_contains() to make it work? Any advice would be appreciated.
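For illustration, this is the kind of change I have in mind, i.e. making the element type and the searched value the same type. A sketch only, not verified against this exact schema; it assumes Column.cast() accepts the DDL string 'array<int>' and that a SQL CAST inside expr() is allowed here:

import pyspark.sql.functions as sf

# Option A: cast the array elements to int so the plain integer 6 matches
my_df.filter(sf.array_contains(my_df['months'].cast('array<int>'), 6)).show()

# Option B: keep the decimal elements and cast the literal in a SQL expression instead
my_df.filter(sf.expr("array_contains(months, CAST(6 AS DECIMAL(2,0)))")).show()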