I'm trying to use PySpark SQL to select rows that have a required value in a column, but it's giving me an error.
Below is my table `session`:
```
session_id   status   session_desc
session1     Old      first_description
session1     Active   last_description
session1     Old      next_description
session1     Active   inter_description
session2     Old      next_description
session2     Old      inter_description
```
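For anyone who wants to reproduce this, something like the following should set up the data as a temp view (a minimal sketch, assuming a local SparkSession; the view name `session` matches the query below):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [
    ("session1", "Old", "first_description"),
    ("session1", "Active", "last_description"),
    ("session1", "Old", "next_description"),
    ("session1", "Active", "inter_description"),
    ("session2", "Old", "next_description"),
    ("session2", "Old", "inter_description"),
]
# register the data as the "session" view used by the SQL below
spark.createDataFrame(rows, ["session_id", "status", "session_desc"]) \
     .createOrReplaceTempView("session")
```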
Below is my Spark SQL query:

```
spark.sql("select session_id, (CASE WHEN status='Active' THEN session_desc END) AS session_description from session group by session_id").show()
```
But I'm getting the error below:

```
org.apache.spark.sql.AnalysisException: expression 'session.status' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
```
I need output like the following:

```
session_id   session_description
session1     last_description      # can be inter_description as well (I don't care)
session2     null
```
1 Answer
Use your case statement inside a subquery, then group by in the outer query. Example:
```
df.show()
+----------+------+-----------------+
|session_id|status|     session_Desc|
+----------+------+-----------------+
|  session1|   Old|first_description|
|  session1|Active| last_description|
|  session1|   Old| next_description|
|  session1|Active|inter_description|
|  session2|   Old| next_description|
|  session2|   Old|inter_description|
+----------+------+-----------------+

spark.sql("""
    select session_id, last(session_desc) as session_description
    from (
        select session_id,
               case when status = 'Active' then session_desc end as session_desc
        from tmp
    ) t
    group by session_id
""").show()

+----------+-------------------+
|session_id|session_description|
+----------+-------------------+
|  session1|  inter_description|
|  session2|               null|
+----------+-------------------+
```
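For completeness, the same logic can be written with the DataFrame API instead of SQL — a sketch assuming the `df` shown above. One caveat: SQL `last()` does not skip nulls by default, so it can return null for session1 depending on row order; passing `ignorenulls=True` (or `last(session_desc, true)` in SQL) makes it skip the nulls produced by the case expression:

```
from pyspark.sql import functions as F

result = df.groupBy("session_id").agg(
    F.last(
        # when() yields null for non-Active rows, mirroring the CASE expression
        F.when(F.col("status") == "Active", F.col("session_desc")),
        ignorenulls=True,  # skip those nulls so an Active description wins when one exists
    ).alias("session_description")
)
result.show()
```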