How to select within a GROUP BY using Spark SQL

vptzau2j · published 2021-05-29 in Spark

I am trying to use PySpark SQL to select rows that have a required value in a column, but I am getting an error.
Below is my `session` table:

```
session_id  status  session_desc
session1    Old     first_description
session1    Active  last_description
session1    Old     next_description
session1    Active  inter_description
session2    Old     next_description
session2    Old     inter_description
```

Below is my Spark SQL query:

spark.sql("select session_id, (CASE WHEN status='Active' THEN session_desc END) AS session_description from session group by session_id").show()

But I am getting the error below:

```
org.apache.spark.sql.AnalysisException: expression 'session.status' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
```

I need the output as shown below:

```
session_id  session_description
session1    last_description     # can be inter_description as well (I don't care which)
session2    null
```


igetnqfo1#

Put your case statement inside a subquery, then group by in the outer query. Example:

```
df.show()

+----------+------+-----------------+
|session_id|status|     session_Desc|
+----------+------+-----------------+
|  session1|   Old|first_description|
|  session1|Active| last_description|
|  session1|   Old| next_description|
|  session1|Active|inter_description|
|  session2|   Old| next_description|
|  session2|   Old|inter_description|
+----------+------+-----------------+

spark.sql("select session_id,last(session_desc)session_description from (
select session_id,case when status='Active' THEN session_desc END as session_desc from tmp)t
group by session_id").
show()

+----------+-------------------+
|session_id|session_description|
+----------+-------------------+
|  session1|  inter_description|
|  session2|               null|
+----------+-------------------+
```
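Note that last() (like first()) is non-deterministic when there is no ordering guarantee, which is acceptable here since the question accepts either Active description. The subquery can also be dropped entirely by wrapping the CASE in first() with its ignore-nulls flag, as the error message itself hints. A sketch, assuming the `df` and `session` view built above:

```
from pyspark.sql import functions as F

# SQL form: first(expr, true) skips the nulls the CASE produces for non-Active rows
spark.sql("""
    select session_id,
           first(case when status = 'Active' then session_desc end, true)
             as session_description
    from session
    group by session_id
""").show()

# Equivalent DataFrame API form
(df.groupBy("session_id")
   .agg(F.first(F.when(F.col("status") == "Active", F.col("session_desc")),
                ignorenulls=True)
         .alias("session_description"))
   .show())
```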
