PySpark: the difference between Pivot and GROUP BY in SQL

n3h0vuf2 posted on 2023-10-15 in Spark

I have the following data:

| OmgevingID | AdministratieKantoorID | WerkgeverID | AdministratieID | JaarID | VolgnummerRun | PersoneelsnummerVerloning | Periode | Component | Tabel | Datum     | 
|------------|------------------------|-------------|-----------------|--------|---------------|---------------------------|---------|-----------|-------|-----------|
| 28         | 1                      | 19          | 243             | 2022   | 34            | 33                        | 6       | 70        | 5557.0| 2023-09-20|
| 28         | 1                      | 19          | 243             | 2022   | 34            | 33                        | 6       | 260       | 3990.0| 2023-09-20|
| 28         | 1                      | 19          | 243             | 2022   | 34            | 40                        | 6       | 70        | 2610.0| 2023-09-20|
| 28         | 1                      | 19          | 243             | 2022   | 34            | 40                        | 6       | 260       | 2475.0| 2023-09-20|

Now I want to reshape it so that there is a separate column for each Component value, holding the corresponding value from the Tabel column. Below is an example of the expected output.

| OmgevingID | AdministratieKantoorID | WerkgeverID | AdministratieID | JaarID | VolgnummerRun | PersoneelsnummerVerloning | Periode | Component_70_tabel | Component_260_tabel | Datum     | 
|------------|------------------------|-------------|-----------------|--------|---------------|---------------------------|---------|--------------------|---------------|-----------|
| 28         | 1                      | 19          | 243             | 2022   | 34            | 33                        | 6       | 5557.0             | 3990.0        | 2023-09-20|
| 28         | 1                      | 19          | 243             | 2022   | 34            | 40                        | 6       | 2610.0             | 2475.0        | 2023-09-20|

I achieved this with the following query:

SELECT
    OmgevingID,
    AdministratieKantoorID,
    WerkgeverID,
    AdministratieID,
    JaarID,
    VolgnummerRun,
    PersoneelsnummerVerloning,
    Periode,
    MAX(CASE WHEN Component = 70  THEN Tabel END) AS Component_70_tabel,
    MAX(CASE WHEN Component = 260 THEN Tabel END) AS Component_260_tabel,
    Datum
FROM
    YourTableName
GROUP BY
    OmgevingID,
    AdministratieKantoorID,
    WerkgeverID,
    AdministratieID,
    JaarID,
    VolgnummerRun,
    PersoneelsnummerVerloning,
    Periode,
    Datum;

But I am looking for a simpler, more intuitive way, perhaps a pivot? To start with, the MAX(CASE WHEN ...) syntax feels a bit odd, because the combination of the OmgevingID, AdministratieKantoorID, WerkgeverID, JaarID, VolgnummerRun, PersoneelsnummerVerloning, Periode and Component fields is a composite primary key. So each group can only ever contain a single Tabel value, which makes the MAX feel strange. I am using Spark SQL, but if there is a clean way to do it in PySpark, that is also an option.
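(Because each group contains exactly one row per component, any aggregate that picks the single non-null value gives the same result as MAX. A minimal sketch of the same query using Spark SQL's first with ignoreNulls, purely to illustrate that MAX only acts as a "pick the one value" here:)

```sql
-- Equivalent to the MAX version: each group has at most one non-null Tabel value
-- per component, so first(..., true) (ignore nulls) simply picks that value.
SELECT
    OmgevingID, AdministratieKantoorID, WerkgeverID, AdministratieID,
    JaarID, VolgnummerRun, PersoneelsnummerVerloning, Periode,
    first(CASE WHEN Component = 70  THEN Tabel END, true) AS Component_70_tabel,
    first(CASE WHEN Component = 260 THEN Tabel END, true) AS Component_260_tabel,
    Datum
FROM
    YourTableName
GROUP BY
    OmgevingID, AdministratieKantoorID, WerkgeverID, AdministratieID,
    JaarID, VolgnummerRun, PersoneelsnummerVerloning, Periode, Datum;
```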

ia2d9nvy1#

In PySpark, we can use groupBy + pivot to reshape the DataFrame.
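For reference, a minimal setup sketch that recreates the sample data from the question as a DataFrame named df (the column types are assumptions based on the tables above, with Datum kept as a plain string):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows copied from the question.
df = spark.createDataFrame(
    [
        (28, 1, 19, 243, 2022, 34, 33, 6, 70,  5557.0, "2023-09-20"),
        (28, 1, 19, 243, 2022, 34, 33, 6, 260, 3990.0, "2023-09-20"),
        (28, 1, 19, 243, 2022, 34, 40, 6, 70,  2610.0, "2023-09-20"),
        (28, 1, 19, 243, 2022, 34, 40, 6, 260, 2475.0, "2023-09-20"),
    ],
    ["OmgevingID", "AdministratieKantoorID", "WerkgeverID", "AdministratieID",
     "JaarID", "VolgnummerRun", "PersoneelsnummerVerloning", "Periode",
     "Component", "Tabel", "Datum"],
)
```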

from pyspark.sql import functions as F

# every column that should stay as-is becomes a grouping key
keys = ['OmgevingID', 'AdministratieKantoorID', 'WerkgeverID',
        'AdministratieID', 'JaarID', 'VolgnummerRun',
        'PersoneelsnummerVerloning', 'Periode', 'Datum']

df1 = (
    df
    # turn the component codes into the desired column names, e.g. 70 -> 'Component_70_Tabel'
    .withColumn('Component', F.expr("'Component_' || Component || '_Tabel'"))
    .groupBy(keys)
    .pivot('Component')       # one output column per distinct Component value
    .agg(F.max('Tabel'))      # each group holds exactly one Tabel value, so max just picks it
)
df1.show()

+----------+----------------------+-----------+---------------+------+-------------+-------------------------+-------+----------+-------------------+------------------+
|OmgevingID|AdministratieKantoorID|WerkgeverID|AdministratieID|JaarID|VolgnummerRun|PersoneelsnummerVerloning|Periode|     Datum|Component_260_Tabel|Component_70_Tabel|
+----------+----------------------+-----------+---------------+------+-------------+-------------------------+-------+----------+-------------------+------------------+
|        28|                     1|         19|            243|  2022|           34|                       33|      6|2023-09-20|             3990.0|            5557.0|
|        28|                     1|         19|            243|  2022|           34|                       40|      6|2023-09-20|             2475.0|            2610.0|
+----------+----------------------+-----------+---------------+------+-------------+-------------------------+-------+----------+-------------------+------------------+
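Since the question mentions Spark SQL: Spark (2.4+) also supports a PIVOT clause directly in SQL, which avoids spelling out the MAX(CASE WHEN ...) pattern. A sketch against the same placeholder table name used in the question; every column not referenced inside the PIVOT clause becomes an implicit grouping column:

```sql
SELECT *
FROM YourTableName
PIVOT (
    MAX(Tabel)
    FOR Component IN (70 AS Component_70_tabel, 260 AS Component_260_tabel)
);
```

In the PySpark version, the component values can also be listed explicitly, e.g. .pivot('Component', ['Component_70_Tabel', 'Component_260_Tabel']), which skips the extra job Spark would otherwise run to discover the distinct values and fixes the column order.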
