sql—使用mysql或sparkdataframe以不同的方式透视行

ijxebb2r  于 2021-05-27  发布在  Spark
关注(0)|答案(1)|浏览(383)

我有一个这样的表,我做正常的旋转,这是不给预期的结果。

+-----------------------------------------------------------+-----------------------------------------------------------------------------
|type                                                       |column_ref                                            |table_object         |
+-----------------------------------------------------------+-----------------------------------------------------------------------------
foreignKeyColumn                                             FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE       FRED.FRED.US_REGIONS  |
primaryKeyColumn                                             FRED.FRED.US_REGIONS.REGION_CODE                      FRED.FRED.US_REGIONS  |

foreignKeyColumn                                             FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE            FRED.FRED.US_REGIONS  |
primaryKeyColumn                                             FRED.FRED.US_REGIONS.REGION_CODE                      FRED.FRED.US_REGIONS  |

foreignKeyColumn                                             FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE      FRED.FRED.US_REGIONS  |
primaryKeyColumn                                             FRED.FRED.US_REGIONS.REGION_CODE                      FRED.FRED.US_REGIONS  |

foreignKeyColumn                                             FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE              FRED.FRED.US_REGIONS  |
primaryKeyColumn                                             FRED.FRED.US_REGIONS.REGION_CODE                      FRED.FRED.US_REGIONS  |
+-----------------------------------------------------------+-----------------------------------------------------------------------------

我想这样做:

+-----------------------------------------------------------+-----------------------------------------------------------------------------
|foreignKeyColumn                                           |primaryKeyColumn                                            |table_object   |
+-----------------------------------------------------------+-----------------------------------------------------------------------------
FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE             FRED.FRED.US_REGIONS.REGION_CODE                              FRED.FRED.US_REGIONS
FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE                  FRED.FRED.US_REGIONS.REGION_CODE                              FRED.FRED.US_REGIONS
FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE            FRED.FRED.US_REGIONS.REGION_CODE                              FRED.FRED.US_REGIONS
FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE                    FRED.FRED.US_REGIONS.REGION_CODE                              FRED.FRED.US_REGIONS

我试着这样做:

val grouped = df.groupBy("table_object").pivot("type").agg(first("column_ref"))

但它并没有像预期的那样给予。有人能建议怎样才能达到预期的效果吗?
普通的sql/df解决方案都会有帮助。

fkvaft9z

fkvaft9z1#

也许这是有帮助的- pivot + arrays_zip ```
df2.show(false)
df2.printSchema()
/**
* +----------------+------------------------------------------------+--------------------+
* |type |column_ref |table_object |
* +----------------+------------------------------------------------+--------------------+
* |foreignKeyColumn|FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE |FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* |foreignKeyColumn|FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE |FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* |foreignKeyColumn|FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE|FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* |foreignKeyColumn|FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE |FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* +----------------+------------------------------------------------+--------------------+
*
* root
* |-- type: string (nullable = true)
* |-- column_ref: string (nullable = true)
* |-- table_object: string (nullable = true)
*/
val p = df2
.groupBy("table_object")
.pivot("type")
.agg(collect_list("column_ref"))

p
  .withColumn("x", arrays_zip($"foreignKeyColumn", $"primaryKeyColumn"))
  .selectExpr("table_object", "inline_outer(x)" )
  .show(false)

/**
  * +--------------------+------------------------------------------------+--------------------------------+
  * |table_object        |foreignKeyColumn                                |primaryKeyColumn                |
  * +--------------------+------------------------------------------------+--------------------------------+
  * |FRED.FRED.US_REGIONS|FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE |FRED.FRED.US_REGIONS.REGION_CODE|
  * |FRED.FRED.US_REGIONS|FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE      |FRED.FRED.US_REGIONS.REGION_CODE|
  * |FRED.FRED.US_REGIONS|FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE|FRED.FRED.US_REGIONS.REGION_CODE|
  * |FRED.FRED.US_REGIONS|FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE        |FRED.FRED.US_REGIONS.REGION_CODE|
  * +--------------------+------------------------------------------------+--------------------------------+
  */

相关问题