我有一个这样的表,我做正常的旋转,这是不给预期的结果。
+-----------------------------------------------------------+-----------------------------------------------------------------------------
|type |column_ref |table_object |
+-----------------------------------------------------------+-----------------------------------------------------------------------------
foreignKeyColumn FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE FRED.FRED.US_REGIONS |
primaryKeyColumn FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS |
foreignKeyColumn FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE FRED.FRED.US_REGIONS |
primaryKeyColumn FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS |
foreignKeyColumn FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE FRED.FRED.US_REGIONS |
primaryKeyColumn FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS |
foreignKeyColumn FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE FRED.FRED.US_REGIONS |
primaryKeyColumn FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS |
+-----------------------------------------------------------+-----------------------------------------------------------------------------
我想这样做:
+-----------------------------------------------------------+-----------------------------------------------------------------------------
|foreignKeyColumn |primaryKeyColumn |table_object |
+-----------------------------------------------------------+-----------------------------------------------------------------------------
FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS
FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS
FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS
FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE FRED.FRED.US_REGIONS.REGION_CODE FRED.FRED.US_REGIONS
我试着这样做:
val grouped = df.groupBy("table_object").pivot("type").agg(first("column_ref"))
但它并没有像预期的那样给予。有人能建议怎样才能达到预期的效果吗?
普通的sql/df解决方案都会有帮助。
1条答案
按热度按时间fkvaft9z1#
也许这是有帮助的-
pivot
+arrays_zip
```df2.show(false)
df2.printSchema()
/**
* +----------------+------------------------------------------------+--------------------+
* |type |column_ref |table_object |
* +----------------+------------------------------------------------+--------------------+
* |foreignKeyColumn|FRED.FRED.BACHELOR_DEGREE_OR_HIGHER.REGION_CODE |FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* |foreignKeyColumn|FRED.FRED.MEAN_REAL_WAGES_COLA.REGION_CODE |FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* |foreignKeyColumn|FRED.FRED.PER_CAPITA_PERSONAL_INCOME.REGION_CODE|FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* |foreignKeyColumn|FRED.FRED.HOMEOWNERSHIP_RATE.REGION_CODE |FRED.FRED.US_REGIONS|
* |primaryKeyColumn|FRED.FRED.US_REGIONS.REGION_CODE |FRED.FRED.US_REGIONS|
* +----------------+------------------------------------------------+--------------------+
*
* root
* |-- type: string (nullable = true)
* |-- column_ref: string (nullable = true)
* |-- table_object: string (nullable = true)
*/
val p = df2
.groupBy("table_object")
.pivot("type")
.agg(collect_list("column_ref"))