I have the following dataframe with several columns that contain arrays (we are on Spark 1.6); a sketch that reproduces it is shown after the table:
+--------------------+--------------+------------------+--------------+--------------------+-------------+
| UserName| col1 | col2 |col3 |col4 |col5 |
+--------------------+--------------+------------------+--------------+--------------------+-------------+
|foo |[Main, Indi...|[1777203, 1777203]| [GBP, GBP]| [CR, CR]| [143, 143]|
+--------------------+--------------+------------------+--------------+--------------------+-------------+
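For reference, a minimal sketch that rebuilds this input (the literal values are read off the tables in this post and are only illustrative; sqlContext is assumed to be the existing SQLContext):

import sqlContext.implicits._

// Hypothetical reconstruction of the single input row shown above (Spark 1.6 style)
val sourceDF = Seq(
  ("foo",
   Seq("Main", "Individual"),
   Seq(1777203, 1777203),
   Seq("GBP", "GBP"),
   Seq("CR", "CR"),
   Seq(143, 143))
).toDF("UserName", "col1", "col2", "col3", "col4", "col5")

// Register it so the SQL query below can refer to it by name
sourceDF.registerTempTable("sourceDF")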
I expect the following result:
+--------------------+--------------+------------------+--------------+--------------------+-------------+
| UserName| explod | explod2 |explod3 |explod4 |explod5 |
+--------------------+--------------+------------------+--------------+--------------------+-------------+
|NNNNNNNNNNNNNNNNN...| Main |1777203 | GBP | CR | 143 |
|NNNNNNNNNNNNNNNNN...|Individual |1777203 | GBP | CR | 143 |
+--------------------+--------------+------------------+--------------+--------------------+-------------+
I tried a lateral view:
sqlContext.sql("SELECT `UserName`, explod, explod2, explod3, explod4, explod5 FROM sourceDF
LATERAL VIEW explode(`col1`) sourceDF AS explod
LATERAL VIEW explode(`col2`) explod AS explod2
LATERAL VIEW explode(`col3`) explod2 AS explod3
LATERAL VIEW explode(`col4`) explod3 AS explod4
LATERAL VIEW explode(`col5`) explod4 AS explod5")
But I get a Cartesian product with lots of duplicates. I also tried the same thing by exploding all of the columns with the withColumn method, but I still get lots of duplicates:
.withColumn("col1", explode($"col1"))...
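Spelled out, that attempt looks roughly like this (a sketch only; sourceDF is the dataframe above):

import org.apache.spark.sql.functions.{col, explode}

// Each explode multiplies the current row count by the array length,
// so five independent explodes over 2-element arrays give 2^5 = 32 rows
// (a cross product) instead of the 2 index-aligned rows that are wanted.
val exploded = sourceDF
  .withColumn("col1", explode(col("col1")))
  .withColumn("col2", explode(col("col2")))
  .withColumn("col3", explode(col("col3")))
  .withColumn("col4", explode(col("col4")))
  .withColumn("col5", explode(col("col5")))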
Of course I could run a distinct on the final dataframe, but that is not an elegant solution. Is there any way to explode the columns without producing all of these duplicates?
Thanks!
1 Answer
If you are using Spark 2.4.0 or later, arrays_zip makes the task much easier:
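A minimal sketch of that approach, assuming a DataFrame sourceDF with the columns shown in the question (Spark 2.4+; the alias names are taken from the expected output):

import org.apache.spark.sql.functions.{arrays_zip, col, explode}

// Zip the five arrays element-wise into one array of structs and explode
// that single array: one output row per index, with no cross product.
// The struct fields produced by arrays_zip follow the source column names.
val result = sourceDF
  .withColumn("zipped", explode(arrays_zip(
    col("col1"), col("col2"), col("col3"), col("col4"), col("col5"))))
  .select(
    col("UserName"),
    col("zipped.col1").as("explod"),
    col("zipped.col2").as("explod2"),
    col("zipped.col3").as("explod3"),
    col("zipped.col4").as("explod4"),
    col("zipped.col5").as("explod5"))

result.show(false)

With the sample row above, this should produce the two rows shown in the expected result, one for Main and one for Individual.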