python-3.x: How to split an array of tuples into multiple columns in Spark SQL

8iwquhpp  asked on 2022-12-15  in Python

I have a set of data in a pyspark DataFrame, as shown below.

+-----------------------------------------------------------------------------------+
|                             targeting_values                                      |
+-----------------------------------------------------------------------------------+
| [('123', '123', '123'), ('abc', 'def', 'ghi'), ('jkl', 'mno', 'pqr'), (0, 1, 2)]  |
+-----------------------------------------------------------------------------------+

I want to split it into 4 different columns, with one tuple in each column, like this:

+----------------------+----------------------+-----------------------+--------------------+
| value1               | value2               | value3                | value4             |
+----------------------+----------------------+-----------------------+--------------------+
| ('123', '123', '123')| ('abc', 'def', 'ghi')| ('jkl', 'mno', 'pqr') | (0, 1, 2)          |
+----------------------+----------------------+-----------------------+--------------------+

I tried to achieve this using split(), but had no luck, and I haven't found another way to solve it.
Is there a good way to do this?


omhiaaxx1#

You can do this by exploding the array and then pivoting it.
// First, create the data:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val arrayStructData = Seq(
  Row(List(Row("123", "123", "123"), Row("abc", "def", "ghi"), Row("jkl", "mno", "pqr"), Row("0", "1", "2"))),
  Row(List(Row("456", "456", "456"), Row("qsd", "fgh", "hjk"), Row("aze", "rty", "uio"), Row("4", "5", "6")))
)

val arrayStructSchema = new StructType()
  .add("targeting_values", ArrayType(new StructType()
    .add("_1", StringType)
    .add("_2", StringType)
    .add("_3", StringType)))

val df = spark.createDataFrame(spark.sparkContext
  .parallelize(arrayStructData), arrayStructSchema)

df.show(false)

+--------------------------------------------------------------+
|targeting_values                                              |
+--------------------------------------------------------------+
|[{123, 123, 123}, {abc, def, ghi}, {jkl, mno, pqr}, {0, 1, 2}]|
|[{456, 456, 456}, {qsd, fgh, hjk}, {aze, rty, uio}, {4, 5, 6}]|
+--------------------------------------------------------------+

// Then combine posexplode with a generated id, and pivot on the position, like this:

df.withColumn("id2", monotonically_increasing_id())
  .select(col("id2"), posexplode(col("targeting_values"))).withColumn("id", concat(lit("value"), col("pos") + 1))
  .groupBy("id2").pivot("id").agg(first("col")).drop("id2")
  .show(false)

+---------------+---------------+---------------+---------+
|value1         |value2         |value3         |value4   |
+---------------+---------------+---------------+---------+
|{123, 123, 123}|{abc, def, ghi}|{jkl, mno, pqr}|{0, 1, 2}|
|{456, 456, 456}|{qsd, fgh, hjk}|{aze, rty, uio}|{4, 5, 6}|
+---------------+---------------+---------------+---------+
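
Since the question is about PySpark, here is a minimal PySpark sketch of the same posexplode-and-pivot approach (the DataFrame construction below is my own assumption, mirroring the Scala sample data above):

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import (
    col, concat, first, lit, monotonically_increasing_id, posexplode
)

spark = SparkSession.builder.getOrCreate()

# Sample data: each row holds an array of 3-field string structs.
df = spark.createDataFrame([
    Row(targeting_values=[("123", "123", "123"), ("abc", "def", "ghi"),
                          ("jkl", "mno", "pqr"), ("0", "1", "2")]),
    Row(targeting_values=[("456", "456", "456"), ("qsd", "fgh", "hjk"),
                          ("aze", "rty", "uio"), ("4", "5", "6")]),
])

# Tag each row with an id, explode the array together with element
# positions, then pivot the positions into columns value1..value4.
result = (
    df.withColumn("id2", monotonically_increasing_id())
      .select("id2", posexplode("targeting_values"))
      .withColumn("id", concat(lit("value"), col("pos") + 1))
      .groupBy("id2").pivot("id").agg(first("col"))
      .drop("id2")
)
result.show(truncate=False)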

nnsrf1az2#

You can try this:

df.selectExpr([f"targeting_values[{i}] as value{i+1}" for i in range(4)])
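
Note that this indexes the array directly, so it assumes every array has exactly 4 elements. If the length varies across rows, one option (a sketch, assuming you can afford a separate action to find the maximum size) is to compute the column count first:

from pyspark.sql.functions import size, max as max_

# The longest array in the column determines how many value columns to create.
n = df.select(max_(size("targeting_values"))).first()[0]
df.selectExpr([f"targeting_values[{i}] as value{i+1}" for i in range(n)])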
