如何在spark中将一个分割成多个?

6ie5vjzr  于 2021-05-22  发布在  Spark
关注(0)|答案(1)|浏览(799)

我正在尝试将具有嵌套数据的记录拆分为多个记录。

df = spark.createDataFrame([('1','[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1},{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]'),('2','[{price:100, quantity:1}]')],['id','data'])

输入数据看起来像

id,data
1,[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1},{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]
2,[{price:100, quantity:1}]

如果数组列包含5条以上的记录,则需要拆分记录,并为每行提供和id2

id,id2,data
1,1,[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1}]
1,2,[{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]
2,1,[{price:100, quantity:1}]

我尝试分解数组列,但得到每个元素的新行,即对于id 1,得到8行而不是2行。
如何使其分解,使每行在数组中至少包含5条记录?

qncylg1j

qncylg1j1#

对于spark 2.4+,可以使用sparksql builitin函数sequence+transform并对数组索引进行一些计算:

from pyspark.sql import functions as F

df = spark.createDataFrame([('1','[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1},{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]'),('2','[{price:100, quantity:1}]')],['id','data'])

N = 5

# for data column, convert String into array of structs

df1 = df.withColumn("data", F.from_json("data", "array<struct<price:int,quantity:int>>",{"allowUnquotedFieldNames":"True"}))

df1.selectExpr("id", f"""
    inline_outer(
      transform(
        sequence(1,ceil(size(data)/{N})), i ->
        (i as id2, slice(data,(i-1)*{N}+1,{N}) as data)
      )
    )
 """).show(truncate=False)
+---+---+--------------------------------------------------+
|id |id2|data                                              |
+---+---+--------------------------------------------------+
|1  |1  |[[100, 1], [200, 2], [900, 3], [500, 5], [100, 1]]|
|1  |2  |[[800, 8], [700, 7], [600, 6]]                    |
|2  |1  |[[100, 1]]                                        |
+---+---+--------------------------------------------------+

相关问题