I am trying to split a record with nested data into multiple records.
df = spark.createDataFrame([
    ('1', '[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1},{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]'),
    ('2', '[{price:100, quantity:1}]')
], ['id', 'data'])
The input data looks like:
id,data
1,[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1},{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]
2,[{price:100, quantity:1}]
If the array column contains more than 5 records, the record needs to be split, with an id2 assigned to each resulting row:
id,id2,data
1,1,[{price:100, quantity:1},{price:200, quantity:2},{price:900, quantity:3},{price:500, quantity:5},{price:100, quantity:1}]
1,2,[{price:800, quantity:8},{price:700, quantity:7},{price:600, quantity:6}]
2,1,[{price:100, quantity:1}]
I tried exploding the array column, but I get a new row per element, i.e. for id 1 I get 8 rows instead of 2.
How can I explode it so that each row contains at most 5 records from the array?
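For reference, a minimal sketch of the plain-explode attempt described above, assuming the data column has already been parsed into an array column (in the createDataFrame above it is still a plain string):

from pyspark.sql import functions as F

# A plain explode emits one output row per array element,
# so id 1 ends up with 8 rows instead of the desired 2.
exploded = df.withColumn("data", F.explode("data"))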
1 Answer
For Spark 2.4+, you can use the Spark SQL built-in functions sequence + transform together with some arithmetic on the array indices:
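The answer's own code is not included above; the following is a minimal sketch of that approach, assuming the data column has already been parsed into a real array column (the createDataFrame above stores it as a plain string, so a real pipeline would first parse it, e.g. with from_json and a matching schema) and using the chunk size of 5 from the question:

from pyspark.sql import functions as F

# Build an array of chunks: chunk i covers elements (i-1)*5+1 .. i*5
# (slice is 1-based; the last chunk may hold fewer than 5 elements).
chunked = df.withColumn(
    "chunks",
    F.expr("""
        transform(
            sequence(1, cast(ceil(size(data) / 5.0) as int)),
            i -> slice(data, (i - 1) * 5 + 1, 5)
        )
    """),
)

# posexplode yields one row per chunk plus its 0-based position,
# which becomes the 1-based id2.
result = (
    chunked
    .select("id", F.posexplode("chunks").alias("pos", "data"))
    .withColumn("id2", F.col("pos") + 1)
    .drop("pos")
    .select("id", "id2", "data")
)

With the sample data this would give two rows for id 1 (id2 = 1 with 5 elements, id2 = 2 with the remaining 3) and a single row for id 2, matching the expected output shown in the question.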