PySpark SQL: converting a table with an array of structs into columns

Asked by ijxebb2r on 2021-05-19 in Spark

My Hive table has two columns, (string, array<struct<type:string,cnt:int>>), like this:
| id  | params |
| --- | ------ |
| id1 | [{type=a, cnt=4}, {type=b, cnt=2}] |
| id2 | [{type=a, cnt=3}, {type=c, cnt=1}, {type=d, cnt=0}] |
| id3 | [{type=e, cnt=1}] |
I need to convert it into a table with separate int columns, where the column names are the type values and the cell values are the corresponding cnt:
| id  | a    | b    | c    | d    | e    |
| --- | ---- | ---- | ---- | ---- | ---- |
| id1 | 4    | 2    | null | null | null |
| id2 | 3    | null | 1    | 0    | null |
| id3 | null | null | null | null | 1    |
What is the most efficient way to do this conversion, in either Spark SQL or PySpark style? Thank you.


af7jpaap #1

Try this. I'm not sure whether you need the sum, but it seems safe to assume so:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Some variation in your data

df = spark.createDataFrame([(1, ["type=AA, cnt=4", "type=B, cnt=2222"]),
                            (2, ["type=AA, cnt=3", "type=C, cnt=1", "type=D, cnt=0"]),
                            (3, ["type=E, cnt=1"])],["id", "params"])

# Explode: one row per array element (the new column is named 'col' by default)

df = df.select(df.id, F.explode(df.params))

# Make separate cols, strip the leading labels, and convert the count to Int

split_col = F.split(df['col'], ',')
df = df.withColumn('type', split_col.getItem(0)).withColumn('count', split_col.getItem(1)).drop('col')
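# substring() in Spark SQL is 1-based: starting at position 6 drops the
# 5-character "type=" prefix (and the leading " cnt=", also 5 characters
# counting the space) from each piece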
df = df.withColumn('type', F.expr("substring(type, 6, length(type))")) \
       .withColumn('count', F.expr("substring(count, 6, length(count))").cast(IntegerType()))

# Pivot to your format

df.groupBy("id").pivot("type").agg(F.sum("count")).sort(F.col("id").asc()).show()

Returns:

+---+----+----+----+----+----+
| id|  AA|   B|   C|   D|   E|
+---+----+----+----+----+----+
|  1|   4|2222|null|null|null|
|  2|   3|null|   1|   0|null|
|  3|null|null|null|null|   1|
+---+----+----+----+----+----+
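Note that the sample above models params as plain strings, which is why the split/substring parsing is needed. If the Hive column really has the declared array<struct<type:string,cnt:int>> type, you can read the struct fields directly after the explode and drop the parsing step entirely. A minimal PySpark sketch under that assumption (the sample data here is hypothetical):

from pyspark.sql import functions as F

# Sample data with the question's actual schema:
# params is a genuine array<struct<type:string,cnt:int>>
df = spark.createDataFrame(
    [("id1", [("a", 4), ("b", 2)]),
     ("id2", [("a", 3), ("c", 1), ("d", 0)]),
     ("id3", [("e", 1)])],
    "id string, params array<struct<type:string,cnt:int>>")

# Explode the array, then read the struct fields directly -- no string parsing
exploded = df.select("id", F.explode("params").alias("p")) \
             .select("id", F.col("p.type").alias("type"), F.col("p.cnt").alias("cnt"))

exploded.groupBy("id").pivot("type").agg(F.sum("cnt")).orderBy("id").show()

If the set of types is known up front, passing it explicitly, e.g. .pivot("type", ["a", "b", "c", "d", "e"]), saves Spark a separate pass over the data to collect the distinct values.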

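Since the question also asked for a Spark SQL flavor: the same transform can be written with LATERAL VIEW explode plus the PIVOT clause (available since Spark 2.4). A sketch, reusing the struct-typed df from above and registering it under the hypothetical name my_table; note that PIVOT requires the type values to be listed explicitly:

# Spark SQL flavor: LATERAL VIEW explode + the PIVOT clause (Spark 2.4+)
# "my_table" is a hypothetical name standing in for the question's Hive table
df.createOrReplaceTempView("my_table")

spark.sql("""
    SELECT *
    FROM (
      SELECT id, p.type AS type, p.cnt AS cnt
      FROM my_table
      LATERAL VIEW explode(params) t AS p
    )
    PIVOT (
      SUM(cnt) FOR type IN ('a', 'b', 'c', 'd', 'e')
    )
    ORDER BY id
""").show()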