PySpark SQL: convert a table with an array of structs into columns

ijxebb2r · posted 2021-05-19 in Spark

My Hive table has two columns, (string, array<struct<type:string,cnt:int>>), like this:
| id  | params                                         |
|-----|------------------------------------------------|
| id1 | [{type=a,cnt=4},{type=b,cnt=2}]                |
| id2 | [{type=a,cnt=3},{type=c,cnt=1},{type=d,cnt=0}] |
| id3 | [{type=e,cnt=1}]                               |
I need to convert it into a table with separate int columns, where the column names are the `type` values and the cell values are the corresponding `cnt`:
| id  | a    | b    | c    | d    | e    |
|-----|------|------|------|------|------|
| id1 | 4    | 2    | null | null | null |
| id2 | 3    | null | 1    | 0    | null |
| id3 | null | null | null | null | 1    |
What is the most efficient way to do this conversion, in either Spark SQL or PySpark? Thank you.

af7jpaap 1#

Try this. I'm not sure whether you actually need the sum, but it seems safe to assume you do:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Some variation in your data
df = spark.createDataFrame([(1, ["type=AA, cnt=4", "type=B, cnt=2222"]),
                            (2, ["type=AA, cnt=3", "type=C, cnt=1", "type=D, cnt=0"]),
                            (3, ["type=E, cnt=1"])], ["id", "params"])

# Explode the array so each entry gets its own row
df = df.select(df.id, F.explode(df.params))

# Make separate cols, strip the leading "type=" / "cnt=" labels, and convert cnt to int
split_col = F.split(df['col'], ',')
df = df.withColumn('type', split_col.getItem(0)).withColumn('count', split_col.getItem(1)).drop('col')
df = df.withColumn('type', F.expr("substring(type, 6, length(type))")) \
       .withColumn('count', F.expr("substring(count, 6, length(count))").cast(IntegerType()))

# Pivot to your format
df.groupBy("id").pivot("type").agg(F.sum("count")).sort(F.col("id").asc()).show()
```

Returns:

```
+---+----+----+----+----+----+
| id|  AA|   B|   C|   D|   E|
+---+----+----+----+----+----+
|  1|   4|2222|null|null|null|
|  2|   3|null|   1|   0|null|
|  3|null|null|null|null|   1|
+---+----+----+----+----+----+
```
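
Since the question's column is a genuine array<struct<type:string,cnt:int>> rather than the strings simulated above, the string parsing can probably be skipped: explode the structs and read the fields by name. A minimal sketch, assuming the Hive table is readable as `your_hive_table` (a hypothetical name) with columns `id` and `params`:

```python
from pyspark.sql import functions as F

# Hypothetical table name; params is assumed to be a real
# array<struct<type:string,cnt:int>> column, so no string parsing is needed.
df = spark.table("your_hive_table")

# One row per struct, then pull the struct fields out by name
exploded = (df
            .select("id", F.explode("params").alias("p"))
            .select("id",
                    F.col("p.type").alias("type"),
                    F.col("p.cnt").alias("cnt")))

# Pivot: the distinct type values become columns, cnt fills the cells,
# and missing (id, type) combinations come out as null
exploded.groupBy("id").pivot("type").agg(F.sum("cnt")).sort("id").show()
```

If the set of types is known in advance, passing it explicitly, e.g. `pivot("type", ["a", "b", "c", "d", "e"])`, avoids the extra job Spark otherwise runs to discover the distinct values.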

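For the Spark SQL flavor the question also asks about: Spark 2.4+ has a PIVOT clause, but unlike the DataFrame API it cannot infer the pivoted values, so they must be listed. A sketch under the same assumptions (hypothetical table name `my_table`, type values a through e as in the question):

```python
# Spark SQL flavor; assumes a table/view named my_table and that the
# type values are known up front -- PIVOT needs an explicit IN list.
spark.sql("""
    SELECT * FROM (
        SELECT id, p.type AS type, p.cnt AS cnt
        FROM my_table
        LATERAL VIEW explode(params) t AS p
    )
    PIVOT (
        SUM(cnt) FOR type IN ('a', 'b', 'c', 'd', 'e')
    )
    ORDER BY id
""").show()
```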