Adding a column to a multi-level nested struct in PySpark

epggiuax  posted on 2023-01-29 in Spark

I have a PySpark DataFrame with the following structure.
Current schema:

root
 |-- ID
 |-- Information
 |   |-- Name
 |   |-- Age
 |   |-- Gender
 |-- Description

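I want to add firstName and lastName under Information.Name.
Is there a way to add new columns in PySpark so that the result is a multi-level struct type?
Expected schema: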

root
 |-- ID
 |-- Information
 |   |-- Name
 |   |   |-- firstName
 |   |   |-- lastName
 |   |-- Age
 |   |-- Gender
 |-- Description

lmyy7pcs1#

Use withField on the struct column; this will work:

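from pyspark.sql import functions as F

# Replace Information.Name with a struct: FName keeps the old value, LName starts empty
df = df.withColumn(
    'Information',
    F.col('Information').withField(
        'Name',
        F.struct(F.col('Information.Name').alias('FName'), F.lit('').alias('LName')),
    ),
)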

Schema before:

root
 |-- Id: string (nullable = true)
 |-- Information: struct (nullable = true)
 |    |-- Name: string (nullable = true)
 |    |-- Age: integer (nullable = true)

Schema after:

root
 |-- Id: string (nullable = true)
 |-- Information: struct (nullable = true)
 |    |-- Name: struct (nullable = false)
 |    |    |-- FName: string (nullable = true)
 |    |    |-- LName: string (nullable = false)
 |    |-- Age: integer (nullable = true)

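I initialized FName with the current value of Name and LName with an empty string; if needed, you can use substring to derive the two parts from Name instead.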


oxf4rvwz2#

If all names follow the pattern below, you can split on the space.

FirstName LastName

Sample code with data.

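from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pyspark.sql.functions as sqlf

# Reuse the active session if there is one (e.g. in a notebook), otherwise create it
spark = SparkSession.builder.getOrCreate()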

data = [{
   "ID":1,
   "Information":{
      "Name":"Alice Wonderland",
      "Age":20,
      "Gender":"Female"
   },
   "Description":"Test data"
}]
  
schema = StructType([
            StructField("Description", StringType(), True),
            StructField("ID", IntegerType(), True),
            StructField("Information",
                StructType([
                    StructField("Name", StringType(), True),
                    StructField("Age", IntegerType(), True),
                    StructField("Gender", StringType(), True)
                ]),True)
         ])
 
df = spark.createDataFrame(data,schema)

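# Split "FirstName LastName" on the space
splitName = sqlf.split(df.Information.Name, ' ')

# Replace Information.Name with a struct built from the two parts
df = df.withColumn(
    'Information',
    sqlf.col('Information').withField(
        'Name',
        sqlf.struct(splitName[0].alias('firstName'), splitName[1].alias('lastName')),
    ),
)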

df.printSchema()
root
 |-- Description: string (nullable = true)
 |-- ID: integer (nullable = true)
 |-- Information: struct (nullable = true)
 |    |-- Name: struct (nullable = false)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |-- Age: integer (nullable = true)
 |    |-- Gender: string (nullable = true)

df.show(truncate=False)
+-----------+---+---------------------------------+
|Description|ID |Information                      |
+-----------+---+---------------------------------+
|Test data  |1  |{{Alice, Wonderland}, 20, Female}|
+-----------+---+---------------------------------+
