Adding a column to a DataFrame using a case class

8mmmxcuj posted on 2021-05-27 in Spark

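I have a DataFrame (invoice) with two columns, firstname and lastname, and I want to create a new column, fullname, using a case class. The code below does not work because the fullname column is not in the DataFrame.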


**INPUT**

| firstname  | lastname    |
|:-----------|------------:|
| tom        |      jerry  |
| hank       |      polo   |

**OUTPUT**

| firstname  | lastname    | fullname     |
|:-----------|------------:|:------------:|
| tom        |      jerry  | tomjerry     |
| hank       |      polo   | hankpolo     |

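case class invoiceColumns(firstname: String, lastname: String, fullname: String)

def updateFields(c: invoiceColumns): invoiceColumns = {
  // The case class fields are firstname/lastname, not first/last
  val fullname = c.firstname + c.lastname + c.fullname
  c.copy(fullname = fullname)
}

// Fails: invoice has no fullname column to map onto invoiceColumns
val names = invoice.as[invoiceColumns].map(updateFields)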
ljo96ir5 #1

Perhaps this is useful:

Alternative 1

import org.apache.spark.sql.functions.{col, concat}
import spark.implicits._ // assumes a SparkSession named spark

case class invoiceColumns(firstname: String, lastname: String, fullname: String)

val df3 = Seq(("tom", "jerry"), ("hank", "polo")).toDF("firstname", "lastname")
df3.show(false)
df3.printSchema()
/**
  * +---------+--------+
  * |firstname|lastname|
  * +---------+--------+
  * |tom      |jerry   |
  * |hank     |polo    |
  * +---------+--------+
  *
  * root
  * |-- firstname: string (nullable = true)
  * |-- lastname: string (nullable = true)
  */

// Add the fullname column first, then convert to a typed Dataset
val p = df3.withColumn("fullname", concat(col("firstname"), col("lastname")))
  .as[invoiceColumns]
p.show(false)
p.printSchema()
/**
  * +---------+--------+--------+
  * |firstname|lastname|fullname|
  * +---------+--------+--------+
  * |tom      |jerry   |tomjerry|
  * |hank     |polo    |hankpolo|
  * +---------+--------+--------+
  *
  * root
  * |-- firstname: string (nullable = true)
  * |-- lastname: string (nullable = true)
  * |-- fullname: string (nullable = true)
  */

Alternative 2

import org.apache.spark.sql.Row
import spark.implicits._ // assumes a SparkSession named spark

// The auxiliary constructor derives fullname from the other two fields
case class invoiceColumns2(firstname: String, lastname: String, fullname: String) {
  def this(firstname: String, lastname: String) = {
    this(firstname, lastname, firstname + lastname)
  }
}

// df3 is the two-column DataFrame from Alternative 1
val p1 = df3.map { case Row(firstname: String, lastname: String) =>
  new invoiceColumns2(firstname, lastname)
}
p1.show(false)
p1.printSchema()
/**
  * +---------+--------+--------+
  * |firstname|lastname|fullname|
  * +---------+--------+--------+
  * |tom      |jerry   |tomjerry|
  * |hank     |polo    |hankpolo|
  * +---------+--------+--------+
  *
  * root
  * |-- firstname: string (nullable = true)
  * |-- lastname: string (nullable = true)
  * |-- fullname: string (nullable = true)
  */
yiytaume #2

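There are a few different ways to do this.

Use case classes for both input and output

If you can define case classes for both the input and the output, you can do this type-safely with the Dataset API: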

import spark.implicits._ // assumes a SparkSession named spark

case class Input(firstname: String, lastname: String)
case class Output(firstname: String, lastname: String, fullname: String)
object Output {
  // Extra factory that builds an Output from an Input
  def apply(in: Input): Output =
    Output(in.firstname, in.lastname, in.firstname + in.lastname)
}

Seq(Input("tom", "jerry"), Input("hank", "polo"))
  .toDS()
  .map(Output.apply)
  .show()
+---------+--------+--------+
|firstname|lastname|fullname|
+---------+--------+--------+
|      tom|   jerry|tomjerry|
|     hank|    polo|hankpolo|
+---------+--------+--------+

Use a case class for the output only

This is less safe, because the column names are only checked at runtime:

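import spark.implicits._ // assumes a SparkSession named spark

case class Output(firstname: String, lastname: String, fullname: String)
object Output {
  // Extra factory that derives fullname from the two input fields
  def apply(firstname: String, lastname: String): Output =
    Output(firstname, lastname, firstname + lastname)
}

Seq(("tom", "jerry"), ("hank", "polo"))
  .toDF("firstname", "lastname")
  // getAs looks the column up by name when the job runs
  .map(row =>
    Output(row.getAs[String]("firstname"), row.getAs[String]("lastname")))
  .show()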

This produces the same output.
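In both approaches the conversion that builds fullname is an ordinary Scala function, so it can be checked on a local collection before it is run on a Dataset. The following is only a minimal sketch under that assumption: it uses no Spark at all, and the object name `FullNameCheck` and the helper `withFullName` are hypothetical names introduced here for illustration, not part of the answer above.

```scala
// Plain Scala, no Spark session required.
case class Input(firstname: String, lastname: String)
case class Output(firstname: String, lastname: String, fullname: String)

object FullNameCheck {
  // Hypothetical helper: the same concatenation the answer performs
  // inside Output.apply, written as a standalone function.
  def withFullName(in: Input): Output =
    Output(in.firstname, in.lastname, in.firstname + in.lastname)

  def main(args: Array[String]): Unit = {
    val rows = Seq(Input("tom", "jerry"), Input("hank", "polo"))
    // Map over a local Seq instead of a Dataset
    rows.map(withFullName).foreach(println)
    // prints:
    // Output(tom,jerry,tomjerry)
    // Output(hank,polo,hankpolo)
  }
}
```

A function tested this way could then be passed to `Dataset.map` unchanged, provided the Dataset is typed as `Input`.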
