Add a column to a DataFrame using a case class

Posted by 8mmmxcuj on 2021-05-27 in Spark

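I have a DataFrame (invoice) with two columns, firstname and lastname, and I want to create a new column, fullname, using a case class. The code below does not work because the fullname column is not in the DataFrame.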


**INPUT**

| firstname  | lastname    |
|:-----------|------------:|
| tom        |      jerry  |
| hank       |      polo   |

**OUTPUT**

| firstname  | lastname    | fullname     |
|:-----------|------------:|:------------:|
| tom        |      jerry  | tomjerry     |
| hank       |      polo   | hankpolo     |

// Fails: invoice has no fullname column to map onto invoiceColumns
val names = invoice.as[invoiceColumns].map(updateFields)

case class invoiceColumns(firstname: String, lastname: String, fullname: String)

def updateFields(c: invoiceColumns): invoiceColumns = {
  val fullname = c.firstname + c.lastname + c.fullname
  c.copy(fullname = fullname)
}

scala Dataset apache-spark case-class

Source: https://stackoverflow.com/questions/62928495/add-column-to-a-dataframe-using-case-class

2 answers

ljo96ir51#

Perhaps this is helpful:

Alternative 1

import org.apache.spark.sql.functions.{col, concat}
import spark.implicits._ // for toDF and as[...]; assumes a SparkSession named spark

case class invoiceColumns(firstname: String, lastname: String, fullname: String)

val df3 = Seq(("tom", "jerry"), ("hank", "polo")).toDF("firstname", "lastname")
df3.show(false)
df3.printSchema()
/**
  * +---------+--------+
  * |firstname|lastname|
  * +---------+--------+
  * |tom      |jerry   |
  * |hank     |polo    |
  * +---------+--------+
  *
  * root
  * |-- firstname: string (nullable = true)
  * |-- lastname: string (nullable = true)
  */

// Add the fullname column first, then convert to the typed Dataset
val p = df3.withColumn("fullname", concat(col("firstname"), col("lastname")))
  .as[invoiceColumns]
p.show(false)
p.printSchema()
/**
  * +---------+--------+--------+
  * |firstname|lastname|fullname|
  * +---------+--------+--------+
  * |tom      |jerry   |tomjerry|
  * |hank     |polo    |hankpolo|
  * +---------+--------+--------+
  *
  * root
  * |-- firstname: string (nullable = true)
  * |-- lastname: string (nullable = true)
  * |-- fullname: string (nullable = true)
  */

Alternative 2

import org.apache.spark.sql.Row

case class invoiceColumns2(firstname: String, lastname: String, fullname: String) {
  // Auxiliary constructor that derives fullname from the two name parts
  def this(firstname: String, lastname: String) = {
    this(firstname, lastname, firstname + lastname)
  }
}

// Match each untyped Row and build the case class through the auxiliary constructor
val p1 = df3.map { case Row(firstname: String, lastname: String) => new invoiceColumns2(firstname, lastname) }
p1.show(false)
p1.printSchema()
/**
  * +---------+--------+--------+
  * |firstname|lastname|fullname|
  * +---------+--------+--------+
  * |tom      |jerry   |tomjerry|
  * |hank     |polo    |hankpolo|
  * +---------+--------+--------+
  *
  * root
  * |-- firstname: string (nullable = true)
  * |-- lastname: string (nullable = true)
  * |-- fullname: string (nullable = true)
  */
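
One caveat with the pattern match in Alternative 2: a type pattern such as `firstname: String` does not match `null`, so a row with a missing name would raise a `MatchError`. The following is only a sketch of a null-tolerant variant; the object `NullSafeFullname` and the helper `fullnameOf` are hypothetical names, and treating a null as an empty string is an assumption about the desired behavior.

```scala
import org.apache.spark.sql.Row

object NullSafeFullname {
  // Hypothetical helper: concatenates the two fields of a two-column Row,
  // treating a null field as an empty string (an assumption, not a Spark default).
  // Rows with any other number of fields still raise a MatchError.
  def fullnameOf(row: Row): String = row match {
    case Row(first, last) =>
      Seq(first, last)
        .map(v => Option(v).map(_.toString).getOrElse(""))
        .mkString
  }
}
```

It could be used inside the map as `df3.map(r => NullSafeFullname.fullnameOf(r))` to get a Dataset of full names.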

Posted 2021-05-27

yiytaume2#

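There are a few different ways to do this.

Use case classes for both input and output

If you can define case classes for both the input and the output, you can do this safely with the Dataset API: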

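import spark.implicits._ // for toDS; assumes a SparkSession named spark

case class Input(firstname: String, lastname: String)
case class Output(firstname: String, lastname: String, fullname: String)
object Output {
  // Extra factory method that derives fullname from an Input
  def apply(in: Input): Output =
    Output(in.firstname, in.lastname, in.firstname + in.lastname)
}

Seq(Input("tom", "jerry"), Input("hank", "polo"))
  .toDS()
  .map(in => Output(in)) // explicit lambda selects the apply(Input) overload
  .show()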

+---------+--------+--------+
|firstname|lastname|fullname|
+---------+--------+--------+
|      tom|   jerry|tomjerry|
|     hank|    polo|hankpolo|
+---------+--------+--------+
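
To connect this back to the question: the same idea can be applied to an existing DataFrame by typing it with a case class that lists only the columns it already has, then mapping to the wider class. This is a minimal sketch, not the answerer's code; the object `AddFullname` and the method `withFullname` are hypothetical names, and declaring the case classes at top level is a precaution against encoder-derivation problems that can occur when they are declared inside a method.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Input lists only the columns the DataFrame already has;
// Output adds the derived fullname field.
case class Input(firstname: String, lastname: String)
case class Output(firstname: String, lastname: String, fullname: String)

object AddFullname {
  // Hypothetical helper: converts a firstname/lastname DataFrame
  // into a typed Dataset that also carries fullname.
  def withFullname(spark: SparkSession, invoice: DataFrame): Dataset[Output] = {
    import spark.implicits._ // encoders for Input and Output
    invoice
      .as[Input] // succeeds because both Input fields exist as columns
      .map(in => Output(in.firstname, in.lastname, in.firstname + in.lastname))
  }
}
```

With the question's DataFrame this would be called as `AddFullname.withFullname(spark, invoice)`, where `spark` is the active SparkSession.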

Use a case class for the output only

This is less safe, because the column names are only checked at runtime:

case class Output(firstname: String, lastname: String, fullname: String)
object Output {
  def apply(firstname: String, lastname: String): Output =
    Output(firstname, lastname, firstname + lastname)
}

Seq(("tom", "jerry"), ("hank", "polo"))
  .toDF("firstname", "lastname")
  .map(row =>
    Output(row.getAs[String]("firstname"), row.getAs[String]("lastname")))
  .show()

This produces the same output.
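
The difference in safety between the two approaches can be illustrated with a small sketch. As far as I know, `.as[...]` resolves every case-class field against the DataFrame's columns when it is called (which is why the question's code fails on the missing fullname column, typically with an AnalysisException), whereas `getAs` looks the name up on each row only when that code runs. The names `Person` and `ColumnCheckDemo` below are hypothetical, and the misspelled column `firstnam` is deliberate.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

// Hypothetical class with a field (fullname) that the DataFrame lacks.
case class Person(firstname: String, lastname: String, fullname: String)

object ColumnCheckDemo {
  // Returns (typedConversionFailed, nameLookupFailed).
  def run(spark: SparkSession): (Boolean, Boolean) = {
    import spark.implicits._
    val df = Seq(("tom", "jerry")).toDF("firstname", "lastname")

    // The typed conversion is checked against the schema up front,
    // before any row is processed.
    val typedConversionFailed = Try(df.as[Person]).isFailure

    // The misspelled name is only detected when getAs actually executes.
    val nameLookupFailed = Try(df.head().getAs[String]("firstnam")).isFailure

    (typedConversionFailed, nameLookupFailed)
  }
}
```

Both checks are expected to report a failure here; the point is where the failure surfaces, not which exception type is thrown.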

Posted 2021-05-27
