Selecting values with different separator patterns

olhwl3o2 posted on 2021-05-27 in Spark

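I have a file whose lines contain several elements. From Spark Scala, I want to select only certain values from each line, but the separator differs from one value to the next. My file looks like this: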

"test, 27.08.2020.14.56.30, mary, products=[Product{id=123, origin=in}]"
"test, 27.08.2020.14.58.50, ane, products=[Product{id=1245, origin=on}]"

The goal is to get a table like this:

class             date              name       id
 test     27.08.2020.14.56.30       mary      123
 test     27.08.2020.14.58.50       ane       1245

I want to join the attributes from the same line, then associate the headers with those values and print a table.

val file= sc.textFile("C:\Users\test.txt")
val name = file.map(_.split(",")).map{x => (x(0),x(1),x(2))}
val id = file.map(_.split("=")).map{x => (x(3))}
val all = name.union(id).collect
val newNames = Seq("class","date","name","id")
val df = all.toDF(newNames: _*)
df.show()

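However, for the last element I only want the id value (for example 123), and because the separator there is different I do not know how to select that number. When I collect the elements, I get an error. How can I select these elements and join them so that I can associate them with the headers afterwards?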

xbp102n0


Maybe I have not understood your question, but have you tried this?

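// Needed for toDF outside spark-shell (spark-shell imports it automatically)
import spark.implicits._

val tstSeq = spark.sparkContext.textFile("/user/admin/tst.txt")

// Split on "," first; the fourth piece is " products=[Product{id=123",
// so splitting that piece on "=" leaves the id at index 2
val all = tstSeq.map(_.split(",")).map { x => (x(0), x(1), x(2), x(3).split("=")(2)) }

val newNames = Seq("class", "date", "name", "id")
val df = all.toDF(newNames: _*)

df.show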

This outputs:

+-----+--------------------+-----+----+
|class|                date| name|  id|
+-----+--------------------+-----+----+
| test| 27.08.2020.14.56.30| mary| 123|
| test| 27.08.2020.14.58.50|  ane|1245|
+-----+--------------------+-----+----+
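The indices in the snippet above can look arbitrary, so here is a small plain-Scala sketch (no Spark required) that applies the same two splits to one sample line from the question. The object name SplitWalkthrough is only illustrative. The point it tries to show is that the comma inside Product{...} also acts as a separator, which is why the line breaks into five pieces rather than four.

```scala
// Plain Scala, no Spark needed: the same two-step split on one sample line.
// SplitWalkthrough is a hypothetical name used only for this illustration.
object SplitWalkthrough {
  val line = "test, 27.08.2020.14.56.30, mary, products=[Product{id=123, origin=in}]"

  // Step 1: split on ",". The comma inside Product{...} also splits,
  // so this yields five pieces, each (after the first) with a leading blank.
  val parts: Array[String] = line.split(",")

  // Step 2: the fourth piece is " products=[Product{id=123".
  // It contains two "=" signs, so the id sits at index 2.
  val idPieces: Array[String] = parts(3).split("=")
  val id: String = idPieces(2)

  def main(args: Array[String]): Unit = {
    parts.foreach(p => println(s"[$p]")) // brackets make the leading blanks visible
    println(id)
  }
}
```

The leading blanks kept by the first split are the reason the date and name columns in the output start with a space; calling trim on each piece would remove them.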

Alternatively, to extract the origin field as well:

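// Needed for toDF outside spark-shell (spark-shell imports it automatically)
import spark.implicits._

val tstSeq = spark.sparkContext.textFile("/user/admin/tst.txt")

// The fifth piece is " origin=in}]": take the text after "=" and strip the closing "}" and "]"
val all = tstSeq.map(_.split(",")).map { x =>
  (x(0), x(1), x(2), x(3).split("=")(2), x(4).split("=")(1).replace("}", "").replace("]", ""))
}

val newNames = Seq("class", "date", "name", "id", "origin")
val df = all.toDF(newNames: _*)
df.show(false)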

which gives this output:

+-----+--------------------+-----+----+------+
|class|date                |name |id  |origin|
+-----+--------------------+-----+----+------+
|test | 27.08.2020.14.56.30| mary|123 |in    |
|test | 27.08.2020.14.58.50| ane |1245|on    |
+-----+--------------------+-----+----+------+
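Index-based splitting throws an exception on any line that does not have the expected shape. As a possible alternative, not part of the original answer, a single regular expression can describe the whole line and pull out all five values at once. The sketch below is plain Scala; the names RegexParse, Record and parse are hypothetical, and it assumes every valid line has exactly one Product entry with a numeric id.

```scala
// A regex-based variant, offered as a sketch under the assumptions stated above.
object RegexParse {
  // Hypothetical container for one parsed line; field names are illustrative.
  final case class Record(cls: String, date: String, name: String, id: String, origin: String)

  // Three comma-separated fields, then products=[Product{id=..., origin=...}].
  // \s* absorbs the blank after each comma, so the captured values need no trim.
  private val LinePattern =
    """\s*([^,]+?)\s*,\s*([^,]+?)\s*,\s*([^,]+?)\s*,\s*products=\[Product\{id=(\d+),\s*origin=(\w+)\}\]\s*""".r

  // Returns None for lines that do not match, instead of throwing.
  def parse(line: String): Option[Record] = {
    // Drop surrounding double quotes if the file contains them, as in the question's sample.
    val cleaned = line.trim.stripPrefix("\"").stripSuffix("\"")
    cleaned match {
      case LinePattern(cls, date, name, id, origin) => Some(Record(cls, date, name, id, origin))
      case _                                        => None
    }
  }
}
```

With an RDD of lines, something like tstSeq.flatMap(l => RegexParse.parse(l).toList) should keep only the lines that parse, which avoids the ArrayIndexOutOfBoundsException that the split-based version can raise on malformed input.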
