How to create a DataFrame from a text file in Spark

tv6aics1 posted on 2021-05-27 in Spark

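I have a text file in HDFS in the following format:

0029029070999991901010106004 +64333+023450fm-12+00001n9- 0078 1+381
0035029070999991902010113004 +64333+023450fm-12+00001n9- 0100 1+381

What I want is to take the first 25 characters as a string, and the first 4 digits after the second minus sign, divided by 10, as a double, skipping all other characters, like this: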

ID                                | Column
----------------------------      | ----
0029029070999991901010106004      | 007.8
0035029070999991902010113004      | 010.0

How can I do this? Thanks, everyone!

vi4fp9gy

Check the code below.
Your expected output may be wrong: 00781/10.0 = 78.1, not 7.8, and 01001/10.0 = 100.1, not 10.0.
```
// Run in spark-shell, where spark.implicits._ and org.apache.spark.sql.functions._ are already imported.
scala> val df = spark.read.text("/tmp/data")
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999MW1381|
|0035029070999991902010113004+64333+023450FM-12+000599999V0201401N011819999999N0000001N9-01001+99999100311ADDGF104991999999999999999999MW1381|
+--------------------------------------------------------------------------------------------------------------------------------------------+

scala> df
  // id: the 28 leading digits of each line
  .withColumn("id", regexp_extract($"value", "(^[0-9]{28})", 0))
  // column: the 5 digits after "N9-", divided by 10
  .withColumn("column", (regexp_extract($"value", "N9-([0-9]{5})", 1) / lit(10.0)).cast("double"))
  .select("id", "column")
  .show(false)

+----------------------------+------+
|id                          |column|
+----------------------------+------+
|0029029070999991901010106004|78.1  |
|0035029070999991902010113004|100.1 |
+----------------------------+------+
```

Update: ignore the above.
If you only want 4 digits, you can try the code below.
```
scala> df
  // id: the 28 leading digits of each line
  .withColumn("id", regexp_extract($"value", "(^[0-9]{28})", 0))
  // column: only the first 4 digits after "N9-", divided by 10
  .withColumn("column", (regexp_extract($"value", "N9-([0-9]{4})", 1) / lit(10.0)).cast("double"))
  .select("id", "column")
  .show(false)

+----------------------------+------+
|id                          |column|
+----------------------------+------+
|0029029070999991901010106004|7.8   |
|0035029070999991902010113004|10.0  |
+----------------------------+------+
```
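
To see why the two versions give 78.1 and 7.8 without starting Spark, the same two regular expressions can be tried on a single line in plain Scala. This is only a minimal sketch: `ParseSketch`, `Reading`, and `parseLine` are hypothetical names introduced here for illustration, and it assumes the record layout shown in the `df.show` output above (28 leading digits, then a value after `N9-`).

```scala
// Plain-Scala sketch (no Spark required) of the extraction logic.
// All names here are hypothetical, not part of the Spark API.
object ParseSketch {
  final case class Reading(id: String, column: Double)

  // Same idea as "(^[0-9]{28})" in the answer: the 28 leading digits.
  private val IdPattern = "^[0-9]{28}".r

  // `digits` is how many digits to capture after "N9-" (5 or 4 in the answer).
  def parseLine(line: String, digits: Int): Option[Reading] = {
    val valuePattern = s"N9-([0-9]{$digits})".r
    for {
      id <- IdPattern.findFirstIn(line)          // None if the line does not start with 28 digits
      m  <- valuePattern.findFirstMatchIn(line)  // None if "N9-" plus digits is missing
    } yield Reading(id, m.group(1).toDouble / 10.0)
  }
}
```

With the first sample line, `ParseSketch.parseLine(line, 5)` captures `00781` and yields 78.1, while `ParseSketch.parseLine(line, 4)` captures `0078` and yields 7.8. Returning an `Option` makes lines that do not match explicit instead of silently producing an empty value.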
