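I have a text file on HDFS in the following format:
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999MW1381
0035029070999991902010113004+64333+023450FM-12+000599999V0201401N011819999999N0000001N9-01001+99999100311ADDGF104991999999999999999999MW1381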
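What I want to do is take the first 25 characters as a string, then take the first 4 digits after the second minus sign, divided by 10, as a double, and skip all other characters, like this: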
ID | Column
---------------------------- | ----
0029029070999991901010106004 | 007.8
0035029070999991902010113004 | 010.0
How can I do this? Thanks, everyone!
1 Answer
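Check the code below.
Your expected output may be wrong. The field after the second minus sign has five digits, so:
00781 / 10.0 = 78.1, not 7.8
01001 / 10.0 = 100.1, not 10.0
If you take only the first four digits (0078 and 0100), you do get 7.8 and 10.0. Both variants are shown.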
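To make the arithmetic above easy to check without a Spark session, here is a minimal plain-Scala sketch. The names `RecordParser`, `parse`, and `digits` are hypothetical and chosen only for illustration. It assumes, as the answer does, that the ID is the leading 28 digits and that the value follows the literal `N9-`.

```scala
object RecordParser {
  // Leading 28-digit identifier (assumption: same as the answer's "(^[0-9]{28})").
  private val IdPattern = "^[0-9]{28}".r

  /** Returns (id, value), where value is the first `digits` digits after "N9-", divided by 10. */
  def parse(line: String, digits: Int): Option[(String, Double)] =
    for {
      id <- IdPattern.findFirstIn(line)
      m  <- s"N9-([0-9]{$digits})".r.findFirstMatchIn(line)
    } yield (id, m.group(1).toDouble / 10.0)

  def main(args: Array[String]): Unit = {
    val line =
      "0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999" +
        "N0000001N9-00781+99999102001ADDGF108991999999999999999999MW1381"

    println(parse(line, 5)) // five digits: 00781 / 10.0, value 78.1
    println(parse(line, 4)) // four digits: 0078 / 10.0, value 7.8
  }
}
```

The only difference between the two results is how many digits the capture group takes, which is the same choice the question has to make between 78.1 and 7.8.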
```
scala> val df = spark.read.text("/tmp/data")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999MW1381|
|0035029070999991902010113004+64333+023450FM-12+000599999V0201401N011819999999N0000001N9-01001+99999100311ADDGF104991999999999999999999MW1381|
+--------------------------------------------------------------------------------------------------------------------------------------------+
scala> // Variant 1: all five digits after "N9-", divided by 10
scala> df
  .withColumn("id", regexp_extract($"value", "(^[0-9]{28})", 0))
  .withColumn("column", (regexp_extract($"value", "N9-([0-9]{5})", 1) / lit(10.0)).cast("double"))
  .select("id", "column")
  .show(false)
+----------------------------+------+
|id |column|
+----------------------------+------+
|0029029070999991901010106004|78.1 |
|0035029070999991902010113004|100.1 |
+----------------------------+------+
scala> // Variant 2: only the first four digits after "N9-", divided by 10
scala> df
  .withColumn("id", regexp_extract($"value", "(^[0-9]{28})", 0))
  .withColumn("column", (regexp_extract($"value", "N9-([0-9]{4})", 1) / lit(10.0)).cast("double"))
  .select("id", "column")
  .show(false)
+----------------------------+------+
|id |column|
+----------------------------+------+
|0029029070999991901010106004|7.8 |
|0035029070999991902010113004|10.0 |
+----------------------------+------+
```