Loading data into a Spark DataFrame when the source has no delimiter

Asked by 5t7ly7z5 on 2021-05-27 in Spark

I have a dataset with no delimiters:

111222333444
555666777888

Desired output:

|_c1_|_c2_|_c3_|_c4_|
|111 |222 |333 |444 |
|555 |666 |777 |888 |

I tried this approach to get the result:

val myDF = spark.sparkContext.textFile("myFile").toDF()
val myNewDF = myDF.withColumn("c1", substring(col("value"), 0, 3))
                  .withColumn("c2", substring(col("value"), 3, 6))
                  .withColumn("c3", substring(col("value"), 6, 9)
                  .withColumn("c4", substring(col("value"), 9, 12))
             .drop("value") 
             .show()

But I need to manipulate c4 (multiply it by 100), and its data type is string rather than double.
Update: I ran into the following when executing this:

val myNewDF = myDF.withColumn("c1", expr("substring(value, 0, 3)"))
.withColumn("c2",  expr("substring(value, 3, 6"))
.withColumn("c3", expr("substring(value, 6, 9)"))
.withColumn("c4", (expr("substring(value, 9, 12)").cast("double") * 100))
.drop("value")
myNewDF.show(5, false)   // it only shows the "value" column (which I dropped) and the "c1" column
myNewDF.printSchema      // it only prints 2 lines

Why doesn't it show all 4 newly created columns?
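
A couple of likely culprits, offered as an observation rather than something from the original thread: the c2 expression above is missing a closing parenthesis inside its expr string, and in the first attempt myNewDF was assigned the result of .show(), which returns Unit rather than a DataFrame. A corrected sketch that keeps the transformation separate from the action (using the 1-based position/length arguments the answers below arrive at):

val myNewDF = myDF
  .withColumn("c1", expr("substring(value, 1, 3)"))   // substring takes a 1-based position and a length
  .withColumn("c2", expr("substring(value, 4, 3)"))
  .withColumn("c3", expr("substring(value, 7, 3)"))
  .withColumn("c4", expr("substring(value, 10, 3)").cast("double") * 100)
  .drop("value")

myNewDF.show(5, false)   // call the action on the DataFrame, not as part of the assignment
myNewDF.printSchema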
Answer 1 (xxslljrj):

Leaving a few things to you, such as 1) reading from the file and naming the Dataset/DataFrame columns explicitly, this RDD-based simulation of your input should help:

val rdd = sc.parallelize(Seq(("111222333444"), 
                             ("555666777888")
                            )
                        )

val df = rdd.map(x => (x.slice(0,3), x.slice(3,6), x.slice(6,9), x.slice(9,12))).toDF()  
df.show(false)

Returns:

+---+---+---+---+
|_1 |_2 |_3 |_4 |
+---+---+---+---+
|111|222|333|444|
|555|666|777|888|
+---+---+---+---+
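
To get the column names from the desired output instead of the default _1 through _4, toDF also accepts explicit names; a small addition to the sketch above:

val named = rdd.map(x => (x.slice(0, 3), x.slice(3, 6), x.slice(6, 9), x.slice(9, 12)))
               .toDF("c1", "c2", "c3", "c4")  // explicit column names instead of _1.._4
named.show(false)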


Using the DataFrame API:

import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(("111222333444"), 
                        ("555666777888"))
                    ).toDF()

val df2 = df.withColumn("c1", expr("substring(value, 1, 3)"))
            .withColumn("c2", expr("substring(value, 4, 3)"))
            .withColumn("c3", expr("substring(value, 7, 3)"))
            .withColumn("c4", expr("substring(value, 10, 3)"))
df2.show(false)

Returns:

+------------+---+---+---+---+
|value       |c1 |c2 |c3 |c4 |
+------------+---+---+---+---+
|111222333444|111|222|333|444|
|555666777888|555|666|777|888|
+------------+---+---+---+---+

You can drop the value column; I leave that to you.
This is like the other answer, but it gets more complicated if the chunks are not all of size 3. Note that substring in Spark SQL takes a 1-based start position and a length, not start and end offsets, which is likely why your original indices overlapped.
For the multiply-by-100 from your question update:

val df2 = df.withColumn("c1", expr("substring(value, 1, 3)"))
            .withColumn("c2", expr("substring(value, 4, 3)"))
            .withColumn("c3", expr("substring(value, 7, 3)"))
            .withColumn("c4", expr("substring(value, 10, 3)").cast("double") * 100)

Answer 2 (y53ybaqx):

Create a test DataFrame:

scala> var df = Seq(("111222333444"),("555666777888")).toDF("s")
df: org.apache.spark.sql.DataFrame = [s: string]

Split column s into an array of 3-character chunks:

scala> var res = df.withColumn("temp",split(col("s"),"(?<=\\G...)"))
res: org.apache.spark.sql.DataFrame = [s: string, temp: array<string>]
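
The regex relies on Java's \G anchor (the end of the previous match): the lookbehind (?<=\G...) matches the empty position after every 3 characters, so split cuts the string into 3-character pieces without consuming anything. A quick way to sanity-check it in plain Scala:

scala> "111222333444".split("(?<=\\G...)")
res0: Array[String] = Array(111, 222, 333, 444)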

Map the array elements to new columns:

scala> res = res.select((1 until 5).map(i => col("temp").getItem(i-1).as("c"+i)):_*)
res: org.apache.spark.sql.DataFrame = [c1: string, c2: string ... 2 more fields]

scala> res.show(false)
+---+---+---+---+
|c1 |c2 |c3 |c4 |
+---+---+---+---+
|111|222|333|444|
|555|666|777|888|
+---+---+---+---+
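
This answer does not cover the multiply-by-100 requirement from the question; as a sketch, c4 can be cast and scaled afterwards, which should print:

scala> res.withColumn("c4", col("c4").cast("double") * 100).show(false)
+---+---+---+-------+
|c1 |c2 |c3 |c4     |
+---+---+---+-------+
|111|222|333|44400.0|
|555|666|777|88800.0|
+---+---+---+-------+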
