How to filter data based on three columns in Scala

py49o6xq · posted 2021-05-29 in Hadoop

I am new to Scala, and I want to iterate with three for loops over a dataset and perform some analysis. For example, my data looks like this:

Sample.csv

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9
1,100,1,NA,0,1,0,Friday,1,5
1,100,2,NA,0,1,0,Friday,1,5
1,101,0,NA,0,1,0,Friday,1,5
1,101,1,NA,0,1,0,Friday,1,5
1,101,2,NA,0,1,0,Friday,1,5
1,102,0,NA,0,1,0,Friday,1,5
1,102,1,NA,0,1,0,Friday,1,5
1,102,2,NA,0,1,0,Friday,1,5

Right now I read it like this:

val data = sc.textFile("C:/users/ricky/Data.csv")

Now I need to implement a filter on the first three columns in Scala, so that I can take subsets of the whole data and do some analysis. I have one value for column 1 (1), three values for column 2 (100, 101, 102), and three values for column 3 (0, 1, 2), so I need to run filters that produce subsets of the whole data. Is it a good idea to use a loop like the one below?

for {
  i <- Seq(1)
  j <- 100 to 102
  k <- 0 to 2
} {
  // filter the subset for (i, j, k) here
}

It should pick out subsets of the data, such as

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9

where i = 1, j = 100, and k = 0

all the way up to

1,102,2,NA,0,1,0,Friday,1,5

where i = 1, j = 102, and k = 2

How can I run this over the data (read from the CSV) in Scala?

lhcgjxsq 1#

After reading from the CSV text file, you can use filter to select the data you need:

val tempData = data.map(line => line.split(","))  // split each line into an Array[String]
// keep only the rows whose first three columns are 1, 100 and 0
tempData.filter(array => array(0) == "1" && array(1) == "100" && array(2) == "0").foreach(x => println(x.mkString(",")))

which will give you the result

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9

You can do the same for the remaining cases; see the sketch below.
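If you do want to drive this from the three loops in your question, a minimal sketch (assuming tempData is the RDD[Array[String]] built above) could nest the filter inside a for comprehension:

// Sketch only: iterate over the combinations from the question and filter each subset.
for {
  i <- Seq(1)
  j <- 100 to 102
  k <- 0 to 2
} {
  val subset = tempData.filter(a => a(0) == i.toString && a(1) == j.toString && a(2) == k.toString)
  println(s"i=$i, j=$j, k=$k -> ${subset.count()} rows")
}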
DataFrame API
You can use the DataFrame API for its simplicity, better optimization than RDDs, and more. The first step is to read the CSV into a dataframe:

val df = sqlContext.read.format("com.databricks.spark.csv").load("path to csv file")

which should give you

+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7      |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1  |100|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|0  |NA |0  |1  |0  |Wednesday|1  |9  |
|1  |100|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|2  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|2  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|2  |NA |0  |1  |0  |Friday   |1  |5  |
+---+---+---+---+---+---+---+---------+---+---+

Then you can use the filter API just as you did with the RDD:

import sqlContext.implicits._
val df1 = df.filter($"_c0" === "1" && $"_c1" === "100" && $"_c2" === "0")

and you should have

+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7      |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1  |100|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|0  |NA |0  |1  |0  |Wednesday|1  |9  |
+---+---+---+---+---+---+---+---------+---+---+
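If you'd rather not repeat the filter for every combination, a small helper can take the three values as parameters. This is just a sketch: the name subsetFor is made up, and it assumes the implicits import from the previous snippet is in scope so that the $ column syntax works.

import org.apache.spark.sql.DataFrame

// Hypothetical helper: returns the subset of df matching the first three columns.
def subsetFor(df: DataFrame, i: Int, j: Int, k: Int): DataFrame =
  df.filter($"_c0" === i.toString && $"_c1" === j.toString && $"_c2" === k.toString)

// Same subset as above:
subsetFor(df, 1, 100, 0).show(false)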

You can even define a schema to get the column names you want.
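As a rough sketch of that, assuming you pick your own column names (the ones below are invented for illustration), you can pass a StructType to the same reader:

import org.apache.spark.sql.types._

// Illustrative schema; adjust the names and types to your actual data.
val schema = StructType(Seq(
  StructField("id",    IntegerType, nullable = true),
  StructField("group", IntegerType, nullable = true),
  StructField("index", IntegerType, nullable = true),
  StructField("flag",  StringType,  nullable = true),  // the NA column
  StructField("a",     IntegerType, nullable = true),
  StructField("b",     IntegerType, nullable = true),
  StructField("c",     IntegerType, nullable = true),
  StructField("day",   StringType,  nullable = true),
  StructField("d",     IntegerType, nullable = true),
  StructField("e",     IntegerType, nullable = true)
))

val dfNamed = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .load("path to csv file")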
Edit
To answer your comment below: it entirely depends on what output you want.

scala> val temp = tempData.filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0).map(x => x.mkString(","))
temp: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at map at <console>:28

scala> tempData.filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0)
res9: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[13] at filter at <console>:29
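Either way, you can pull a few rows back to the driver to check what you got, for example:

temp.take(5).foreach(println)                        // RDD[String]: lines are already joined
res9.take(5).foreach(a => println(a.mkString(",")))  // RDD[Array[String]]: join before printing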

I hope this is clear.
