Non-deterministic behavior of RDD union in Spark (Scala)

k10s72fa, posted on 2021-05-29 in Spark

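I am performing a union on 3 RDDs. I know that union does not preserve ordering, but in my case the behavior is quite strange. Can someone explain what is wrong with my code?
I have a DataFrame of rows (myDF) that I convert to an RDD: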

val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec))

myRdd.collect
/*
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887

*/

val rowCount = myRdd.count() // Count of Records in myRdd

val header = "name:country:date:nextdate:1" // random header

// Generating Header Rdd
val headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec))

//Generating Trailer Rdd
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount),1).map(rec => (3, rec))

//Performing Union
val unionRdd = headerRdd.union(myRdd).union(trailerRdd).map(rec => rec._2)
unionRdd.saveAsTextFile("pathLocation")

Since union does not preserve order, I would not expect it to produce the following result.

Output:

name:country:date:nextdate:1
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3

How is it possible to get the output above without applying any sort, such as the following?

sortByKey(true, 1)

But when I remove the map from headerRdd, myRdd and trailerRdd, the order looks like this:

Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
name:country:date:nextdate:1
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3

What could be the reason for this behavior?

lvmkulzt #1

In Spark, the elements within a particular partition are unordered, but the partitions themselves are ordered.
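
One way to check this is to tag each record with the index of the partition that holds it after the union. The sketch below is only an illustration under stated assumptions: it runs against a local master, uses placeholder rows (row-1, row-2, row-3) instead of the real DataFrame, and the names UnionOrderSketch and partitionLayout are hypothetical. As far as I know, union on RDDs that do not share a partitioner simply concatenates the partitions of its inputs in the order they are given, so the header partition should come first and the trailer partition last.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionOrderSketch {
  // Hypothetical helper: unions header, body and trailer, then pairs every
  // record with the index of the partition it ended up in.
  def partitionLayout(sc: SparkContext): Array[(Int, String)] = {
    val header  = sc.parallelize(Seq("name:country:date:nextdate:1"), 1)
    // Placeholder rows standing in for the DataFrame contents (assumption).
    val body    = sc.parallelize(Seq("row-1", "row-2", "row-3"), 2)
    val trailer = sc.parallelize(Seq("T:3"), 1)

    header.union(body).union(trailer)
      .mapPartitionsWithIndex((idx, records) => records.map(rec => (idx, rec)))
      .collect() // collect returns results in partition-index order
  }

  def main(args: Array[String]): Unit = {
    // Local master chosen only for this sketch.
    val conf = new SparkConf().setAppName("union-order-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      partitionLayout(sc).foreach { case (idx, rec) => println(s"partition $idx: $rec") }
    } finally {
      sc.stop()
    }
  }
}
```

saveAsTextFile writes one part file per partition, named by partition index, which would explain why reading the files back in name order shows header, rows, trailer. This is a consequence of the partition layout rather than a sort, so if the order must hold regardless of how the inputs are partitioned, keeping the numeric keys from the question and calling sortByKey before dropping them is the safer route; note that rows sharing the same key are not guaranteed to keep their relative order that way.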
