spark如何跟踪randomsplit中的拆分？

iqxoj9l9 于 2021-05-27 发布在 Spark

关注(0)|答案(1)|浏览(604)

这个问题解释了spark的随机分割是如何工作的，sparks rdd.randomsplit实际上是如何分割rdd的，但是我不明白spark是如何跟踪哪些值被分配到一个分割中，这样那些相同的值就不会被分配到第二个分割中。
如果我们看看randomsplit的实现：

def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
 // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
 // constituent partitions each time a split is materialized which could result in
 // overlapping splits. To prevent this, we explicitly sort each input partition to make the
 // ordering deterministic.

 val sorted = Sort(logicalPlan.output.map(SortOrder(_, Ascending)), global = false, logicalPlan)
 val sum = weights.sum
 val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
 normalizedCumWeights.sliding(2).map { x =>
  new DataFrame(sqlContext, Sample(x(0), x(1), withReplacement = false, seed, sorted))
}.toArray
}

我们可以看到，它创建了两个Dataframe，它们共享相同的sqlcontext和两个不同的sample（rs）。
这两个Dataframe如何相互通信，以便第一个Dataframe中的值不包含在第二个Dataframe中？
数据是否被提取了两次(假设sqlcontext是从db中选择的，那么select是否执行了两次？）。

apache-spark apache-spark-sql apache-spark-mllib spark-dataframe

来源：https://stackoverflow.com/questions/62159261/splitting-a-dataset-each-containing-n-records

1条答案

按热度按时间

6yt4nkrj1#

这和rdd取样完全一样。
假设你有权数组 (0.6, 0.2, 0.2) ，spark将为每个范围生成一个Dataframe (0.0, 0.6), (0.6, 0.8), (0.8, 1.0) .
当要读取结果Dataframe时，spark只会遍历父Dataframe。对于每个项目，生成一个随机数，如果该数字在指定范围内，则发出该项目。所有子Dataframe共享相同的随机数生成器（技术上讲，不同的生成器具有相同的种子），因此随机数的序列是确定的。
对于最后一个问题，如果没有缓存父Dataframe，则每次计算输出Dataframe时，都将重新获取输入Dataframe的数据。

赞(0）回复(0）举报 2021-05-27

我来回答

spark如何跟踪randomsplit中的拆分？

1条答案

相关问题

热门标签

最新问答