I am trying to do stratified sampling on a Spark DataFrame, but the sampleBy function behaves (oddly) just like sample.
Spark version: 3.0.1
import spark.implicits._
val data = Seq( ("Java", 20000), ("Java", 10000), ("Java", 3000), ("Java", 17000),
("Python", 100000), ("Python", 20000),
("Scala", 3000), ("Scala", 4000), ("Scala", 1000), ("Scala", 43000), ("Scala", 2000), ("Scala", 9000)).toDF("Language", "Price")
val sample_size = 0.5
val seed = 762387
val stratify = "Language"
// Plain (non-stratified) sample, for comparison
val subsample = data.sample(withReplacement=false, fraction=sample_size, seed=seed)
subsample.show()
// Same sampling fraction for every distinct value of the stratification column
val fractions = data.select(stratify).distinct().as[String].collect().map((_, sample_size)).toMap
println(fractions.mkString("\n"))
// Stratified sample with the same seed
val stratified_subsample = data.stat.sampleBy(stratify, fractions=fractions, seed=seed)
stratified_subsample.show()
Output:
+--------+------+
|Language| Price|
+--------+------+
|    Java|  3000|
|  Python|100000|
|  Python| 20000|
|   Scala|  3000|
|   Scala| 43000|
|   Scala|  2000|
|   Scala|  9000|
+--------+------+
Scala -> 0.5
Python -> 0.5
Java -> 0.5
+--------+------+
|Language| Price|
+--------+------+
|    Java|  3000|
|  Python|100000|
|  Python| 20000|
|   Scala|  3000|
|   Scala| 43000|
|   Scala|  2000|
|   Scala|  9000|
+--------+------+
Output with another seed (6354345):
+--------+------+
|Language| Price|
+--------+------+
|    Java| 10000|
|    Java| 17000|
|  Python|100000|
|   Scala|  3000|
|   Scala|  4000|
|   Scala|  1000|
|   Scala| 43000|
|   Scala|  2000|
|   Scala|  9000|
+--------+------+
Scala -> 0.5
Python -> 0.5
Java -> 0.5
+--------+------+
|Language| Price|
+--------+------+
|    Java| 10000|
|    Java| 17000|
|  Python|100000|
|   Scala|  3000|
|   Scala|  4000|
|   Scala|  1000|
|   Scala| 43000|
|   Scala|  2000|
|   Scala|  9000|
+--------+------+
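I tried different DataFrames and different seeds, and the two resulting DataFrames are always identical. The behavior is always the same: the sample is not stratified at all. I know sampleBy is not exact, but getting exactly the same rows as sample seems wrong. Is there something wrong with my snippet?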
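For comparison, here is a minimal sketch of what an exact per-stratum sample could look like, assuming the goal is to keep roughly ceil(fraction * group size) rows from each Language group. It does not use sampleBy, which as far as I understand decides row by row with a seeded random draw and so does not guarantee per-group counts. The helper name exactStratifiedSample and the temporary columns _rnd, _rank and _size are hypothetical, chosen only for this illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{ceil, col, count, lit, rand, row_number}

// Hypothetical helper: keeps ceil(fraction * groupSize) rows from every stratum.
def exactStratifiedSample(df: DataFrame, stratumCol: String,
                          fraction: Double, seed: Long): DataFrame = {
  val byStratum = Window.partitionBy(col(stratumCol))
  df
    // Random sort key per row; the seed keeps the draw repeatable
    // only as long as the partitioning of df does not change.
    .withColumn("_rnd", rand(seed))
    // Position of each row within its stratum after the random shuffle.
    .withColumn("_rank", row_number().over(byStratum.orderBy(col("_rnd"))))
    // Number of rows in the stratum.
    .withColumn("_size", count(lit(1)).over(byStratum))
    // Keep the first ceil(fraction * size) rows of each stratum.
    .filter(col("_rank") <= ceil(col("_size") * fraction))
    .drop("_rnd", "_rank", "_size")
}

// Usage with the DataFrame from the question:
// exactStratifiedSample(data, "Language", 0.5, 762387L).show()
```

With the data above and a fraction of 0.5 this should keep 2 Java, 1 Python and 3 Scala rows, though which rows are kept depends on the seed and the partitioning. The window functions shuffle the data by stratum, so this is likely more expensive than sampleBy on large inputs.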