Generating random values from normal and uniform distributions

Posted by k5hmc34c on 2021-05-27 in Spark

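I have to test some algorithms with Spark MLlib, and I would like to know whether Spark has a built-in way to generate random double values from a normal or a uniform distribution.
The size of the DataFrame can vary widely, from a few million rows up to a hundred million.
Is there an efficient way to do this?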

xfb7svmp #1

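Spark SQL provides random data generation functions that make this easy.
You can generate columns filled with random values drawn from a uniform distribution (rand, values in [0, 1)) or from a standard normal distribution (randn, mean 0 and standard deviation 1).
This is useful for randomized algorithms, prototyping, and performance testing.
For example: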

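import org.apache.spark.sql.functions.{rand, randn}

// Base DataFrame with a single "id" column; adjust the range to the number of rows you need.
// (On Spark 2.x and later, spark.range(0, 20) is the SparkSession equivalent.)
val dfr = sqlContext.range(0, 20)
val randomValues = dfr.select("id")
  .withColumn("uniform", rand(10L))  // uniform distribution on [0, 1), seed 10
  .withColumn("normal", randn(10L))  // standard normal distribution, seed 10

randomValues.show(truncate = false)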

Output:

+---+-------------------+---------------------+
|id |uniform            |normal               |
+---+-------------------+---------------------+
|0  |0.41371264720975787|-0.5877482396744728  |
|1  |0.7311719281896606 |1.5746327759749246   |
|2  |0.9031701155118229 |-2.087434531229601   |
|3  |0.09430205113458567|1.0191385374853092   |
|4  |0.38340505276222947|-0.011306020094829757|
|5  |0.1982919638208397 |-0.256535324205377   |
|6  |0.12714181165849525|-0.31703264334668824 |
|7  |0.7604318153406678 |0.4977629425313746   |
|8  |0.83487085888236   |0.6400381760855594   |
|9  |0.3142596916968412 |-0.6157521958767469  |
|10 |0.12030715258495939|-0.506853671746243   |
|11 |0.12131363910425985|1.4250903895905769   |
|12 |0.4054302479603469 |0.1478840304856363   |
|13 |0.7658961595628857 |1.1431439803376258   |
|14 |0.5460182640666627 |1.4335019327105383   |
|15 |0.44292918521277047|-0.1413699193557902  |
|16 |0.8898784253886249 |0.9657665088756656   |
|17 |0.03650707717266999|-0.5021009082343131  |
|18 |0.5702126663185123 |0.07606123371426597  |
|19 |0.9212238921510436 |-0.3136534458701739  |
+---+-------------------+---------------------+
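
The values above come from the unit uniform and standard normal distributions. If other parameters are needed, one possible approach is to rescale the columns arithmetically. The sketch below assumes a Spark 2.x or later SparkSession; the bounds low and high, the target mean and sd, the row count, and the seeds are hypothetical values chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, min, rand, randn, stddev}

// Local session for experimentation; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("random-columns")
  .getOrCreate()

// Hypothetical target parameters (not from the original answer).
val (low, high) = (5.0, 15.0)   // bounds for the uniform column
val (mean, sd)  = (100.0, 20.0) // mean and standard deviation for the normal column

val scaled = spark.range(0, 100000)
  // rand is uniform on [0, 1): stretch it by (high - low), then shift by low
  .withColumn("uniform", rand(10L) * (high - low) + low)
  // randn is standard normal: multiply by sd, then add mean
  .withColumn("normal", randn(20L) * sd + mean)

// Summary statistics to sanity-check the rescaled columns.
scaled
  .select(min("uniform"), max("uniform"), avg("normal"), stddev("normal"))
  .show(truncate = false)
```

With a fixed seed the columns should be reproducible for the same input partitioning; omitting the seed argument (rand(), randn()) gives different values on each run.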
