What is the interpretation of rsd in PySpark's approx_count_distinct, and what are the consequences of changing it?

Posted by nqwrtyyt on 2023-02-07 in Spark

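PySpark's approx_count_distinct function has a precision parameter, rsd. How does it work, and what are the trade-offs of increasing or decreasing it? I assume that answering this requires understanding how approx_count_distinct is implemented. Can you help me understand rsd in the context of how approx_count_distinct works?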

pyspark

Source: https://stackoverflow.com/questions/75358896/what-is-the-interpretation-of-rsd-in-pysparks-approx-count-distinct-and-what-ar

1 Answer

ymdaylpp1#

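rsd stands for "relative standard deviation", and its default value is 0.05. As @Derek O described in a comment above, approx_count_distinct trades accuracy (which you control with the rsd parameter) against computation speed: a smaller rsd gives a more accurate estimate at the cost of more memory and computation.
A quick look at the implementation of approx_count_distinct shows that it uses the HyperLogLogPlusPlus algorithm (an improvement on the HyperLogLog algorithm).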
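To make the accuracy trade-off concrete, here is a minimal pure-Python sketch of plain HyperLogLog. This is not Spark's HyperLogLogPlusPlus, which adds bias correction and other refinements. The function name `hll_estimate`, the SHA-1 based hash and the test data are assumptions chosen only for illustration. The sketch keeps `2**p` small registers, and the estimate tends to get closer to the true count as `p` grows.

```python
import hashlib
import math


def hll_estimate(items, p):
    """Estimate the number of distinct items using 2**p registers (plain HyperLogLog).

    Hypothetical helper for illustration; assumes p >= 7 so the bias constant applies.
    """
    m = 1 << p
    registers = [0] * m
    for item in items:
        # Derive a 64-bit hash from the item's string form.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                 # the first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)    # the remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)        # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


true_count = 100_000
for p in (7, 9, 14):
    est = hll_estimate(range(true_count), p)
    rel_err = abs(est - true_count) / true_count
    print(f"p={p:2d} registers={1 << p:6d} estimate={est:10.0f} relative error={rel_err:.3%}")
```

Each register only stores a small rank, so the state stays tiny compared with an exact distinct count, but every extra bit of `p` doubles the number of registers that must be kept and merged.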

/**
   * Aggregate function: returns the approximate number of distinct items in a group.
   *
   * @param rsd maximum relative standard deviation allowed (default = 0.05)
   *
   * @group agg_funcs
   * @since 2.1.0
   */
  def approx_count_distinct(e: Column, rsd: Double): Column = withAggregateFunction {
    HyperLogLogPlusPlus(e.expr, rsd, 0, 0)
  }
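The signature above shows rsd being handed straight to HyperLogLogPlusPlus. As a rough guide to what that value controls, the sketch below uses the textbook HyperLogLog relationship, standard error ≈ 1.04 / sqrt(m) for m registers, to pick the smallest number of index bits `p` (with m = 2^p) that meets a requested rsd. The helper name `hll_precision_for_rsd` is hypothetical, and the constant and rounding that Spark uses internally may differ slightly, so treat the numbers as an approximation rather than Spark's exact behavior.

```python
import math


def hll_precision_for_rsd(rsd, constant=1.04):
    """Smallest p such that constant / sqrt(2**p) <= rsd (textbook HyperLogLog bound).

    Hypothetical helper; `constant` is the textbook value, not necessarily Spark's.
    """
    if not 0 < rsd < 1:
        raise ValueError("rsd must be in (0, 1)")
    # constant / 2**(p / 2) <= rsd  <=>  p >= 2 * log2(constant / rsd)
    return math.ceil(2 * math.log2(constant / rsd))


for rsd in (0.1, 0.05, 0.01):
    p = hll_precision_for_rsd(rsd)
    m = 1 << p
    print(f"rsd={rsd}: p={p}, registers={m}")
```

Under this approximation the default rsd of 0.05 needs 512 registers, while rsd = 0.01 needs 16384. Because m grows roughly with 1 / rsd², a 5x tighter error target costs on the order of 25x more registers, which is why lowering rsd makes the aggregation heavier.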

Apache Spark's implementation of the HyperLogLogPlusPlus algorithm is based on these papers (as of Spark v3.3.1, the current version at the time of writing):

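HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm (this link is broken, but I have included it for completeness)
HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
Appendix to HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm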

Answered 2023-02-07
