What is the interpretation of rsd in pyspark's approx_count_distinct and what are the consequences of changing it?

Question

In pyspark's approx_count_distinct function there is a precision argument, rsd. How does it work? What are the tradeoffs if it is increased or decreased? I guess that to answer this, one should understand how approx_count_distinct is implemented. Can you help me understand rsd in the context of the logic of approx_count_distinct?

Answer 1

Score: 1

rsd is an abbreviation of "relative standard deviation", and its default value is 0.05. With this value, you control how much error you are willing to accept in the distinct count. As @Derek O described in their comment above, the approx_count_distinct function trades accuracy (which you control using the rsd parameter) for speed of the calculation.
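
To see this tradeoff in action, here is a small, self-contained PySpark sketch (the data, column names, and session name are illustrative) comparing the default rsd against a looser and a tighter setting, alongside the exact distinct count:

    # Minimal sketch: how rsd affects approx_count_distinct.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("rsd-demo").getOrCreate()

    # 100,000 rows containing 10,000 distinct values.
    df = spark.range(100_000).withColumn("v", F.col("id") % 10_000)

    df.agg(
        F.countDistinct("v").alias("exact"),
        F.approx_count_distinct("v").alias("default"),          # rsd = 0.05
        F.approx_count_distinct("v", rsd=0.15).alias("loose"),  # cheaper, larger error
        F.approx_count_distinct("v", rsd=0.01).alias("tight"),  # costlier, smaller error
    ).show()

A tighter rsd tends to bring the estimate closer to the exact count, at the price of a larger sketch kept per aggregation group while the aggregation runs.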

To understand the underlying algorithm a bit more, we can have a quick look at the implementation of the approx_count_distinct function. We see that it uses the HyperLogLogPlusPlus algorithm (an improvement over the HyperLogLog algorithm).

  /**
   * Aggregate function: returns the approximate number of distinct items in a group.
   *
   * @param rsd maximum relative standard deviation allowed (default = 0.05)
   *
   * @group agg_funcs
   * @since 2.1.0
   */
  def approx_count_distinct(e: Column, rsd: Double): Column = withAggregateFunction {
    HyperLogLogPlusPlus(e.expr, rsd, 0, 0)
  }
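
The rsd value is what determines the precision of the sketch. In the classic HyperLogLog analysis, the relative standard error of the estimate is roughly 1.04 / sqrt(m), where m = 2^p is the number of registers the sketch maintains. Solving for p gives the smallest precision that satisfies the requested rsd, so halving rsd roughly quadruples the number of registers, and with them the memory held per aggregation group. Here is a back-of-the-envelope sketch of that relationship in Python (the 1.04 constant comes from the original HyperLogLog analysis; Spark's exact constant and register packing may differ):

    # Rough register-count estimate for a requested rsd, assuming the
    # classic HyperLogLog error bound rsd ~= 1.04 / sqrt(m) with m = 2**p.
    import math

    def registers_for_rsd(rsd: float) -> int:
        # Smallest p such that 1.04 / sqrt(2**p) <= rsd.
        p = math.ceil(2 * math.log2(1.04 / rsd))
        return 2 ** p

    for rsd in (0.15, 0.05, 0.01):
        print(f"rsd={rsd}: ~{registers_for_rsd(rsd)} registers")
    # rsd=0.15 -> 64 registers, rsd=0.05 -> 512, rsd=0.01 -> 16384:
    # lowering rsd grows the per-group state exponentially.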

Apache Spark's implementation of this HyperLogLogPlusPlus algorithm is based on Flajolet et al.'s original HyperLogLog paper and on "HyperLogLog in Practice" by Heule, Nunkesser, and Hall, the paper that introduced the ++ refinements (as of Spark v3.3.1, the time of writing this post).
