May 14, 2023 19:16:08

Pyspark: Standard deviation using reduce throws overflow error

Question

The OverflowError in your second method does not come from the size of your data as such. It comes from how rdd.reduce() uses your function: reduce feeds the result of one call back in as an argument to the next call, so partial receives an already-summed, already-squared value, subtracts the mean from it, and squares it again. With deviations in the millions, a few rounds of this repeated squaring push the value past the largest representable float (about 1.8e308), and Python raises OverflowError.

In your first method, each value is centered and squared exactly once and then added to a separate running total, so the total stays at the scale of the true sum of squared differences.

If you want to keep using reduce, move the squaring into a map step so that the function passed to reduce only adds two partial sums:

data = sc.textFile("data.csv")
rdd = data.map(lambda x: float(x.split(" ")[3]))
mean = rdd.mean()

# Center and square each value exactly once
squared_diffs = rdd.map(lambda x: (x - mean) ** 2)
# The reducer only adds, so feeding its result back in is harmless
part_sum = squared_diffs.reduce(lambda a, b: a + b)
variance = part_sum / rdd.count()
std_dev = variance ** 0.5

Because every data point is squared only once, the intermediate values stay within float range, and the result should match the loop-based version.

Original question:

I'm trying to calculate the standard deviation of a set of data without using rdd.stdev(). I've tried two methods and the one in which I use rdd.reduce() fails and throws the OverflowError: (34, 'Numerical result out of range') error.

If I do it this way, everything seems to work:

data = sc.textFile("data.csv") # Space-separated values
rdd  = data.map(lambda x: float(x.split(" ")[3])) # Only the fourth column
mean = rdd.mean() # High number: 1410000

rdd_list = rdd.collect()
part_sum = 0
for i in rdd_list:
    part = (i - mean) ** 2
    part_sum += part

variance = part_sum / rdd.count() # 5.8 trillion
std_dev  = variance ** 0.5 # 2425645. The same if I do rdd.stdev()

However, this way gives me an Overflow error:

data = sc.textFile("data.csv") # Same file
rdd  = data.map(lambda x: float(x.split(" ")[3])) # 4th column
mean = rdd.mean() # Same mean

def partial(x, y):
    a = (x - mean) ** 2
    b = (y - mean) ** 2
    return a + b

part_sum = rdd.reduce(partial) # This throws the error: 
File "<stdin>", line 2, in partial
OverflowError: (34, 'Numerical result out of range')

What is going on? I know rdd.stdev() gives me the correct result. I'm only trying to solve this without it as a practice exercise, and I don't understand what's happening.

Answer 1

Score: 1

My approach using reduce was wrong.
The correct method would be:

num      = rdd.count()
variance = rdd.map(lambda x: ((x - mean)**2)/num).reduce(lambda x, y: x + y)
std_dev  = variance ** 0.5
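
The same map-then-sum pattern can be checked locally without a Spark cluster. The sketch below is only a stand-in: it assumes Python's built-in map and functools.reduce on a plain list in place of rdd.map and rdd.reduce, and the sample values are made up for illustration.

```python
from functools import reduce
import statistics

values = [1.0, 5.0, 6.0]  # made-up sample, stands in for the RDD contents
mean = sum(values) / len(values)
num = len(values)

# Each element is centered, squared, and scaled exactly once ...
terms = map(lambda x: ((x - mean) ** 2) / num, values)
# ... so the reducer only ever has to add two partial results.
variance = reduce(lambda x, y: x + y, terms)
std_dev = variance ** 0.5

# Cross-check against the population statistics from the standard library
reference_std_dev = statistics.pstdev(values)
```

Because the reducer is a plain addition, it is associative and commutative, which is what rdd.reduce expects when it combines partial results from different partitions.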

Answer 2

Score: 1

The implementation of partial is not correct. Values that have already been folded into the running sum get processed again on every later call.

Let's assume the list of values consists of 1, 5 and 6. The mean would be 4 and the variance according to the first code block would be 4.67 ((9+1+4)/3).

Using the reduce mechanism, the following happens:

Step 1: partial(1, 5) returns 9 + 1 = 10
Step 2: partial(10, 6) returns 36 + 4 = 40

After dividing the result by the count, the variance is calculated as 13.33. The problem occurs in step 2, when the result from step 1 is used as input for the reduce function. It is (again) compared to the mean and added to the sum. The sums become larger and larger, resulting in an error.
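
The two steps above can be reproduced locally. This is a sketch that assumes functools.reduce as a single-partition stand-in for rdd.reduce; the helper name make_partial and the large sample values are illustrative and not taken from the original data.

```python
from functools import reduce

def make_partial(mean):
    """Build the flawed reducer from the question for a given mean."""
    def partial(x, y):
        # After the first call, x is the running result, yet it is still
        # treated like a raw data point: centered and squared again.
        return (x - mean) ** 2 + (y - mean) ** 2
    return partial

small = [1, 5, 6]
small_mean = sum(small) / len(small)
flawed_sum = reduce(make_partial(small_mean), small)     # the value from step 2
correct_sum = sum((v - small_mean) ** 2 for v in small)  # what the loop computes

# With values in the millions, the repeated squaring leaves float range
# after only a handful of elements.
large = [1.0e6, 2.0e6, 3.0e6, 4.0e6, 5.0e6, 6.0e6, 7.0e6, 8.0e6]
large_mean = sum(large) / len(large)
try:
    reduce(make_partial(large_mean), large)
    overflowed = False
except OverflowError:
    overflowed = True
```

Each call roughly squares the previous result, so the number of digits doubles at every step. That is why eight values are already enough to trigger the OverflowError here.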

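Instead of using reduce, aggregate should work. It returns the sum of squared differences, which still has to be divided by the count to get the variance:

rdd.aggregate(0.0, lambda acc, x: acc + (x - mean) ** 2, lambda a, b: a + b)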
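
To see why aggregate avoids the problem, here is a rough local emulation. It assumes the data is split into partitions held as plain Python lists. local_aggregate is a hypothetical helper written for this illustration, not a PySpark function, and it only mimics the (zeroValue, seqOp, combOp) shape of RDD.aggregate.

```python
from functools import reduce

def local_aggregate(partitions, zero, seq_op, comb_op):
    """Hypothetical stand-in for RDD.aggregate over in-memory partitions."""
    # seq_op folds raw elements into an accumulator within each partition.
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    # comb_op merges the per-partition accumulators; no squaring happens here.
    return reduce(comb_op, per_partition, zero)

partitions = [[1, 5], [6]]  # made-up split of the example values
count = sum(len(p) for p in partitions)
mean = sum(sum(p) for p in partitions) / count

sum_sq = local_aggregate(
    partitions,
    0.0,
    lambda acc, x: acc + (x - mean) ** 2,  # accumulator + one raw element
    lambda a, b: a + b,                    # accumulator + accumulator
)
variance = sum_sq / count
std_dev = variance ** 0.5
```

The key difference from reduce is that the first function receives the accumulator and a raw element as separate arguments, so only raw elements are ever centered and squared.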
