Pyspark: Standard deviation using reduce throws overflow error
Question
The overflow in your second method, where you use rdd.reduce() to calculate the standard deviation, comes from how the reducer handles intermediate results rather than from the size of the raw data alone. rdd.reduce() passes the result of one call to partial back in as an argument to a later call, so a running total that is already a sum of squares has the mean subtracted from it and is squared again. With a mean around 1410000, a handful of these repeated squarings exceeds the range of a float, and Python raises the OverflowError you're seeing.
In your first method, each deviation from the mean is squared exactly once and added to a running total, so the intermediate values stay within range.
If you want to keep using reduce and avoid the overflow error, subtract the mean from each data point and square the difference in a map step first, then use reduce only to add up the squared differences:
data = sc.textFile("data.csv")
rdd = data.map(lambda x: float(x.split(" ")[3]))
mean = rdd.mean()
# Center each value on the mean and square it once, before reducing
squared_diffs = rdd.map(lambda x: (x - mean) ** 2)
# The reducer only adds, so partial sums are never squared again
part_sum = squared_diffs.reduce(lambda a, b: a + b)
variance = part_sum / rdd.count()
std_dev = variance ** 0.5
Because the squaring happens once per element in the map step and the reducer only adds, the intermediate sums stay in the same range as in your first method.
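One way to check whether a function is safe to pass to rdd.reduce() is to try it on a few values locally. The PySpark documentation describes reduce as taking a commutative and associative binary operator. The sketch below is plain Python, not Spark; the sample values 1, 5, 6 and the names left_grouped, right_grouped, and add are chosen purely for illustration.

```python
MEAN = 4.0  # mean of the illustrative sample values 1, 5, 6


def partial(x, y):
    # The reducer from the question: treats both arguments as raw data points
    return (x - MEAN) ** 2 + (y - MEAN) ** 2


def add(x, y):
    # A plain sum: safe to apply to partial results in any grouping
    return x + y


# Grouping the same three values differently changes partial's answer
left_grouped = partial(partial(1.0, 5.0), 6.0)
right_grouped = partial(1.0, partial(5.0, 6.0))

# Squaring first and then adding gives the same total either way
sq = [(v - MEAN) ** 2 for v in (1.0, 5.0, 6.0)]
sum_left = add(add(sq[0], sq[1]), sq[2])
sum_right = add(sq[0], add(sq[1], sq[2]))

print(left_grouped, right_grouped)
print(sum_left, sum_right)
```

If the two groupings disagree, as they do for partial here, the function is not associative and its result under reduce depends on how the elements happen to be combined.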
Original question:
I'm trying to calculate the standard deviation of a set of data without using rdd.stdev(). I've tried two methods, and the one in which I use rdd.reduce() fails and throws the error OverflowError: (34, 'Numerical result out of range').
If I do it this way, everything seems to work:
data = sc.textFile("data.csv") # Space-separated values
rdd = data.map(lambda x: float(x.split(" ")[3])) # Only the fourth column
mean = rdd.mean() # High number: 1410000
rdd_list = rdd.collect()
part_sum = 0
for i in rdd_list:
    part = (i - mean) ** 2
    part_sum += part
variance = part_sum / rdd.count() # 5.8 trillion
std_dev = variance ** 0.5 # 2425645. The same if I do rdd.stdev()
However, this way gives me an Overflow error:
data = sc.textFile("data.csv") # Same file
rdd = data.map(lambda x: float(x.split(" ")[3])) # 4th column
mean = rdd.mean() # Same mean
def partial(x, y):
    a = (x - mean) ** 2
    b = (y - mean) ** 2
    return a + b
part_sum = rdd.reduce(partial) # This throws the error:
File "<stdin>", line 2, in partial
OverflowError: (34, 'Numerical result out of range')
What is going on? I know rdd.stdev() is useful and gives me the correct result. I'm just trying to solve it without it as a practice exercise, but I don't understand what's happening.
Answer 1
Score: 1
My approach using reduce was wrong.
The correct method would be:
num = rdd.count()
variance = rdd.map(lambda x: ((x - mean)**2)/num).reduce(lambda x, y: x + y)
std_dev = variance ** 0.5
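The same map-then-reduce shape can be tried without a cluster. The sketch below is plain Python that uses functools.reduce as a stand-in for rdd.reduce; the values list is a hypothetical sample, not the data from the question.

```python
from functools import reduce
import statistics

values = [1.0, 5.0, 6.0]  # hypothetical sample standing in for the RDD contents
mean = sum(values) / len(values)
num = len(values)

# Same shape as the Spark version: map each value to its share of the variance,
# then reduce with plain addition
contributions = map(lambda x: ((x - mean) ** 2) / num, values)
variance = reduce(lambda x, y: x + y, contributions)
std_dev = variance ** 0.5

# Cross-check against the population standard deviation from the standard library
reference = statistics.pstdev(values)
print(variance, std_dev, reference)
```

Dividing by num inside the map keeps each term small. Summing the squared differences first and dividing once at the end is an equally valid arrangement.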
Answer 2
Score: 1
The implementation of partial is not correct. It would add most values more than once to the sum.
Let's assume the list of values consists of 1, 5 and 6. The mean would be 4 and the variance according to the first code block would be 4.67 ((9+1+4)/3).
Using the reduce mechanism, the following happens:
- Step 1: partial(1, 5) returns 9 + 1 = 10
- Step 2: partial(10, 6) returns 36 + 4 = 40
After dividing the result by the count, the variance is calculated as 13.33. The problem occurs in step 2, when the result from step 1 is used as input for the reduce function. It is (again) compared to the mean, squared, and added to the sum. The sums become larger and larger, resulting in an error.
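The two steps above, and the eventual overflow, can be reproduced locally. This is a sketch in plain Python with functools.reduce standing in for rdd.reduce. The make_partial helper and the big list are hypothetical, with values chosen only so that their mean matches the 1410000 reported in the question.

```python
from functools import reduce


def make_partial(mean):
    # Build the question's reducer for a given mean
    def partial(x, y):
        return (x - mean) ** 2 + (y - mean) ** 2
    return partial


# Small case from the text: 1, 5, 6 with mean 4
small = [1.0, 5.0, 6.0]
wrong_sum = reduce(make_partial(4.0), small)  # 10 after step 1, 40 after step 2
wrong_variance = wrong_sum / len(small)       # about 13.33 instead of 4.67

# Larger case: hypothetical values whose mean is the one reported in the question
big_mean = 1410000.0
offsets = (-400000.0, 300000.0, 100000.0, -200000.0,
           250000.0, -150000.0, 150000.0, -50000.0)
big = [big_mean + d for d in offsets]
try:
    reduce(make_partial(big_mean), big)
    overflowed = False
except OverflowError:
    # Each step squares the previous result, so the exponent roughly doubles
    overflowed = True

print(wrong_sum, wrong_variance, overflowed)
```

With eight values the running result is squared enough times to exceed the float range, which is why the error only shows up on larger inputs while small inputs silently give a wrong variance.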
Instead of using reduce, aggregate should work:
rdd.aggregate(0, lambda a,b: a+(b-mean)**2, lambda a,b: a+b)
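To see why the two functions passed to aggregate have different jobs, here is a rough local emulation. local_aggregate is a hypothetical helper written for this sketch, not part of PySpark, and the split of the values 1, 5, 6 into two partitions is assumed for illustration.

```python
from functools import reduce


def local_aggregate(partitions, zero, seq_op, comb_op):
    # Hypothetical stand-in for rdd.aggregate: fold each partition with seq_op
    # starting from zero, then merge the per-partition results with comb_op
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, per_partition, zero)


mean = 4.0
partitions = [[1.0, 5.0], [6.0]]  # the values 1, 5, 6 split across two partitions

sum_sq = local_aggregate(
    partitions,
    0.0,
    lambda acc, value: acc + (value - mean) ** 2,  # accumulator + one data point
    lambda acc1, acc2: acc1 + acc2,                # accumulator + accumulator
)
count = sum(len(p) for p in partitions)
variance = sum_sq / count
std_dev = variance ** 0.5
print(sum_sq, variance, std_dev)
```

Note that the aggregate call yields the sum of squared differences, so it still has to be divided by the count and square-rooted to get the standard deviation.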