February 8, 2023, 09:17:51

Calculate standard deviations of estimation errors for ensemble models

Question

I have a model whose residuals I would like to analyse. Ultimately, I would like to identify extreme residuals that lie outside the confidence interval for each day, but I am having trouble calculating the pointwise standard deviation of the residuals for each model in the bagging regressor.

My sample code is below:

import datetime

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor
# Sample DataFrame
df = pd.DataFrame(np.random.randint(0, 200, size=(500, 4)), columns=list('ABCD'))
# Add dates to sample data
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(500)]
df['date'] = date_list
df['date'] = df['date'].astype('str')
# Split dataset into testing and training
# Note: the test slice starts at the 20% mark, so it overlaps the training rows
train = df[:int(len(df)*0.80)]
test = df[int(len(df)*0.20):]
X_train = train[['B', 'C', 'D', 'date']]
X_test = test[['B', 'C', 'D', 'date']]
y_train = train[['A']]
y_test = test[['A']].copy()  # copy so new columns can be added without a warning
# Function to encode the data
def encode_and_bind(data_in, feature_to_encode):
    # One-hot encode a single column and replace it with its dummy columns
    dummies = pd.get_dummies(data_in[[feature_to_encode]])
    data_out = pd.concat([data_in, dummies], axis=1)
    data_out = data_out.drop([feature_to_encode], axis=1)
    return data_out
X_train_final = encode_and_bind(X_train, 'date')
X_test_final = encode_and_bind(X_test, 'date')
# Define model (the base estimator is passed positionally)
svr_lin = SVR(kernel="linear", C=100, gamma="auto")
regr = BaggingRegressor(svr_lin, random_state=5).fit(X_train_final, y_train.values.ravel())
# Predictions
y_pred = regr.predict(X_test_final)
# Join the predictions back into the original dataframe
y_test['predict'] = y_pred
# Calculate residuals
y_test['residuals'] = y_test['A'] - y_test['predict']

I found this method online:

raw_pred = [x.predict([[0, 0, 0, 0]]) for x in regr.estimators_]

but I am not sure what to pass to x.predict([[0, 0, 0, 0]]), since I have far more than 4 features.

EDIT:

Building off of @2MuchC0ff33's answer, I tried:

stdevs = []
for date_col in X_test_final.columns[3:]:
  # Rows of the test set that fall on this date
  rows = X_test_final[X_test_final[date_col] == 1]
  # One prediction per base estimator for the first matching row
  raw_pred = [x.predict([rows.iloc[0]]) for x in regr.estimators_]
  sdev = np.std(raw_pred)
  stdevs.append(date_col + "," + str(sdev))

It seems to be correct, but I don't know enough about how these calculations are done to judge whether this works the way I think it does.

Answer 1

Score: 3

F, thanks for sharing your attempt from my answer.

I am going to try to break everything down and hopefully provide the solution you need. Apologies in advance if I repeat some of your code, but it is how my brain works, haha.

You can group the residuals by date and calculate the standard deviation for each group to calculate the pointwise standard deviation of residuals for each day. Here's how to go about it:

# y_test only holds column 'A', so take the day (first 10 characters) from df
y_test['date'] = df.loc[y_test.index, 'date'].str[:10]
grouped = y_test.groupby('date')
residual_groups = grouped['residuals']
residual_stds = residual_groups.std()

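This will give you the residual standard deviation for each day (a day with a single observation yields NaN, because the sample standard deviation needs at least two values). For each day, multiply the standard deviation by a constant such as 1.96 (for a 95% interval, assuming roughly normal residuals), then add it to and subtract it from the mean of the residuals: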

residual_means = residual_groups.mean()
CI = 1.96 * residual_stds
upper_bound = residual_means + CI
lower_bound = residual_means - CI

Finally, by comparing the residuals with the lower and upper bounds, you can identify the extreme residuals that lie outside the confidence interval for each day:

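# Map each day's bounds onto the rows so the comparison aligns by index
extreme_residuals = y_test[(y_test['residuals'] > y_test['date'].map(upper_bound)) | (y_test['residuals'] < y_test['date'].map(lower_bound))]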
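As a rough, self-contained sketch of the steps above, here is the same idea on synthetic data. Everything in it is an assumption for illustration: the five made-up days with 20 residuals each, the planted outlier, and the column names lower and upper are mine, not part of your dataset. It uses transform so that each row carries its own day's bounds.

```python
import numpy as np
import pandas as pd

# Synthetic residuals: 5 hypothetical days with 20 observations each
rng = np.random.default_rng(0)
days = pd.date_range("2023-01-01", periods=5).strftime("%Y-%m-%d").to_numpy()
dates = np.repeat(days, 20)
residuals = rng.normal(scale=1.0, size=dates.size)
residuals[7] = 15.0  # planted outlier on the first day
res = pd.DataFrame({"date": dates, "residuals": residuals})

# Per-day mean and sample standard deviation, broadcast back to every row
by_day = res.groupby("date")["residuals"]
day_mean = by_day.transform("mean")
day_std = by_day.transform("std")

# 1.96 standard deviations either side of the daily mean
res["lower"] = day_mean - 1.96 * day_std
res["upper"] = day_mean + 1.96 * day_std

# Rows whose residual falls outside their own day's bounds
extreme = res[(res["residuals"] > res["upper"]) | (res["residuals"] < res["lower"])]
```

Because the bounds live on the same rows as the residuals, the comparison lines up row by row without any extra index handling.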

You can extend this method to find the standard deviation of the ensemble's predictions for each day.

# Group the test data by day (X_test_final no longer has a 'date' column, so use X_test)
grouped = X_test_final.groupby(X_test['date'].str[:10])

stdevs = []
for name, group in grouped:
  # One array of predictions per base estimator in the ensemble
  raw_pred = [x.predict(group.to_numpy(dtype=float)) for x in regr.estimators_]
  # Calculate the standard deviation of the predictions for each group
  sdev = np.std(raw_pred)
  stdevs.append((name, sdev))
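One thing worth checking in the loop above: np.std called without an axis pools every value in raw_pred, across both estimators and rows. If a group holds more than one row, passing axis=0 gives one standard deviation per row instead. A tiny sketch with made-up numbers (three hypothetical estimators, two rows):

```python
import numpy as np

# Each inner array is one estimator's predictions for the same two rows
raw_pred = [
    np.array([10.0, 20.0]),
    np.array([12.0, 22.0]),
    np.array([14.0, 24.0]),
]

# No axis: a single number over all six values, mixing row-to-row differences in
pooled = np.std(raw_pred)

# axis=0: spread across estimators, computed separately for each row
per_row = np.std(raw_pred, axis=0)
```

With one row per day the two agree, so this only matters once a day has several observations.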

I think we could replace 0, 0, 0, 0 with a row of X_test_final. Let me know your thoughts on my updated method below:

raw_pred = [x.predict([X_test_final.iloc[0]]) for x in regr.estimators_]
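More generally, you do not have to go one row at a time: each fitted base estimator accepts a 2D array, so the whole test matrix can be passed in one call. Below is a minimal sketch on synthetic data, not your dataset. The names per_estimator and spread are mine, and the small SVR settings are only there to keep it quick. It pairs each estimator with its entry in estimators_features_, which is how I understand the ensemble calls its members internally.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR

# Synthetic regression data: 120 rows, 3 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=120)
X_train, X_test = X[:90], X[90:]
y_train, y_test = y[:90], y[90:]

# Base estimator passed positionally; 10 bootstrap replicas
regr = BaggingRegressor(SVR(kernel="linear", C=1.0), n_estimators=10, random_state=5)
regr.fit(X_train, y_train)

# Predictions of every base estimator for every test row,
# each estimator seeing the feature columns it was trained on
per_estimator = np.array([
    est.predict(X_test[:, feats])
    for est, feats in zip(regr.estimators_, regr.estimators_features_)
])  # shape: (n_estimators, n_test_rows)

# Pointwise standard deviation across the ensemble, one value per test row
spread = per_estimator.std(axis=0)

# Averaging the members reproduces the bagged prediction
ensemble_mean = per_estimator.mean(axis=0)
```

Here spread[i] is the pointwise standard deviation for test row i, and ensemble_mean should match regr.predict(X_test), which is a handy sanity check that the per-estimator predictions are lined up correctly.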
