Calculate standard deviations of estimation errors for ensemble models

Question

I have a model whose residuals I would like to analyse. Ultimately, I would like to identify extreme residuals that lie outside of the confidence interval for each day, but I am having trouble calculating the pointwise standard deviation of residuals for each model in the bagging regressor.

My sample code is below:

import datetime

import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor

# Sample DataFrame
df = pd.DataFrame(np.random.randint(0, 200, size=(500, 4)), columns=list('ABCD'))

# Add dates to sample data
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(500)]
df['date'] = date_list
df['date'] = df['date'].astype('str')

# Split dataset into training (first 80%) and testing (last 20%)
split = int(len(df) * 0.80)
train = df[:split]
test = df[split:]

X_train = train[['B','C','D','date']]
X_test = test[['B','C','D','date']]

y_train = train[['A']]
y_test = test[['A']].copy()

# Function to one-hot encode a column and drop the original
def encode_and_bind(data_in, feature_to_encode):
    dummies = pd.get_dummies(data_in[[feature_to_encode]])
    data_out = pd.concat([data_in, dummies], axis=1)
    data_out = data_out.drop([feature_to_encode], axis=1)
    return data_out

# Encode the full frame once so that train and test share the same columns
encoded = encode_and_bind(df[['B','C','D','date']], 'date')
X_train_final = encoded[:split]
X_test_final = encoded[split:]

# Define Model (the base model is passed positionally because the keyword was
# renamed from base_estimator to estimator in scikit-learn 1.2)
svr_lin = SVR(kernel="linear", C=100, gamma="auto")
regr = BaggingRegressor(svr_lin, random_state=5).fit(X_train_final, y_train.values.ravel())

# Predictions
y_pred = regr.predict(X_test_final)

# Join the predictions back into original dataframe
y_test['predict'] = y_pred

# Calculate residuals
y_test['residuals'] = y_test['A'] - y_test['predict']

I found this method online

raw_pred = [x.predict([[0, 0, 0, 0]]) for x in regr.estimators_]

but am not sure of what to use for the x.predict([[0, 0, 0, 0]]) part since I have far more than 4 features.

EDIT:

Building off of @2MuchC0ff33's answer I tried

stdevs = []

for dates in X_test_final.columns[3:]:
  day_rows = X_test_final[X_test_final[dates]==1]
  if day_rows.empty:
    continue  # this date only occurs in the training data
  raw_pred = [x.predict(day_rows.iloc[[0]].to_numpy(dtype=float)) for x in regr.estimators_]

  sdev = np.std(raw_pred)
  sdev = sdev.astype('str')
  stdevs.append(dates + "," + sdev)

It seems to be correct, but I don't know enough about how these calculations are done to judge whether it is working the way I think it is.

Answer 1

Score: 3

F, thanks for sharing your attempt from my answer.

I am going to try to break everything down and hopefully provide the solution you need. Apologies in advance if I repeat some of your code, but it is how my brain works haha.

To get the standard deviation of the residuals for each day, you can group the residuals by date and calculate the standard deviation of each group. Here's how to go about it:

# y_test has no date column, so take the day (YYYY-MM-DD) from X_test, which shares its index
y_test['date'] = X_test['date'].str[:10]
grouped = y_test.groupby('date')
residual_groups = grouped['residuals']
residual_stds = residual_groups.std()

This will give you the residual standard deviation for each day. For each day, multiply the standard deviation by a constant such as 1.96 (for an approximate 95% interval, assuming roughly normal residuals), then add it to and subtract it from the mean of the residuals to get the upper and lower bounds.

residual_means = residual_groups.mean()
CI = 1.96 * residual_stds
upper_bound = residual_means + CI
lower_bound = residual_means - CI

Finally, by comparing the residuals with the lower and upper bounds, you can identify the extreme residuals that lie outside the confidence interval for each day:

# The bounds are indexed by day, so map them back onto the rows before comparing
extreme_residuals = y_test[(y_test['residuals'] > y_test['date'].map(upper_bound)) | (y_test['residuals'] < y_test['date'].map(lower_bound))]
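The grouping, bounds and flagging steps above can also be sketched on a small made-up frame. This is only an illustration under stated assumptions: the data and the names frame, by_day, day_mean and day_std are hypothetical, and it uses groupby(...).transform so that every row carries its own day's bounds. Note that pandas' default sample standard deviation is NaN for a day with a single row, so the sketch uses ten rows per day.

```python
import pandas as pd

# Hypothetical residuals: ten observations per day, one large value on the second day
base = [1.0, -1.0, 2.0, -2.0, 1.0, -1.0, 2.0, -2.0, 0.0]
frame = pd.DataFrame({
    "date": ["2023-02-06"] * 10 + ["2023-02-07"] * 10,
    "residuals": base + [0.0] + base + [40.0],
})

by_day = frame.groupby("date")["residuals"]

# transform() returns one value per row, aligned with the original index
day_mean = by_day.transform("mean")
day_std = by_day.transform("std")  # sample std (ddof=1); NaN for single-row days

frame["upper"] = day_mean + 1.96 * day_std
frame["lower"] = day_mean - 1.96 * day_std

# Flag rows whose residual falls outside their own day's bounds
frame["extreme"] = (frame["residuals"] > frame["upper"]) | (frame["residuals"] < frame["lower"])
extreme_rows = frame[frame["extreme"]]
print(extreme_rows)
```

In this made-up data only the 40.0 on the second day falls outside its bounds. A large value also inflates its own day's standard deviation, so with very few rows per day this kind of check may flag nothing at all.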

You can extend this method to find the standard deviation of the individual estimators' predictions for each day.

# Group the test data by day; X_test_final no longer has a date column,
# so take the day from X_test, which shares its index
grouped = X_test_final.groupby(X_test['date'].str[:10])

stdevs = []
for name, group in grouped:
  raw_pred = [x.predict(group.to_numpy(dtype=float)) for x in regr.estimators_]
  # Calculate the standard deviation of the predictions for each group
  sdev = np.std(raw_pred)
  stdevs.append((name, sdev))
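To see what the spread across estimators looks like for a whole test set at once, here is a minimal sketch on synthetic data. It is an assumption-laden illustration rather than the setup from the question: it uses LinearRegression as the base model to keep it fast, and the names model, per_estimator, pointwise_std and overall_std are made up. It stacks each fitted estimator's predictions into a 2-D array and takes the standard deviation along the estimator axis.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic regression data (hypothetical, only for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

model = BaggingRegressor(LinearRegression(), n_estimators=25, random_state=5)
model.fit(X_train, y_train)

# One row per estimator, one column per test sample.
# estimators_features_ holds the feature indices each estimator was trained on.
per_estimator = np.stack([
    est.predict(X_test[:, feats])
    for est, feats in zip(model.estimators_, model.estimators_features_)
])

pointwise_std = per_estimator.std(axis=0)  # spread across estimators, one value per sample
overall_std = per_estimator.std()          # a single number over all estimators and samples

print(per_estimator.shape, pointwise_std.shape)
```

With a single row per group, as in the loop above, the two numbers coincide. With several rows per group, np.std(raw_pred) without an axis mixes the spread across estimators with the spread across rows, so axis=0 is the one to use for a pointwise value.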

I think we could replace [0, 0, 0, 0] with a row of X_test_final. Let me know your thoughts on my updated method below:

raw_pred = [x.predict([X_test_final.iloc[0]]) for x in regr.estimators_]
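As a rough illustration of what goes in place of [0, 0, 0, 0]: predict expects a 2-D input with one row per sample and as many columns as the model was trained on. The sketch below uses made-up data with six features and a DecisionTreeRegressor base model, both of which are assumptions for the example and not part of the question.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data with six features instead of four
rng = np.random.default_rng(1)
X = rng.uniform(0, 200, size=(100, 6))
y = X.sum(axis=1) + rng.normal(size=100)

model = BaggingRegressor(
    DecisionTreeRegressor(max_depth=3), n_estimators=10, random_state=5
).fit(X, y)

row = X[0]                   # shape (6,): one sample as a 1-D array
row_2d = row.reshape(1, -1)  # shape (1, 6): the 2-D form predict expects

# One prediction per fitted estimator for that single sample
single_preds = np.array([est.predict(row_2d)[0] for est in model.estimators_])
spread = single_preds.std()

print(single_preds.shape, spread)
```

Wrapping a row in a list, as in [X_test_final.iloc[0]], has the same effect as the reshape here: it turns one sample into a one-row 2-D input.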

huangapple
  • Published on 2023-02-08 09:17:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75380520.html