Loop to find a maximum R2 in python

Question
I am trying to build a decision tree while optimizing which sampled values to use.
I am using a group of values like:
DATA1 DATA2 DATA3 VALUE
100 300 400 1.6
102 298 405 1.5
88 275 369 1.9
120 324 417 0.9
103 297 404 1.7
110 310 423 1.1
105 297 401 0.7
099 309 397 1.6
My goal is to build a decision tree that can predict the target value from Data1, Data2 and Data3.
I have started by building a classification forest that gives me a coefficient of determination as a result. I attach it below:
import pandas as pd
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Data
X = dfs.drop(columns='Dato a predecir')
y = dfs['Dato a predecir']

# 70% of the dataset for training and 30% for validation
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.7,
                                                    random_state=0)

# Create the model to fit
bosque = RandomForestClassifier(n_estimators=71,
                                criterion="gini",
                                max_features="sqrt",
                                bootstrap=True,
                                max_samples=2/3,
                                oob_score=True)
bosque.fit(X_train, y_train)
y_pred = bosque.predict(X_test)

r, p = stats.pearsonr(y_pred, y_test)
print(f"Correlación Pearson: r={r}, p-value={p}")
Starting from this code, and thanks to bootstrap=True, I get a new set of training data and a new coefficient of determination every time I run it.
Can anyone help me loop this code so that it finds the maximum value of the coefficient of determination and saves the training data that produced it, so I can build the optimal decision tree?
I have tried a for loop, but it doesn't really work. It is the following:
for i in range(10000):
    while r < 1:
        Arbol_decisión(X, y)
        r = r
        i = i + 1
The range I used does not represent all the data I have, and I would need to cover the maximum possible number of combinations of my data; the letter "r" represents the value of the coefficient of determination. I am aware that the loop I wrote is naive, but the truth is I can't work out how to achieve this. Could you help me?
Many thanks for everything.
I am trying to write loops that generate as many resampled matrices as possible and so optimize my decision tree.
Answer 1
Score: 0
Firstly, you need to use a validation set AND a test set if you're going to approach it like this. Otherwise you will just have biased results and likely a model which is essentially overfit to the testing data.
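To make that first point concrete, a two-stage split is one common way to get separate validation and test sets; this is a minimal sketch using toy arrays in place of the asker's dfs (the numbers here are made up). You would pick the best bootstrapped sample by its validation score and report the test score only once, at the end:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (hypothetical values)
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) / 10

# First split off the test set (20% of the total), then split the
# remainder into train (60% of total) and validation (20% of total).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The validation set is what the selection loop is allowed to look at; the test set stays untouched until the very end.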
Secondly, if you are only randomly sampling your data (that's what bootstrap does), then all these results are telling you is that your dataset isn't great. Ideally a dataset should represent samples from the underlying distribution. Therefore, using more data is better as your model can more effectively learn the underlying distribution. In your case you are approaching the problem from the perspective that some of your data does NOT represent the underlying distribution (that's why you want to ignore it). If this is the case, then you should just clean your data properly in advance. If you can't figure out a way to identify these 'bad' data points, then I would not suggest messing around with this - since you would just be cherry-picking data and producing a bad model.
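One standard, principled way to drop "bad" points in advance (rather than hoping the bootstrap lottery skips them) is an interquartile-range filter; this is a generic sketch on made-up numbers, not the asker's data:

```python
import numpy as np

# Hypothetical 1-D feature with two obvious outliers at the end
data = np.array([1.6, 1.5, 1.9, 0.9, 1.7, 1.1, 0.7, 1.6, 9.9, -5.0])

# Keep only points within 1.5 * IQR of the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)

clean = data[mask]  # the two extreme points are dropped
```

Cleaning like this happens once, before any splitting or fitting, so every model sees the same (defensibly chosen) data.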
I would generally suggest you pause on writing code and read up on the theory behind decision trees, random forests and bootstrapping; otherwise you are likely to keep designing poor ML experiments.
If for some reason you think this is still a good approach (it's almost certainly not), then just do the bootstrapping yourself... Something like the code below (there is probably a more optimised solution).
import numpy as np

X = np.arange(1000)
y = np.arange(1000) / 100

# Selecting random train/val/test datasets
# Define slices for 60% train, 20% val, 20% test
train_size = slice(0, int(len(X) * 0.6))
val_size = slice(int(len(X) * 0.6), int(len(X) * 0.8))
test_size = slice(int(len(X) * 0.8), len(X))

# Randomise the indices corresponding to X and y
# (same length, so only do this once)
rnd_idx = np.random.choice(np.arange(len(X)), len(X), replace=False)

# Slice the randomised indices into three non-overlapping datasets.
X_tr, X_va, X_te = [X[rnd_idx[sliced]] for sliced in [train_size, val_size, test_size]]
y_tr, y_va, y_te = [y[rnd_idx[sliced]] for sliced in [train_size, val_size, test_size]]

###
### Define random forest here
###

# Define the bootstrap size and method
# Here we are sub-selecting 90% of the training data
bootstrap_size = slice(0, int(len(X_tr) * 0.9))
# And sampling with replacement, so expect roughly 30% duplicates.
replace = True

# Define an acceptable threshold for performance
acceptable_r = 0.9
# Set initial value (non-physically low)
r = -10

# Repeat until the performance is acceptable
while r < acceptable_r:
    # Create randomised indices corresponding to the training set
    rnd_idx2 = np.random.choice(np.arange(len(X_tr)), len(X_tr), replace=replace)
    # Sub-select the bootstrapped training data
    X_tr_s = X_tr[rnd_idx2[bootstrap_size]]
    y_tr_s = y_tr[rnd_idx2[bootstrap_size]]
    ###
    ### Fit model here
    ###
    ###
    ### Apply to validation data here
    ###
    ###
    ### Calculate metric here
    ###
    r = r  # placeholder: assign the computed metric to r

###
### Apply to testing data here
###
Once the while loop exits, you can retrieve the corresponding training data and the indices and the model etc.
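If you prefer a fixed number of draws to an open-ended while loop, the bookkeeping for "save the indices of the best draw" might look like the sketch below, where score() is a hypothetical stand-in for fitting the model and evaluating it on the validation set:

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = np.arange(100)  # stand-in training data
best = {"r": -np.inf, "idx": None}

def score(idx):
    # Hypothetical placeholder: fit on X_tr[idx], evaluate on
    # validation data, return the metric. Here just a random number.
    return rng.random()

# Run a fixed number of bootstrap draws and keep whichever
# set of indices scored best.
for _ in range(50):
    idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)
    r = score(idx)
    if r > best["r"]:
        best = {"r": r, "idx": idx}

# best["idx"] now recovers the training sample that scored highest
X_best = X_tr[best["idx"]]
```

Keeping the indices (rather than the resampled arrays) is cheaper and lets you rebuild the exact training set, and refit the model on it, later.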