Decision Tree Classifier // Accuracy Score

# Question

I printed the best accuracy with `print("Best accuracy:", best_accuracy)` to the console, and it shows me `Best accuracy: 0.88`, whereas the accuracy of my final model is `Accuracy: 0.83`.

Is there anything I can change in the code or the parameters so that the final model actually reaches the best accuracy?

```
Best accuracy: 0.8878504672897196
Best model: DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=6,
                       min_samples_split=5)
Accuracy: 0.8333333333333334
Klassifikation Report:
               precision    recall  f1-score   support

           0       0.83      0.99      0.90        83
           1       0.89      0.32      0.47        25

    accuracy                           0.83       108
   macro avg       0.86      0.65      0.69       108
weighted avg       0.84      0.83      0.80       108
```

The relevant parameters and training loop are the following:

```python
# Training, validation and test set split
X_train, X_val_test, Y_train, Y_val_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_test, Y_val_test, test_size=0.5, random_state=42)

best_clf = None
best_accuracy = 0.0

# Loop over different max_depths
for max_depth in range(1, 20):

    # Decision Tree Classifier and Training
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_split=5, min_samples_leaf=6)
    clf.fit(X_train, Y_train)
```

See the entire code here for more information:

```python
import numpy as np
import pandas as pd
import graphviz
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import export_graphviz
import pydotplus
from sklearn.tree._tree import TREE_LEAF, TREE_UNDEFINED
import matplotlib.pyplot as plt

# Load CSV file
data = pd.read_csv("Basis_DecisionTree_1106.csv", sep=";", header=0)

# Exclude Personalnummer
data = data.drop(columns=['PersNr'])

# Convert the categorical variable Maschine
data['Maschine'].replace(['Stock Order', 'New Machine'], [0, 1], inplace=True)

# Convert the columns 'Gehalt_PY' and 'AU_Detail' from string to float
def convert_currency(val):
    new_val = val.replace('.', '').replace(',', '.')
    return float(new_val)

def convert_decimal(val):
    new_val = val.replace(',', '.')
    return float(new_val)

data['Gehalt_PY'] = data['Gehalt_PY'].apply(convert_currency)
data['AU_Detail'] = data['AU_Detail'].apply(convert_decimal)
data['VZK_PY'] = data['VZK_PY'].apply(convert_decimal)

# Apply one-hot encoding to the column 'Arbeitsplatz_Technologie'
one_hot = pd.get_dummies(data['Arbeitsplatz_Technologie'])
# Drop column 'Arbeitsplatz_Technologie' as it is now encoded
data = data.drop('Arbeitsplatz_Technologie', axis=1)
# Join the encoded df
data = data.join(one_hot)

# X & Y variables
feature_names = ['Age', 'Company Affiliation', 'AU_Detail', 'VZK_PY', 'P_noTravelDays', 'Marital Status', 'Children', 'Salary_PY', 'Machine', 'P_TravelDays', 'AU', 'Presence_PY', 'P_A', 'P_AP', 'P_C', 'P_EW', 'P_EUR', 'P_GER', 'P_MEA', 'P_NAmerica', 'P_SAmerica', 'Dispatching_Level'] + list(one_hot.columns)

Y = data['Churn']
X = data[feature_names]

# Training, validation and test set split
X_train, X_val_test, Y_train, Y_val_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_test, Y_val_test, test_size=0.5, random_state=42)

best_clf = None
best_accuracy = 0.0

# Loop over different max_depths
for max_depth in range(1, 20):

    # Decision Tree Classifier and Training
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, min_samples_split=5, min_samples_leaf=6)
    clf.fit(X_train, Y_train)

    # Predictions on the validation set
    Y_val_pred = clf.predict(X_val)

    # Evaluate the predictions
    accuracy = accuracy_score(Y_val, Y_val_pred)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_clf = clf

print("Best accuracy:", best_accuracy)
print("Best model:", best_clf)

# Predictions on the test set with the best model
Y_test_pred = best_clf.predict(X_test)

# Post-pruning
def is_leaf(inner_tree, index):
    # Check whether node is a leaf node
    return (inner_tree.children_left[index] == TREE_LEAF and
            inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
    # Start pruning from the bottom - if we start from the top, we might miss
    # nodes that become leaves during pruning.
    # Do not use this directly - use prune_duplicate_leaves instead.
    if not is_leaf(inner_tree, inner_tree.children_left[index]):
        prune_index(inner_tree, decisions, inner_tree.children_left[index])
    if not is_leaf(inner_tree, inner_tree.children_right[index]):
        prune_index(inner_tree, decisions, inner_tree.children_right[index])

    # Prune children if both children are leaves now and make the same decision:
    if (is_leaf(inner_tree, inner_tree.children_left[index]) and
        is_leaf(inner_tree, inner_tree.children_right[index]) and
        (decisions[index] == decisions[inner_tree.children_left[index]]) and
        (decisions[index] == decisions[inner_tree.children_right[index]])):
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
        inner_tree.feature[index] = TREE_UNDEFINED
        ##print("Pruned {}".format(index))

def prune_duplicate_leaves(mdl):
    # Remove leaves if both siblings are leaves and make the same decision
    decisions = mdl.tree_.value.argmax(axis=2).flatten().tolist()  # Decision for each node
    prune_index(mdl.tree_, decisions)

# Feature importance
importance = best_clf.feature_importances_

# Create a DataFrame from features and their importance
feature_importance = pd.DataFrame(list(zip(feature_names, importance)),
                                  columns=['Feature', 'Importance'])

# Sort the DataFrame by importance
feature_importance = feature_importance.sort_values('Importance', ascending=False)

# Display the sorted feature importance
print(feature_importance)

# Plot the feature importance
plt.bar(feature_importance['Feature'], feature_importance['Importance'])
plt.xticks(rotation='vertical')
plt.show()

# Summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %s, Score: %.5f' % (feature_names[i], v))

# Plot feature importance
plt.bar([x for x in range(len(importance))], importance)
plt.xticks([x for x in range(len(importance))], feature_names, rotation='vertical')
plt.show()

# Evaluate the predictions
accuracy = accuracy_score(Y_test, Y_test_pred)
print("Accuracy:", accuracy)

classificationReport = classification_report(Y_test, Y_test_pred)
print("Classification Report:\n", classificationReport)

confusionMatrix = confusion_matrix(Y_test, Y_test_pred)
print("Confusion Matrix:\n", confusionMatrix)

# Visualizing the decision tree with Graphviz
dot_data = export_graphviz(best_clf, out_file=None, feature_names=feature_names, class_names=["0", "1"], filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('prediction_tree_pruning.png')
```

Change of parameters: `max_depth=4, min_samples_split=5`

# Answer 1

**Score**: 0
```python
X_val, X_test, Y_val, Y_test = train_test_split(X_val_test, Y_val_test, ...)
```

The reason "Best accuracy" and "Accuracy" are different is that you get "Best accuracy" from X_val/Y_val,

```python
accuracy_score(Y_val, Y_val_pred)
```

and "Accuracy" from X_test/Y_test,

```python
accuracy = accuracy_score(Y_test, Y_test_pred)
```

Since the data in X_val/Y_val and X_test/Y_test are different, you shouldn't expect the same score on both of them.
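
To make the distinction concrete, here is a minimal sketch (reusing `best_clf`, `X_val`/`Y_val`, and `X_test`/`Y_test` exactly as they appear in the question) that computes both numbers side by side; the two held-out sets contain different rows, so the two scores generally differ:

```python
from sklearn.metrics import accuracy_score

# Same fitted model, two different held-out sets.
val_accuracy = accuracy_score(Y_val, best_clf.predict(X_val))     # the "Best accuracy" (0.8878... above)
test_accuracy = accuracy_score(Y_test, best_clf.predict(X_test))  # the final "Accuracy" (0.8333... above)

print("Validation accuracy:", val_accuracy)
print("Test accuracy:", test_accuracy)
```

Note that `best_clf` was selected precisely because it maximized the validation score, so the validation number is an optimistically biased estimate; the untouched test set exists to give the unbiased final figure.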
