
Decision Tree Classifier // Accuracy Score

# Question

I printed the best accuracy with `print("Best accuracy:", best_accuracy)` in the console, and it shows `Best accuracy: 0.88`, whereas the accuracy of my specific model is `Accuracy: 0.83`.

Is there any way to change the code or the parameters to find out how to get to the best accuracy?

```
Best accuracy: 0.8878504672897196
Best model: DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=6,
                       min_samples_split=5)
Accuracy: 0.8333333333333334
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.99      0.90        83
           1       0.89      0.32      0.47        25

    accuracy                           0.83       108
   macro avg       0.86      0.65      0.69       108
weighted avg       0.84      0.83      0.80       108
```

The parameters are as follows:

```python
# Training, Validation and Test set split
X_train, X_val_test, Y_train, Y_val_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_test, Y_val_test, test_size=0.5, random_state=42)

best_clf = None
best_accuracy = 0.0

# Loop over different max_depths
for max_depth in range(1, 20):
    # Decision Tree Classifier and Training
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_split=5, min_samples_leaf=6)
    clf.fit(X_train, Y_train)
```

See the entire code below for more information:

```python
import numpy as np
import pandas as pd
import graphviz
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import export_graphviz
import pydotplus
from sklearn.tree._tree import TREE_LEAF, TREE_UNDEFINED
import matplotlib.pyplot as plt

# Load CSV file
data = pd.read_csv("Basis_DecisionTree_1106.csv", sep=";", header=0)

# Exclude Personalnummer and Maschine
data = data.drop(columns=['PersNr'])

# Convert the categorical variable Maschine
data['Maschine'].replace(['Stock Order', 'New Machine'], [0, 1], inplace=True)

# Convert the columns 'Gehalt_PY' and 'AU_Detail' from string to float
def convert_currency(val):
    new_val = val.replace('.', '').replace(',', '.')
    return float(new_val)

def convert_decimal(val):
    new_val = val.replace(',', '.')
    return float(new_val)

data['Gehalt_PY'] = data['Gehalt_PY'].apply(convert_currency)
data['AU_Detail'] = data['AU_Detail'].apply(convert_decimal)
data['VZK_PY'] = data['VZK_PY'].apply(convert_decimal)

# Apply One-Hot-Encoding to the column 'Arbeitsplatz_Technologie'
one_hot = pd.get_dummies(data['Arbeitsplatz_Technologie'])
# Drop column 'Arbeitsplatz_Technologie' as it is now encoded
data = data.drop('Arbeitsplatz_Technologie', axis=1)
# Join the encoded df
data = data.join(one_hot)

# X & Y Variables
feature_names = ['Age', 'Company Affiliation', 'AU_Detail', 'VZK_PY', 'P_noTravelDays', 'Marital Status', 'Children', 'Salary_PY', 'Machine', 'P_TravelDays', 'AU', 'Presence_PY', 'P_A', 'P_AP', 'P_C', 'P_EW', 'P_EUR', 'P_GER', 'P_MEA', 'P_NAmerica', 'P_SAmerica', 'Dispatching_Level'] + list(one_hot.columns)
Y = data['Churn']
X = data[feature_names]

# Training, Validation and Test set split
X_train, X_val_test, Y_train, Y_val_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_test, Y_val_test, test_size=0.5, random_state=42)

best_clf = None
best_accuracy = 0.0

# Loop over different max_depths
for max_depth in range(1, 20):
    # Decision Tree Classifier and Training
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, min_samples_split=5, min_samples_leaf=6)
    clf.fit(X_train, Y_train)

    # Predictions on the validation set
    Y_val_pred = clf.predict(X_val)

    # Evaluate the predictions
    accuracy = accuracy_score(Y_val, Y_val_pred)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_clf = clf

print("Best accuracy:", best_accuracy)
print("Best model:", best_clf)

# Predictions on the test set with the best model
Y_test_pred = best_clf.predict(X_test)

# Post-Pruning
def is_leaf(inner_tree, index):
    # Check whether node is leaf node
    return (inner_tree.children_left[index] == TREE_LEAF and
            inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
    # Start pruning from the bottom - if we start from the top, we might miss
    # nodes that become leaves during pruning.
    # Do not use this directly - use prune_duplicate_leaves instead.
    if not is_leaf(inner_tree, inner_tree.children_left[index]):
        prune_index(inner_tree, decisions, inner_tree.children_left[index])
    if not is_leaf(inner_tree, inner_tree.children_right[index]):
        prune_index(inner_tree, decisions, inner_tree.children_right[index])

    # Prune children if both children are leaves now and make the same decision:
    if (is_leaf(inner_tree, inner_tree.children_left[index]) and
        is_leaf(inner_tree, inner_tree.children_right[index]) and
        (decisions[index] == decisions[inner_tree.children_left[index]]) and
        (decisions[index] == decisions[inner_tree.children_right[index]])):
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
        inner_tree.feature[index] = TREE_UNDEFINED
        ##print("Pruned {}".format(index))

def prune_duplicate_leaves(mdl):
    # Remove leaves if both
    decisions = mdl.tree_.value.argmax(axis=2).flatten().tolist()  # Decision for each node
    prune_index(mdl.tree_, decisions)

# Feature Importance
importance = best_clf.feature_importances_

# Create a DataFrame from Features and their Importance
feature_importance = pd.DataFrame(list(zip(feature_names, importance)),
                                  columns=['Feature', 'Importance'])

# Sort the DataFrame by Importance
feature_importance = feature_importance.sort_values('Importance', ascending=False)

# Display the sorted Feature Importance
print(feature_importance)

# Plot the Feature Importance
plt.bar(feature_importance['Feature'], feature_importance['Importance'])
plt.xticks(rotation='vertical')
plt.show()

# Feature Importance
importance = best_clf.feature_importances_
# summarizing feature importance
for i, v in enumerate(importance):
    print('Feature: %s, Score: %.5f' % (feature_names[i], v))
# plot feature importance
plt.bar([x for x in range(len(importance))], importance)
plt.xticks([x for x in range(len(importance))], feature_names, rotation='vertical')
plt.show()

# Evaluate the predictions
accuracy = accuracy_score(Y_test, Y_test_pred)
print("Accuracy:", accuracy)

classificationReport = classification_report(Y_test, Y_test_pred)
print("Classification Report:\n", classificationReport)

confusionMatrix = confusion_matrix(Y_test, Y_test_pred)
print("Confusion Matrix:\n", confusionMatrix)

# Visualizing the decision tree with Graphviz
dot_data = export_graphviz(best_clf, out_file=None, feature_names=feature_names, class_names=["0", "1"], filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('prediction_tree_pruning.png')
```
  89. <details>
  90. <summary>英文:</summary>
  91. I printed the &quot;Best accuracy&quot; `print(&quot;Best accuracy:&quot;, best_accuracy)` for my model within the console, and it shows me `Best accuracy: 0.88`, whereas the accuracy for my specific model is `Accuracy: 0.83`
  92. Is there any chance to change anything on code or parameters to find out how to get to the `best accuarcy`

Best accuracy: 0.8878504672897196
Best model: DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=6,
min_samples_split=5)
Accuracy: 0.8333333333333334
Klassifikation Report:
precision recall f1-score support

  1. 0 0.83 0.99 0.90 83
  2. 1 0.89 0.32 0.47 25
  3. accuracy 0.83 108

macro avg 0.86 0.65 0.69 108
weighted avg 0.84 0.83 0.80 108

  1. Parameters are following:

Training, Validation and Test set split

X_train, X_val_test, Y_train, Y_val_test = train_test_split(X, Y, test_size=0.2 , random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_test, Y_val_test, test_size=0.5, random_state=42)

best_clf = None
best_accuracy = 0.0

Loop over different max_depths

for max_depth in range(1, 20):

  1. # Decision Tree Classifier and Training
  2. clf = DecisionTreeClassifier(criterion=&quot;entropy&quot;, max_depth=4, min_samples_split=5, min_samples_leaf=6 )
  3. clf.fit(X_train, Y_train)
  1. See entire code here for more information:

import numpy as np
import pandas as pd
import graphviz
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import export_graphviz
import pydotplus
from sklearn.tree._tree import TREE_LEAF, TREE_UNDEFINED
import matplotlib.pyplot as plt

Load CSV file

data = pd.read_csv("Basis_DecisionTree_1106.csv", sep=";", header=0)

Exclude Personalnummer and Maschine

data = data.drop(columns=['PersNr',])

Convert the categorical variable Maschine

data['Maschine'].replace(['Stock Order', 'New Machine'], [0, 1], inplace=True)

Convert the columns 'Gehalt_PY' and 'AU_Detail' from string to float

def convert_currency(val):
new_val = val.replace('.', '').replace(',', '.')
return float(new_val)

def convert_decimal(val):
new_val = val.replace(',', '.')
return float(new_val)

data['Gehalt_PY'] = data['Gehalt_PY'].apply(convert_currency)
data['AU_Detail'] = data['AU_Detail'].apply(convert_decimal)
data['VZK_PY'] = data['VZK_PY'].apply(convert_decimal)

Apply One-Hot-Encoding to the column 'Arbeitsplatz_Technologie'

one_hot = pd.get_dummies(data['Arbeitsplatz_Technologie'])

Drop column 'Arbeitsplatz_Technologie' as it is now encoded

data = data.drop('Arbeitsplatz_Technologie',axis = 1)

Join the encoded df

data = data.join(one_hot)

X & Y Variables

feature_names = ['Age', 'Company Affiliation', 'AU_Detail', 'VZK_PY', 'P_noTravelDays', 'Marital Status', 'Children', 'Salary_PY', 'Machine', 'P_TravelDays', 'AU', 'Presence_PY', 'P_A', 'P_AP', 'P_C', 'P_EW', 'P_EUR', 'P_GER', 'P_MEA', 'P_NAmerica', 'P_SAmerica', 'Dispatching_Level'] + list(one_hot.columns)

Y = data['Churn']
X = data[feature_names]

Training, Validation and Test set split

X_train, X_val_test, Y_train, Y_val_test = train_test_split(X, Y, test_size=0.2 , random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_test, Y_val_test, test_size=0.5, random_state=42)

best_clf = None
best_accuracy = 0.0

Loop over different max_depths

for max_depth in range(1, 20):

  1. # Decision Tree Classifier and Training
  2. clf = DecisionTreeClassifier(criterion=&quot;entropy&quot;, max_depth=4, min_samples_split=5, min_samples_leaf=6 )
  3. clf.fit(X_train, Y_train)
  4. # Predictions on the validation set
  5. Y_val_pred = clf.predict(X_val)
  6. # Evaluate the predictions
  7. accuracy = accuracy_score(Y_val, Y_val_pred)
  8. if accuracy &gt; best_accuracy:
  9. best_accuracy = accuracy
  10. best_clf = clf

print("Best accuracy:", best_accuracy)
print("Best model:", best_clf)

Predictions on the test set with the best model

Y_test_pred = best_clf.predict(X_test)

Post-Pruning

def is_leaf(inner_tree, index):
# Check whether node is leaf node
return (inner_tree.children_left[index] == TREE_LEAF and
inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
# Start pruning from the bottom - if we start from the top, we might miss
# nodes that become leaves during pruning.
# Do not use this directly - use prune_duplicate_leaves instead.
if not is_leaf(inner_tree, inner_tree.children_left[index]):
prune_index(inner_tree, decisions, inner_tree.children_left[index])
if not is_leaf(inner_tree, inner_tree.children_right[index]):
prune_index(inner_tree, decisions, inner_tree.children_right[index])

  1. # Prune children if both children are leaves now and make the same decision:
  2. if (is_leaf(inner_tree, inner_tree.children_left[index]) and
  3. is_leaf(inner_tree, inner_tree.children_right[index]) and
  4. (decisions[index] == decisions[inner_tree.children_left[index]]) and
  5. (decisions[index] == decisions[inner_tree.children_right[index]])):
  6. # turn node into a leaf by &quot;unlinking&quot; its children
  7. inner_tree.children_left[index] = TREE_LEAF
  8. inner_tree.children_right[index] = TREE_LEAF
  9. inner_tree.feature[index] = TREE_UNDEFINED
  10. ##print(&quot;Pruned {}&quot;.format(index))

def prune_duplicate_leaves(mdl):
# Remove leaves if both
decisions = mdl.tree_.value.argmax(axis=2).flatten().tolist() # Decision for each node
prune_index(mdl.tree_, decisions)

Feature Importance

importance = best_clf.feature_importances_

Create a DataFrame from Features and their Importance

feature_importance = pd.DataFrame(list(zip(feature_names, importance)),
columns = ['Feature', 'Importance'])

Sort the DataFrame by Importance

feature_importance = feature_importance.sort_values('Importance', ascending = False)

Display the sorted Feature Importance

print(feature_importance)

Plot the Feature Importance

plt.bar(feature_importance['Feature'], feature_importance['Importance'])
plt.xticks(rotation='vertical')
plt.show()

# Feature Importance

importance = best_clf.feature_importances_

# summarizing feature importance

for i,v in enumerate(importance):

print('Feature: %s, Score: %.5f' % (feature_names[i],v))

# plot feature importance

plt.bar([x for x in range(len(importance))], importance)

plt.xticks([x for x in range(len(importance))], feature_names, rotation='vertical')

plt.show()

Evaluate the predictions

accuracy = accuracy_score(Y_test, Y_test_pred)
print("Accuracy:", accuracy)

classificationReport = classification_report(Y_test, Y_test_pred)
print("Classification Report:\n", classificationReport)

confusionMatrix = confusion_matrix(Y_test, Y_test_pred)
print("Confusion Matrix:\n", confusionMatrix)

Visualizing the decision tree with Graphviz

dot_data = export_graphviz(best_clf, out_file=None, feature_names=feature_names, class_names=["0", "1"], filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('prediction_tree_pruning.png')

Change of parameters: `max_depth=4, min_samples_split=5`
# Answer 1

**Score**: 0
```python
X_val, X_test, Y_val, Y_test = train_test_split(X_val_test, Y_val_test, ...)
```

The reason why "Best accuracy" and "Accuracy" are different is that you get "Best accuracy" from X_val/Y_val:

```python
accuracy_score(Y_val, Y_val_pred)
```

and "Accuracy" from X_test/Y_test:

```python
accuracy = accuracy_score(Y_test, Y_test_pred)
```

Since the data in X_val/Y_val and X_test/Y_test are different, you shouldn't expect the same score on both of them.
