What is the correct order in data preprocessing stage for Machine Learning?

Question

I am trying to create some sort of step-by-step guide/cheat sheet for myself on how to correctly go over the data preprocessing stage for Machine Learning.

Let's imagine we have a binary classification problem.
Would the strategy below work, or do I have to change or modify the order of some of the steps? And maybe something should be added or removed?

1. LOAD DATA

import pandas as pd    

df = pd.read_csv("data.csv")

2. SPLIT DATA - I understand that to prevent "data leakage", we MUST split the data into a training set (work with it) and a testing set (pretend it does not exist).

from sklearn.model_selection import train_test_split

# stratify = df['target'] if the class proportions are imbalanced, so the training and testing sets keep the same proportions after splitting. Note: stratify takes the column values, not the column name.
train_df, test_df = train_test_split(df, test_size = 0.33, random_state = 42, stratify = df['target'])

3. EDA ON TRAINING DATA - Is it correct to look at the training set only or should we do EDA before splitting? If we assume the Test set doesn't exist, then we should not care what is there, right?

train_df.info()
train_df.describe()
# + Plots etc.

4. OUTLIERS ON TRAINING DATA - If we have to scale the data, the mean (average) is very sensitive to outliers, therefore we have to take care of them at the beginning. Also, if we decide to fill null numerical features with the mean, outliers may be a problem in that case.

import matplotlib.pyplot as plt
import seaborn as sns 

# Check distributions
sns.diplot(train_df)    
sns.boxplot(train_df)   
train_df.corr()    # Correlation between all features and label
train_df.corr()["target"].sort_values()
sns.scatterplot(x = "Column X", y = 'target', data = train_df)

train_df.describe() # above 75% + 1.5 * (75% - 25%) and below 25% - 1.5 * (75% - 25%)
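
A minimal sketch of the IQR rule from the comment above ("Column X" is a placeholder column name):

q1 = train_df["Column X"].quantile(0.25)
q3 = train_df["Column X"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Keep only rows inside the IQR fences (i.e. drop the outliers)
train_df = train_df[(train_df["Column X"] >= lower) & (train_df["Column X"] <= upper)]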

5. MISSING VALUES ON TRAINING DATA - We can't have null values. We either remove them or fill them in. This step should be taken care of at the beginning.

train_df.info()
train_df.isnull().sum() # or train_df.isna().sum()
# Show the rows with Null values
train_df[train_df["Column"].isnull()]      
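
A minimal sketch of both options ("Column" is a placeholder; pick one per column):

# Option 1: remove rows with nulls in that column
train_df = train_df.dropna(subset = ["Column"])
# Option 2: fill nulls (the median is more robust to outliers than the mean)
train_df["Column"] = train_df["Column"].fillna(train_df["Column"].median())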

6. FEATURE ENGINEERING ON TRAINING DATA - Should this step be taken care of at the beginning as well? I think so, because we can create features that might need to be scaled.

import numpy as np

# If some columns (not the target) are correlated with each other, we should delete one of them, or make some sort of blending.
train_df.corr()
train_df = train_df.drop("1 of Correlated X Column", axis = 1)

# For normally distributed data, the skewness should be about 0. A skewness value > 0 means there is more weight in the right tail of the distribution.
# We should try to have a normal distribution in the columns
train_df["Not Skewed Column"] = np.log(train_df["Skewed Column"] + 1)
train_df["Not Skewed Column"].hist(figsize = (20,5))
plt.show()

7. CATEGORICAL DATA - We can't have object (string) columns in the data frame; models need numeric input.

from sklearn.preprocessing import OneHotEncoder     # Just an example

# Create X and y variables
X_train = train_df.drop('target', axis = 1)
y_train = np.where(train_df['target'] == 'yes', 1, 0)

# Create the one hot encoder
onehot = OneHotEncoder(handle_unknown = 'ignore')

# Apply one hot encoding to categorical columns 
encoded_columns = onehot.fit_transform(X_train.select_dtypes(include = 'object')).toarray()

X_train = X_train.select_dtypes(exclude = 'object')
X_train[onehot.get_feature_names_out()] = encoded_columns

8. IMBALANCED DATA - Good to have the same or similar number of observations in the target column.

from imblearn.over_sampling import SMOTE      # Just an example

# Create the SMOTE class
sm = SMOTE(random_state = 42)

# Resample the training set only, to balance the dataset
X_train, y_train = sm.fit_resample(X_train, y_train)
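
A lighter alternative to resampling (my addition, not from the original post): many sklearn estimators accept a class_weight parameter that reweights the classes in the loss instead of generating synthetic samples.

from sklearn.linear_model import RidgeClassifier

# Same intent as oversampling, but without modifying the data
weighted_model = RidgeClassifier(class_weight = 'balanced')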

9. SCALE DATA - Should we scale the target column in the Regression task?

# Brings mean close to 0 and std to 1. Formula = (x - mean) / std
from sklearn.preprocessing import StandardScaler      # Just an example

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)    # X_test we don't fit, only transform!
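
On the target-scaling question: if you do decide to scale a regression target, one leak-safe option is sklearn's TransformedTargetRegressor, which scales y at fit time and inverse-transforms predictions automatically (a sketch with an assumed Ridge regressor; not needed for this classification example):

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# The transformer is applied to y internally; predict() returns values on the original scale
reg = TransformedTargetRegressor(regressor = Ridge(), transformer = StandardScaler())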

10. PRINCIPAL COMPONENT ANALYSIS (PCA) - REDUCING DIMENSIONALITY - Should data be scaled before applying PCA? (Yes: PCA picks directions of maximum variance, so unscaled features with large ranges would dominate the components.)

# Example: n_components = 50. Say the input has 100 X features; after applying PCA, the output will have 50 X features.
# Why not use PCA all the time? We lose the ability to explain what each value is, because each component is a combination of a whole bunch of features.
# We will not be able to look at feature importance, trees, etc. We use it when we need to.
# If we are able to train the model with all features, then great. If we can't, we can apply PCA, but be ready to lose the ability to explain what is driving the machine learning model.

from sklearn.decomposition import PCA     # Just an example

pca = PCA(n_components = 50)  # Just an Example
scaled_X_train = pca.fit_transform(scaled_X_train)    # X_test we don't fit, only transform!

11. MODEL, FIT, EVALUATE, PREDICT

from sklearn.linear_model import RidgeClassifier          # Just an Example  
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

model = RidgeClassifier()
model.fit(scaled_X_train, y_train)

# HERE we should create and / or execute transformation function that will take test_df as input and will return scaled_X_test and y_test
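
# A minimal sketch of that transformation, reusing ONLY the transformers fitted on the
# training data above (onehot, scaler, pca); column names mirror the training code
X_test = test_df.drop('target', axis = 1)
y_test = np.where(test_df['target'] == 'yes', 1, 0)

encoded_test = onehot.transform(X_test.select_dtypes(include = 'object')).toarray()   # transform only, never fit
X_test = X_test.select_dtypes(exclude = 'object')
X_test[onehot.get_feature_names_out()] = encoded_test

scaled_X_test = pca.transform(scaler.transform(X_test))   # transform only, never fit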

y_pred = model.predict(scaled_X_test)

# Evaluate model - Calculate Classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"RidgeClassifier model scores Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}")

confusion_matrix(y_test, y_pred, labels = [1,0])

12. SAVE MODEL

import joblib       # Just an example

# Save Model
joblib.dump(model, 'best_model.joblib')  
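
Loading it back later is symmetric (usage sketch):

# Load the saved model and reuse it for predictions
loaded_model = joblib.load('best_model.joblib')
# loaded_model.predict(scaled_X_test)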

Answer 1

Score: 1
I would suggest the following steps:

  1. EDA (learn about the data)
  2. Finding correlations
  3. Removing unnecessary features
  4. Preprocessing the data (such as outlier removal, encoding data)
  5. Split features and target variables (X and y)
  6. Train test split
  7. Perform scaling (fitting the scaler before the train test split will lead to data leakage; see the Pipeline sketch after this list)
  8. Choose the algorithm depending on the use case (tree-based models aren't affected by outliers or differently scaled data, so you can skip those steps when selecting such models)
  9. Depending on the use case, select the metrics to judge your model's performance (confusion matrix, F1 score, precision, recall, RMSE, MSE)
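
A minimal sketch of steps 6-7 with the scaler fitted inside a sklearn Pipeline, so it only ever sees training data (X, y and the LogisticRegression classifier are assumptions, not from the answer):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# fit() fits the scaler on X_train only; score() just transforms X_test
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))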

Answer 2

Score: 0
  1. Data analysis and visualization (swarmplot, boxplot, ...)

  2. Correlation (sns.heatmap())

  3. Check outliers

  4. Preprocessing (MinMaxScaler, StandardScaler)

  5. Split X and y

  6. Feature importances (feature_importances_)

  7. train_test_split: X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.15, random_state=2)

  8. Choosing an algorithm: log_reg, xgboost, svm...

  9. Check metrics

      # Example
      X = df[['age', 'anaemia', 'creatinine_phosphokinase',
            'ejection_fraction', 'high_blood_pressure', 'platelets',
            'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time', 'diabetes']]
      y = df['DEATH_EVENT']

      from sklearn.model_selection import train_test_split
      from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
      X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.15, random_state=2, stratify=y)

      from xgboost import XGBClassifier
      model_XGB = XGBClassifier(learning_rate=0.01, max_depth=4, n_estimators=100)
      model_XGB.fit(X_train, y_train)
      Y_pred_XGB = model_XGB.predict(X_test)

      cmXG = confusion_matrix(y_test, Y_pred_XGB)
    
