What is the correct order in data preprocessing stage for Machine Learning?

Question


I am trying to create some sort of step-by-step guide/cheat sheet for myself on how to correctly go over the data preprocessing stage for Machine Learning.

Let's imagine we have a binary classification problem.
Would the strategy below work, or do I have to change the order of some of the steps, and should anything be added or removed?

1. LOAD DATA

import pandas as pd    

df = pd.read_csv("data.csv")

2. SPLIT DATA - I understand that to prevent "data leakage", we MUST split the data into training (work with it) and testing (pretend it does not exist) sets.

from sklearn.model_selection import train_test_split

# stratify = df['target'] if the classes are imbalanced, so the training and testing sets keep the same class proportions after splitting.
train_df, test_df = train_test_split(df, test_size = 0.33, random_state = 42, stratify = df['target'])

3. EDA ON TRAINING DATA - Is it correct to look at the training set only or should we do EDA before splitting? If we assume the Test set doesn't exist, then we should not care what is there, right?

train_df.info()
train_df.describe()
# + Plots etc.

4. OUTLIERS ON TRAINING DATA - If we have to scale the data, the mean is very sensitive to outliers, so we have to take care of them early. Also, if we decide to fill null numerical features with the mean, outliers may be a problem there too.

import matplotlib.pyplot as plt
import seaborn as sns 

# Check distributions
sns.displot(train_df)
sns.boxplot(train_df)   
train_df.corr(numeric_only = True)    # Correlation between all numeric features and the label
train_df.corr(numeric_only = True)["target"].sort_values()    # assumes a numeric target column
sns.scatterplot(x = "Column X", y = 'target', data = train_df)

train_df.describe() # IQR rule: outliers lie above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR, where Q1 = 25%, Q3 = 75%
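
A minimal sketch of that IQR rule applied to one numeric column (the column name is hypothetical):

# Compute the IQR bounds on the training set
q1 = train_df["Column X"].quantile(0.25)
q3 = train_df["Column X"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Either drop the outlier rows...
train_df = train_df[train_df["Column X"].between(lower, upper)]
# ...or cap (winsorize) them instead of dropping:
# train_df["Column X"] = train_df["Column X"].clip(lower, upper)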

5. MISSING VALUES ON TRAINING DATA - We can't have null values. We either remove them or fill them in. This step should be taken care of early on.

train_df.info()
train_df.isnull().sum() # or train_df.isna().sum()
# Show the rows with Null values
train_df[train_df["Column"].isnull()]      

6. FEATURE ENGINEERING ON TRAINING DATA - Should this step also be taken care of early? I think so, because we might create features that then need to be scaled.

# If some feature columns (not the target) are highly correlated with each other, we should delete one of them, or make some sort of blending.
train_df.corr()
train_df = train_df.drop("1 of Correlated X Column", axis = 1)

import numpy as np

# For normally distributed data, the skewness should be about 0. A skewness value > 0 means there is more weight in the right tail of the distribution.
# We should try to get the columns close to a normal distribution.
train_df["Not Skewed Column"] = np.log(train_df["Skewed Column"] + 1)    # equivalent to np.log1p
train_df["Not Skewed Column"].hist(figsize = (20,5))
plt.show()

7. CATEGORICAL DATA - We can't have object (string) columns in the data frame we feed to the model, so they must be encoded.

from sklearn.preprocessing import OneHotEncoder     # Just an example

# Create X and y variables
X_train = train_df.drop('target', axis = 1)
y_train = np.where(train_df['target'] == 'yes', 1, 0)

# Create the one hot encoder
onehot = OneHotEncoder(handle_unknown = 'ignore')

# Apply one hot encoding to categorical columns 
encoded_columns = onehot.fit_transform(X_train.select_dtypes(include = 'object')).toarray()

X_train = X_train.select_dtypes(exclude = 'object')
X_train[onehot.get_feature_names_out()] = encoded_columns

8. IMBALANCED DATA - It's good to have the same or a similar number of observations in each class of the target column.

from imblearn.over_sampling import SMOTE      # Just an example

# Create the SMOTE class
sm = SMOTE(random_state = 42)

# Resample to balance the dataset (training set only - never resample the test set)
X_train, y_train = sm.fit_resample(X_train, y_train)

9. SCALE DATA - Should we scale the target column in the Regression task?

# Brings mean close to 0 and std to 1. Formula = (x - mean) / std
from sklearn.preprocessing import StandardScaler      # Just an example

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)    # X_test we don't fit, only transform!
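
On the regression question above: scaling the target is optional, but if you want it, one way is sklearn's TransformedTargetRegressor, which scales y during fit and inverts the scaling at predict time (a sketch, not part of this classification pipeline):

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Scales y internally on fit and un-scales predictions automatically
reg = TransformedTargetRegressor(regressor = Ridge(), transformer = StandardScaler())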

10. PRINCIPAL COMPONENT ANALYSIS (PCA) - REDUCING DIMENSIONALITY - Should data be scaled before applying PCA?

# Example: PCA with n_components = 50. If the input has 100 X features, the output after PCA will be 50 components.
# Why not use PCA all the time? We lose the ability to explain what each value means, because each component is a combination of a whole bunch of features.
# We will not be able to look at feature importance, trees, etc. We use it when we need to.
# If we are able to train the model with all features, then great. If we can't, we can apply PCA, but be ready to lose the ability to explain what is driving the machine learning model.

from sklearn.decomposition import PCA     # Just an example

pca = PCA(n_components = 50)  # Just an Example
scaled_X_train = pca.fit_transform(scaled_X_train)    # X_test we don't fit, only transform!

11. MODEL, FIT, EVALUATE, PREDICT

from sklearn.linear_model import RidgeClassifier          # Just an Example  
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

model = RidgeClassifier()
model.fit(scaled_X_train, y_train)

# HERE we should create and / or execute a transformation function that takes test_df as input and returns scaled_X_test and y_test, for example:
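
# A possible sketch, reusing the transformers fitted on the training set above (transform only, never fit, and no SMOTE on the test set):
X_test = test_df.drop('target', axis = 1)
y_test = np.where(test_df['target'] == 'yes', 1, 0)

encoded_test = onehot.transform(X_test.select_dtypes(include = 'object')).toarray()
X_test = X_test.select_dtypes(exclude = 'object')
X_test[onehot.get_feature_names_out()] = encoded_test

scaled_X_test = scaler.transform(X_test)      # transform only, no fit!
scaled_X_test = pca.transform(scaled_X_test)  # transform only, no fit!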

y_pred = model.predict(scaled_X_test)

# Evaluate model - Calculate Classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"RidgeClassifier model scores Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}")

confusion_matrix(y_test, y_pred, labels = [1,0])

12. SAVE MODEL

import joblib       # Just an example

# Save Model
joblib.dump(model, 'best_model.joblib')  
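
To load it back later (one line):

loaded_model = joblib.load('best_model.joblib')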

Answer 1 (score: 1)


I would suggest the following steps:

  1. EDA (learn about the data)
  2. Finding correlations
  3. Removing unnecessary features
  4. Preprocessing the data (such as outlier removal, encoding data)
  5. Splitting features and target variables (X and y)
  6. Train test split
  7. Performing scaling (scaling before the train test split would leak test-set information; see the sketch after this list)
  8. Choosing the algorithm depending on the use case (tree-based models aren't affected by outliers or differently scaled data, so you can skip those steps when selecting such models)
  9. Depending on the use case, selecting the metrics to judge your model's performance (confusion matrix, F1 score, precision, recall, RMSE, MSE)
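
A minimal sketch of steps 5-7 in that order, using a Pipeline so the scaler is fit on the training split only (assumes the df and 'target' column from the question, all-numeric features, and LogisticRegression as a placeholder model):

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Step 5: split features and target
X = df.drop('target', axis = 1)
y = df['target']

# Step 6: train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

# Step 7: a Pipeline fits the scaler on X_train only, so nothing leaks from the test set
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))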

Answer 2 (score: 0)

  1. Data analysis and visualization (swarmplot, boxplot...).

  2. Correlation (sns.heatmap()).

  3. Check outliers.

  4. Preprocessing (MinMaxScaler, StandardScaler).

  5. Split X and y.

  6. Feature importance (feature_importances_).

  7. train_test_split: X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.15, random_state=2)

  8. Choosing an algorithm: log_reg, xgboost, svm...

  9. Check metrics.

      # Example
      X = df[['age', 'anaemia', 'creatinine_phosphokinase',
              'ejection_fraction', 'high_blood_pressure', 'platelets',
              'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time', 'diabetes']]
      y = df['DEATH_EVENT']

      from sklearn.model_selection import train_test_split
      from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
      from sklearn import metrics
      X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.15, random_state=2, stratify=y)

      from xgboost import XGBClassifier
      model_XGB = XGBClassifier(learning_rate = 0.01, max_depth = 4, n_estimators = 100)
      model_XGB.fit(X_train, y_train)
      Y_pred_XGB = model_XGB.predict(X_test)

      cmXG = confusion_matrix(y_test, Y_pred_XGB)
