sklearn.impute.SimpleImputer: Unable to fill in the most common value for a list of dataframe columns

Question

I have a list of dataframe columns that contain NAs (below). The dtype of all these columns is str.

X_train_objects = ['HomePlanet',
                   'Destination',
                   'Name',
                   'Cabin_letter',
                   'Cabin_number',
                   'Cabin_letter_2']

I would like to use SimpleImputer to fill in the NAs with the most common value (the mode). However, I am getting ValueError: Columns must be same length as key. What is the reason for this? My code seems correct to me.

Sample of the dataframe (called X_train), showing rows where the Destination column is NaN:

{'PassengerId': {47: '0045_02',
  128: '0138_02',
  139: '0152_01',
  347: '0382_01',
  430: '0462_01'},
 'HomePlanet': {47: 'Mars',
  128: 'Earth',
  139: 'Earth',
  347: nan,
  430: 'Earth'},
 'CryoSleep': {47: 1, 128: 0, 139: 0, 347: 0, 430: 1},
 'Destination': {47: nan, 128: nan, 139: nan, 347: nan, 430: nan},
 'Age': {47: 19.0, 128: 34.0, 139: 41.0, 347: 23.0, 430: 50.0},
 'VIP': {47: 0, 128: 0, 139: 0, 347: 0, 430: 0},
 'RoomService': {47: 0.0, 128: 0.0, 139: 0.0, 347: 348.0, 430: 0.0},
 'FoodCourt': {47: 0.0, 128: 22.0, 139: 0.0, 347: 0.0, 430: 0.0},
 'ShoppingMall': {47: 0.0, 128: 0.0, 139: 0.0, 347: 0.0, 430: 0.0},
 'Spa': {47: 0.0, 128: 564.0, 139: 0.0, 347: 4.0, 430: 0.0},
 'VRDeck': {47: 0.0, 128: 207.0, 139: 607.0, 347: 368.0, 430: 0.0},
 'Name': {47: 'Mass Chmad',
  128: 'Monah Gambs',
  139: 'Andan Estron',
  347: 'Blanie Floydendley',
  430: 'Ronia Sosanturney'},
 'Transported': {47: 1, 128: 0, 139: 0, 347: 0, 430: 0},
 'Cabin_letter': {47: 'F', 128: 'E', 139: 'F', 347: 'G', 430: 'G'},
 'Cabin_number': {47: '10', 128: '5', 139: '32', 347: '64', 430: '67'},
 'Cabin_letter_2': {47: 'P', 128: 'P', 139: 'P', 347: 'P', 430: 'S'}}

My Code:

imputer = SimpleImputer(missing_values=np.NaN, strategy='most_frequent')
X_train[X_train_objects] = imputer.fit_transform(X_train[X_train_objects].values.reshape(-1,1))[:,0]

Answer 1

Score: 2

UPDATE:

Based on feedback from the OP, what gives the desired result is to pass the 2D array to the imputer directly, with no reshaping:

X_train[X_train_objects] = imputer.fit_transform(X_train[X_train_objects].values)
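
For illustration, here is a minimal, self-contained sketch of the updated approach. The toy frame and its values are made up (only the column names come from the question), and to_numpy(dtype=object) is used in place of .values purely so the strings reach the imputer as a plain object array; that substitution is my assumption, not part of the OP's code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column names mirror the question, values are invented.
X_train = pd.DataFrame({
    "HomePlanet": ["Mars", "Earth", "Earth", np.nan, "Earth"],
    "Destination": ["A", np.nan, "B", "B", np.nan],
    "Age": [19.0, 34.0, 41.0, 23.0, 50.0],
})
X_train_objects = ["HomePlanet", "Destination"]

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Keep the array 2D: one imputer column per dataframe column,
# so each column's NaNs are filled with that column's own mode.
values = X_train[X_train_objects].to_numpy(dtype=object)
imputed = imputer.fit_transform(values)

# The result has the same (n_rows, n_columns) shape as the selection,
# so it can be assigned straight back.
X_train[X_train_objects] = imputed
```

One caveat, as far as I know: by default SimpleImputer drops any column that is entirely missing at fit time. In the five-row sample from the question, Destination is all NaN, so the result would have 5 columns instead of 6 and the assignment would likely fail again. scikit-learn 1.2 and later offer keep_empty_features=True to retain such columns.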

ORIGINAL ANSWER:

Here's what the code in your question does:

  • selects X_train[X_train_objects], which has shape (5, 6) in the sample
  • converts it to a NumPy array (via values) and reshapes it into a single column of shape (30, 1) using .reshape(-1,1)
  • passes this to imputer.fit_transform, which returns a result of the same shape as its input; the trailing [:,0] then flattens that result into a 1D array of length 30
  • attempts to assign this 1D array of length 30 to X_train[X_train_objects], which (as mentioned above) has shape (5, 6) and so expects one column of values for each of its 6 keys

This gives rise to the error: ValueError: Columns must be same length as key
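
The shapes involved can be checked with a small sketch. The two-column, four-row frame below is hypothetical (so the flattened length is 8 rather than 30), and to_numpy(dtype=object) stands in for .values as an assumption on my part:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with 4 rows and 2 string columns.
df = pd.DataFrame({
    "HomePlanet": ["Mars", "Earth", np.nan, "Earth"],
    "Cabin_letter": ["F", np.nan, "F", "G"],
})
cols = ["HomePlanet", "Cabin_letter"]

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Same sequence of steps as the code in the question.
column_vector = df[cols].to_numpy(dtype=object).reshape(-1, 1)  # shape (8, 1)
imputed = imputer.fit_transform(column_vector)                  # shape (8, 1)
flat = imputed[:, 0]                                            # shape (8,)

# Assigning a 1D array of length 8 to a 2-column selection fails.
error = None
try:
    df[cols] = flat
except ValueError as exc:
    error = exc

print(column_vector.shape, imputed.shape, flat.shape)
print(repr(error))
```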

What I believe you intend is to impute the values in X_train[X_train_objects] and then overwrite the original values with the imputed ones. To do this, I think the following should work:

X_train[X_train_objects] = (
    imputer.fit_transform(X_train[X_train_objects].values.reshape(-1,1))[:,0]
    .reshape(-1,len(X_train_objects)) )
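
A possible reason this version may still not be what is wanted (this is my reading, not something stated in the thread): flattening everything into one column means the imputer sees a single feature, so every NaN is filled with the mode of all values pooled together rather than the mode of its own column. The sketch below uses a hypothetical two-column frame to compare the two behaviours:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: "x" is the most common value overall,
# but "y" is the most common value within column "b".
df = pd.DataFrame({
    "a": ["x", "x", "x", np.nan],
    "b": ["y", "y", np.nan, "z"],
})
cols = ["a", "b"]
values = df[cols].to_numpy(dtype=object)

# Reshape-based version: one pooled column, then restored to (n_rows, n_cols).
pooled = (
    SimpleImputer(strategy="most_frequent")
    .fit_transform(values.reshape(-1, 1))[:, 0]
    .reshape(-1, len(cols))
)

# Per-column version: the 2D array is passed as-is.
per_column = SimpleImputer(strategy="most_frequent").fit_transform(values)

print(pooled[2, 1])      # fill value chosen from all values pooled together
print(per_column[2, 1])  # fill value chosen from column "b" only
```

In this toy case the pooled version fills the gap in column "b" with a value that never occurs in that column, which is why keeping the array 2D looks like the safer choice.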

huangapple
  • Published on March 4, 2023, 03:56:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75631378.html