March 4, 2023, 03:56:27

sklearn.impute.SimpleImputer: Unable to fill in the most common value for a list of dataframe columns

Question

I have a list of columns of a dataframe that have NAs in them (below). The dtype of all these columns is str.

X_train_objects = ['HomePlanet',
 'Destination',
 'Name',
 'Cabin_letter',
 'Cabin_number',
 'Cabin_letter_2']

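I would like to use SimpleImputer to fill in the NAs with the most common value (the mode). However, I am getting ValueError: Columns must be same length as key. What is the reason for this? My code seems correct to me.

Sample of the dataframe (called X_train), in which the Destination column is entirely NaN: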

{
 'PassengerId': {47: '0045_02',
  128: '0138_02',
  139: '0152_01',
  347: '0382_01',
  430: '0462_01'},
 'HomePlanet': {47: 'Mars',
  128: 'Earth',
  139: 'Earth',
  347: nan,
  430: 'Earth'},
 'CryoSleep': {47: 1, 128: 0, 139: 0, 347: 0, 430: 1},
 'Destination': {47: nan, 128: nan, 139: nan, 347: nan, 430: nan},
 'Age': {47: 19.0, 128: 34.0, 139: 41.0, 347: 23.0, 430: 50.0},
 'VIP': {47: 0, 128: 0, 139: 0, 347: 0, 430: 0},
 'RoomService': {47: 0.0, 128: 0.0, 139: 0.0, 347: 348.0, 430: 0.0},
 'FoodCourt': {47: 0.0, 128: 22.0, 139: 0.0, 347: 0.0, 430: 0.0},
 'ShoppingMall': {47: 0.0, 128: 0.0, 139: 0.0, 347: 0.0, 430: 0.0},
 'Spa': {47: 0.0, 128: 564.0, 139: 0.0, 347: 4.0, 430: 0.0},
 'VRDeck': {47: 0.0, 128: 207.0, 139: 607.0, 347: 368.0, 430: 0.0},
 'Name': {47: 'Mass Chmad',
  128: 'Monah Gambs',
  139: 'Andan Estron',
  347: 'Blanie Floydendley',
  430: 'Ronia Sosanturney'},
 'Transported': {47: 1, 128: 0, 139: 0, 347: 0, 430: 0},
 'Cabin_letter': {47: 'F', 128: 'E', 139: 'F', 347: 'G', 430: 'G'},
 'Cabin_number': {47: '10', 128: '5', 139: '32', 347: '64', 430: '67'},
 'Cabin_letter_2': {47: 'P', 128: 'P', 139: 'P', 347: 'P', 430: 'S'}
}

My code:

imputer = SimpleImputer(missing_values=np.NaN, strategy='most_frequent')
X_train[X_train_objects] = imputer.fit_transform(X_train[X_train_objects].values.reshape(-1,1))[:,0]

Answer 1

Score: 2

UPDATE:

Based on feedback from the OP, the approach that gives the desired result is:

X_train[X_train_objects] = imputer.fit_transform(X_train[X_train_objects].values)
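
As a minimal sketch of the updated approach, the snippet below runs it on a hypothetical three-column miniature of X_train (the data and the reduced column list are assumptions made for illustration). Passing the 2D block straight to fit_transform lets SimpleImputer compute the mode separately for each column, and the result has one output column per key, so the assignment lines up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature of X_train; only two string columns are used here.
X_train = pd.DataFrame(
    {
        "HomePlanet": ["Mars", "Earth", "Earth", np.nan, "Earth"],
        "Cabin_letter": ["F", np.nan, "F", "G", "F"],
        "Age": [19.0, 34.0, 41.0, 23.0, 50.0],
    }
)
X_train_objects = ["HomePlanet", "Cabin_letter"]

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Keep the block two-dimensional: shape (n_rows, n_columns).
block = X_train[X_train_objects].to_numpy(dtype=object)
filled = imputer.fit_transform(block)

# One output column per key, so the assignment is well-formed.
X_train[X_train_objects] = filled
print(X_train)
```

One caveat, as far as I know: a column that is entirely NaN at fit time (as Destination is in the five-row sample) is dropped from the output by default, which would again leave fewer columns than keys. Recent scikit-learn releases offer keep_empty_features=True to retain such columns; check the documentation for your installed version.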

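ORIGINAL ANSWER:

Here is what the code in your question does:

It starts with X_train[X_train_objects], which has shape (5, 6).
It converts this to a NumPy array (via values) and reshapes it with .reshape(-1,1) into a single column of shape (30, 1).
It passes that column to imputer.fit_transform, which returns a result of the same shape as its input, and [:,0] then turns the result into a 1D array of length 30.
It attempts to assign this 1D array of length 30 to X_train[X_train_objects], which (as mentioned above) has shape (5, 6), that is, only 6 columns.

This gives rise to the error: ValueError: Columns must be same length as key

What I believe you intend is to process the values originally found in X_train[X_train_objects] and then overwrite the original values with the processed ones. To do this, I think the following should work: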
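
The shape changes described above can be reproduced without scikit-learn. The sketch below uses a hypothetical 2x3 frame (the column names a, b, c and the data are made up) and only follows the shapes, leaving the imputation step out. It then attempts the same kind of assignment and records whether pandas rejects it.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame standing in for X_train[X_train_objects].
cols = ["a", "b", "c"]
df = pd.DataFrame(
    [["x", "y", np.nan], ["x", np.nan, "z"]], columns=cols, dtype=object
)

block = df[cols].to_numpy()    # 2D block, one column per key
column = block.reshape(-1, 1)  # every value stacked into a single column
flat = column[:, 0]            # 1D array holding all values

print(block.shape, column.shape, flat.shape)

# Assigning a 1D array of 6 values to 3 keys does not line up.
try:
    df[cols] = flat
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```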

X_train[X_train_objects] = (
    imputer.fit_transform(X_train[X_train_objects].values.reshape(-1,1))[:,0]
    .reshape(-1,len(X_train_objects)) )
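
A point worth checking before using the reshape-based version: because all values are stacked into one column before imputation, the mode is computed over every column pooled together rather than per column. The sketch below, on hypothetical two-column data, compares the pooled result with the per-column result obtained by passing the 2D block directly.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: "P" is the most common value overall,
# while "Earth" is the most common value within HomePlanet.
cols = ["HomePlanet", "Cabin_letter_2"]
df = pd.DataFrame(
    {
        "HomePlanet": ["Mars", "Earth", "Earth", np.nan, "Earth"],
        "Cabin_letter_2": ["P", "P", "P", "P", "S"],
    }
)
values = df[cols].to_numpy(dtype=object)

# Reshape-based version: one mode for all values pooled together.
pooled = (
    SimpleImputer(missing_values=np.nan, strategy="most_frequent")
    .fit_transform(values.reshape(-1, 1))[:, 0]
    .reshape(-1, len(cols))
)

# 2D version: one mode per column.
per_column = SimpleImputer(
    missing_values=np.nan, strategy="most_frequent"
).fit_transform(values)

print(pooled[3, 0], per_column[3, 0])
```

In this sketch the missing HomePlanet entry is filled with a cabin letter in the pooled version, which is presumably not what is wanted and is consistent with the update at the top of this answer.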
