May 28, 2023, 13:32:50

Pandas Logistic Regression mixed type not supported?

Question

I'm building a logistic regression model on a simple dataset in Python.

My goal is to predict whether or not someone survived.
After cleaning the dataset and dropping NaN values as well as string columns, I used the following code to convert every column to float64 (the resulting column dtypes are shown below):

titanic_data['Survived'] = titanic_data['Survived'].astype(float)
titanic_data['Sibling/Spouse'] = titanic_data['Sibling/Spouse'].astype(float)
titanic_data['Parents/Children'] = titanic_data['Parents/Children'].astype(float)
titanic_data['male'] = titanic_data['male'].astype(float)
titanic_data['Q'] = titanic_data['Q'].astype(float)
titanic_data['S'] = titanic_data['S'].astype(float)
titanic_data[2] = titanic_data[2].astype(float)
titanic_data[3] = titanic_data[3].astype(float)

Column dtypes after the conversion (output of titanic_data.dtypes):

Survived            float64
Age                 float64
Sibling/Spouse      float64
Parents/Children    float64
Fare                float64
male                float64
Q                   float64
S                   float64
2                   float64
3                   float64
dtype: object

When I run my logistic regression code (shown below), I get an error saying that a mixed type of string and non-string is not supported.

My regression code:

# Logistic regression
# Split the dataset

x = titanic_data.drop("Survived", axis=1)
y = titanic_data["Survived"]

from sklearn.model_selection import train_test_split
# train_test_split returns the splits in the order x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

But as you can see, I've converted all my columns to the same data type, so why am I getting this error, and what can I do to fix it?

EDIT: The error message I got was posted as a screenshot, which is not reproduced here.

Answer 1

Score: 2

The error you are seeing is not about the column contents but about the column names. In your frame, the columns 2 and 3 have integer names while the others have string names. Beware of naming columns with non-strings (e.g. 0/1/2/3 for quantile markers or one-hot-encoded levels): scikit-learn's sanity checks expect column names to be strings. To be safe, try

X.columns = X.columns.astype(str)
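
To see why matching dtypes do not rule out this error, here is a minimal sketch on a made-up four-column frame (the frame and its values are assumptions for illustration, not the asker's data). It inspects the Python types of the column labels before and after the conversion above.

```python
import pandas as pd

# Hypothetical miniature of the question's frame: two string-named columns
# plus integer-named columns 2 and 3.
titanic_data = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "male": [1.0, 0.0, 0.0],
    2: [0.0, 0.0, 1.0],
    3: [1.0, 0.0, 0.0],
})

# Every column holds float64 values, yet the labels mix str and int.
label_types = {type(name).__name__ for name in titanic_data.columns}
print(label_types)

# Converting the labels to strings leaves the data itself untouched.
titanic_data.columns = titanic_data.columns.astype(str)
fixed_types = {type(name).__name__ for name in titanic_data.columns}
print(fixed_types, list(titanic_data.columns))
```

Checking the label types this way before fitting is a cheap way to catch the problem, since the dtypes listing only describes the values.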

To avoid such problems (rather than fixing them afterwards), use more canonical ways to manipulate and encode data, such as pd.get_dummies. Here is a fully working example:


# Fetch Titanic

from sklearn.datasets import fetch_openml
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
dropped_cols = ['boat', 'body', 'home.dest', 'name', 'cabin', 'embarked', 'ticket']
X.drop(dropped_cols, axis=1, inplace=True)

# Encode (one-hot for categories) & impute (naive)

import pandas as pd
X = pd.get_dummies(X, columns=['sex', 'pclass'], drop_first=True)
y = y.astype(float)
X = X.fillna(0)

# Logistic regression

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
logreg.score(X, y) # 0.7868601986249045

Here, get_dummies one-hot-encoded the columns and named the results with prefixes, so the column names remain strings. X.columns looks like this:

Index(['age', 'sibsp', 'parch', 'fare', 'sex_male', 'pclass_2.0',
       'pclass_3.0'],
      dtype='object')
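
For contrast, the sketch below shows one plausible way integer labels such as 2 and 3 can arise. This is an assumption, since the question does not show its encoding step: calling pd.get_dummies on a bare Series uses the category values themselves as labels, while the columns= form prefixes them with the source column name. The small frame and its Pclass and Fare columns are made up for illustration.

```python
import pandas as pd

# Hypothetical data: an integer-coded passenger class and a fare.
df = pd.DataFrame({"Pclass": [1, 2, 3, 3], "Fare": [71.3, 13.0, 7.9, 8.1]})

# Encoding the bare Series: the labels are the integer class values.
bare = pd.get_dummies(df["Pclass"], drop_first=True)
print(list(bare.columns))

# Encoding through `columns=`: each label gets the "Pclass_" prefix,
# so every column name in the result is a string.
prefixed = pd.get_dummies(df, columns=["Pclass"], drop_first=True)
print(list(prefixed.columns))
```

Joining the bare result back onto a frame with string-named columns would produce the mix of label types that triggers the error, which is why the prefixed form is the safer default.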
