Pandas Logistic Regression mixed type not supported?
Question
I'm building a logistic regression model on a simple dataset in Python.
My goal is to predict whether or not someone survived.
After cleaning the dataset and dropping NaN values and string columns, I used the following code to convert every column to float64:
titanic_data['Survived'] = titanic_data['Survived'].astype(float)
titanic_data['Sibling/Spouse'] = titanic_data['Sibling/Spouse'].astype(float)
titanic_data['Parents/Children'] = titanic_data['Parents/Children'].astype(float)
titanic_data['male'] = titanic_data['male'].astype(float)
titanic_data['Q'] = titanic_data['Q'].astype(float)
titanic_data['S'] = titanic_data['S'].astype(float)
titanic_data[2] = titanic_data[2].astype(float)
titanic_data[3] = titanic_data[3].astype(float)
The resulting column dtypes:
Survived float64
Age float64
Sibling/Spouse float64
Parents/Children float64
Fare float64
male float64
Q float64
S float64
2 float64
3 float64
dtype: object
When I run my logistic regression code (shown below), I get the error "mixed type of string and non-string is not supported".
My regression code:
# Logistic regression
# Split the dataset
x = titanic_data.drop("Survived", axis=1)
y = titanic_data["Survived"]
from sklearn.model_selection import train_test_split
x_train, y_train, x_test, y_test = train_test_split(x,y,test_size=0.3,random_state=1)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
As you can see, I've converted every column to the same data type, so why am I getting this error, and how can I fix it?
Answer 1
Score: 2
The error you are seeing is not about the column contents but about the column names. Beware of naming columns with non-strings (e.g. 0/1/2/3 for quantile markers or one-hot-encoded levels). Your frame mixes string names such as 'Age' with the integer names 2 and 3, and scikit-learn's sanity checks expect all column names to be strings. To be safe, try
X.columns = X.columns.astype(str)
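Applied to the question's code, the fix might look like the sketch below. The data is a synthetic stand-in (random values, not the real Titanic set), and `max_iter=1000` is an added assumption to give the solver room on unscaled features. Note also that `train_test_split` returns its pieces in the order X_train, X_test, y_train, y_test, so the unpacking here differs from the question's.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned frame: string names plus integer names 2 and 3.
rng = np.random.default_rng(0)
titanic_data = pd.DataFrame({
    "Survived": rng.integers(0, 2, size=100).astype(float),
    "Age": rng.uniform(1, 80, size=100),
    "male": rng.integers(0, 2, size=100).astype(float),
    2: rng.integers(0, 2, size=100).astype(float),
    3: rng.integers(0, 2, size=100).astype(float),
})

# Every dtype is float64, yet the column *labels* mix str and int.
print(sorted({type(c).__name__ for c in titanic_data.columns}))

# Make every column label a string before handing the frame to scikit-learn.
titanic_data.columns = titanic_data.columns.astype(str)

x = titanic_data.drop("Survived", axis=1)
y = titanic_data["Survived"]

# train_test_split returns X_train, X_test, y_train, y_test, in that order.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1
)

# max_iter is raised as a precaution (an assumption, not part of the question).
logreg = LogisticRegression(max_iter=1000)
logreg.fit(x_train, y_train)
print(logreg.score(x_test, y_test))
```

Because the labels are random, the printed accuracy carries no meaning; the point is only that the fit runs once all column names are strings.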
To avoid such problems (rather than fixing them afterwards), use more canonical ways to manipulate and encode data, such as pd.get_dummies or similar tools. Here is a fully working example:
# Fetch Titanic
from sklearn.datasets import fetch_openml
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
dropped_cols = ['boat', 'body', 'home.dest', 'name', 'cabin', 'embarked', 'ticket']
X.drop(dropped_cols, axis=1, inplace=True)
# Encode (one-hot for categories) and impute (naive: fill missing values with 0)
import pandas as pd
X = pd.get_dummies(X, columns=['sex', 'pclass'], drop_first=True)
y = y.astype(float)
X = X.fillna(0)
# Logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
logreg.score(X, y)  # 0.7868601986249045
Here, get_dummies one-hot-encoded the columns and named the new columns with prefixes, so every column name remains a string. X.columns looks like this:
Index(['age', 'sibsp', 'parch', 'fare', 'sex_male', 'pclass_2.0',
'pclass_3.0'],
dtype='object')
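To see why the prefix matters, the sketch below contrasts two ways of calling `pd.get_dummies`. The mini-frame and the column name `Pclass` are hypothetical, and it is only a guess that the question's integer columns 2 and 3 were produced by encoding a bare Series, since that step is not shown.

```python
import pandas as pd

# Hypothetical mini-frame; "Pclass" is an assumed name for the passenger-class column.
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0], "Pclass": [3, 1, 3, 2]})

# Encoding a bare Series: the new columns are named after the levels, i.e. integers.
from_series = pd.get_dummies(df["Pclass"], drop_first=True)
print(list(from_series.columns))

# Concatenating them onto string-named columns yields mixed label types.
mixed = pd.concat([df.drop("Pclass", axis=1), from_series], axis=1)
print(sorted({type(c).__name__ for c in mixed.columns}))

# Encoding via the DataFrame with columns=... prefixes each level with the column name.
prefixed = pd.get_dummies(df, columns=["Pclass"], drop_first=True)
print(list(prefixed.columns))
```

The first form leaves integer labels that then need `astype(str)`; the second keeps every label a string from the start.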