Using simple linear regression for a multiclass classification task
Question
I have a language classification dataset with 22000 entries, 1000 for each of 22 languages.
Can someone please advise how I could write a classification model using simple linear regression, so that instead of one model picking one of 22 values (0, 1, 2, …), there are 22 models, each choosing between 1 and 0 (correct and incorrect)? How should I rewrite my target y?
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')
label_encoder = preprocessing.LabelEncoder()
df['language'] = label_encoder.fit_transform(df['language'])
x = np.array(df['Text'])
y = np.array(df['language'])
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)
model = LinearRegression()
model.fit(X_train, y_train)
Answer 1
Score: 1
You are describing one-vs-rest (OvR) multiclass classification. To do this, you need to one-hot encode the column "language" and then iterate over the new columns, fitting one model for each dummy column:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')
# One-hot encoding
y = pd.get_dummies(df['language'])
print(y.columns.tolist())
x = np.array(df['Text'])
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)
# You'll now have 22 models, one for each language
models = []
scores = []
for language in y.columns:
    # Fit one binary (0/1) regression target per language
    model = LinearRegression()
    model.fit(X_train, y_train[language])
    score = model.score(X_test, y_test[language])
    models.append(model)
    scores.append(score)
# Now, 'scores' is a list of R^2 scores of the models
for language, score in zip(y.columns, scores):
    print(f"Model for language '{language}' R^2 score: {score}")
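To turn the per-language regressors into a single language prediction, one common approach is to stack their outputs and take the language whose model returns the largest value. The sketch below illustrates this on a tiny hypothetical toy dataset (the `toy` frame is an assumption standing in for df_languages.csv, and for brevity it predicts on the training texts rather than a held-out split):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Hypothetical toy data standing in for df_languages.csv
toy = pd.DataFrame({
    "Text": [
        "the cat sat on the mat", "the dog ate the bone",
        "le chat est sur le tapis", "le chien mange un os",
        "el gato está en la alfombra", "el perro come un hueso",
    ],
    "language": ["English", "English", "French", "French", "Spanish", "Spanish"],
})

# One 0/1 target column per language
y = pd.get_dummies(toy["language"]).astype(float)
X = TfidfVectorizer().fit_transform(toy["Text"])

# One LinearRegression model per language
models = {lang: LinearRegression().fit(X, y[lang]) for lang in y.columns}

# Stack the raw outputs into shape (n_samples, n_languages)
raw = np.column_stack([models[lang].predict(X) for lang in y.columns])

# The predicted language is the column with the largest output
predicted = y.columns[raw.argmax(axis=1)]
print(list(predicted))
```

Note that the raw outputs are unbounded real numbers, so they are only compared against each other here and are not interpreted as probabilities.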
This works, but I advise you to use LogisticRegression instead of LinearRegression, because the output of LinearRegression is not bounded to the range between 0 and 1.
LogisticRegression predicts only the labels 0 and 1, i.e. whether the text is in that specific language or not. Additionally, you can calculate the probability that the text is in that specific language.
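As a rough sketch of the same per-language loop with LogisticRegression, the example below fits one binary classifier per dummy column and reads the probability of the positive class from `predict_proba`. The `toy` frame and the sample sentence are hypothetical stand-ins for the real dataset, and the default hyperparameters are an assumption, not a tuned choice:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for df_languages.csv
toy = pd.DataFrame({
    "Text": [
        "the cat sat on the mat", "the dog ate the bone",
        "le chat est sur le tapis", "le chien mange un os",
        "el gato está en la alfombra", "el perro come un hueso",
    ],
    "language": ["English", "English", "French", "French", "Spanish", "Spanish"],
})

y = pd.get_dummies(toy["language"]).astype(int)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(toy["Text"])

# One binary classifier per language (1 = this language, 0 = any other)
models = {lang: LogisticRegression().fit(X, y[lang]) for lang in y.columns}

# Probability that a new text belongs to each language
new_X = vectorizer.transform(["le chat mange un os"])
probabilities = {
    lang: model.predict_proba(new_X)[0, 1]  # column 1 is the positive class
    for lang, model in models.items()
}
best = max(probabilities, key=probabilities.get)
print(probabilities, best)
```

Because the 22 classifiers are trained independently, their probabilities do not have to sum to 1 across languages; picking the largest one is a simple way to get a single answer.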
Instead of fitting multiple models, you could wrap your classifier with OneVsRestClassifier.
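A minimal sketch of the OneVsRestClassifier variant is shown below. It takes the original string labels directly, so no manual one-hot encoding is needed. The `toy` frame is again a hypothetical stand-in for the real dataset, and combining the vectorizer and classifier with `make_pipeline` is an illustrative choice rather than something the answer prescribes:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df_languages.csv
toy = pd.DataFrame({
    "Text": [
        "the cat sat on the mat", "the dog ate the bone",
        "le chat est sur le tapis", "le chien mange un os",
        "el gato está en la alfombra", "el perro come un hueso",
    ],
    "language": ["English", "English", "French", "French", "Spanish", "Spanish"],
})

# OneVsRestClassifier fits one LogisticRegression per class internally
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(toy["Text"], toy["language"])

ovr = clf.named_steps["onevsrestclassifier"]
n_models = len(ovr.estimators_)  # one fitted binary model per language
prediction = clf.predict(["el perro come"])
print(n_models, prediction)
```

On the full dataset, `ovr.estimators_` would hold 22 fitted binary models, one per language.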