Using simple linear regression for a multiclass classification task

Question

I have a language classification dataset with 22,000 entries, 1,000 for each of 22 languages.
Could someone advise how to write a classification model using simple linear regression so that, instead of one model picking one of the values 0, 1, 2, …, 21, there are 22 models each choosing between 1 and 0 (correct and incorrect)? How should I rewrite my target y?

import numpy as np 
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')
label_encoder = preprocessing.LabelEncoder()
df['language']= label_encoder.fit_transform(df['language'])
x = np.array(df['Text'])
y = np.array(df['language'])
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)
model = LinearRegression()
model.fit(X_train, y_train)

Answer 1

Score: 1

You are describing one-vs-rest (OvR) multiclass classification. To do this, one-hot encode the "language" column and then iterate over the new columns, fitting one model for each dummy column:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')

# One-hot encoding
y = pd.get_dummies(df['language'])

print(y.columns.tolist())

x = np.array(df['Text'])
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(x)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)

# You'll now have 22 models, one for each language
models = []
scores = []
for language in y.columns:
    model = LinearRegression()
    model.fit(X_train, y_train[language])
    score = model.score(X_test, y_test[language])
    models.append(model)
    scores.append(score)

# Now, 'scores' is a list of R^2 scores of the models
for language, score in zip(y.columns, scores):
    print(f"Model for language '{language}' R^2 score: {score}")
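
To turn the per-language scores into a single prediction, one common approach is to pick the language whose model gives the highest score. The sketch below illustrates this on a tiny made-up dataset so that it runs without downloading anything; the toy texts, the dictionary of models and the `predict_language` helper are illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Toy stand-in for the real dataset (hypothetical texts and labels).
toy = pd.DataFrame({
    "Text": ["the cat sat on the mat", "the dog ate the bone",
             "le chat est sur le tapis", "le chien mange un os",
             "el gato duerme en la casa", "el perro come en la casa"],
    "language": ["English", "English", "French", "French", "Spanish", "Spanish"],
})

# One 0/1 target column per language.
y = pd.get_dummies(toy["language"], dtype=float)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(toy["Text"])

# One linear regression per language (one-vs-rest).
models = {lang: LinearRegression().fit(X, y[lang]) for lang in y.columns}

def predict_language(X_new):
    """Hypothetical helper: return the language whose model scores highest."""
    scores = np.column_stack([models[lang].predict(X_new) for lang in y.columns])
    return y.columns[scores.argmax(axis=1)].to_numpy()

train_pred = predict_language(X)
new_pred = predict_language(vectorizer.transform(["le chat mange"]))
print(train_pred)
print(new_pred)
```

The same idea should carry over to the real data by replacing `toy` with `df` and fitting on `X_train` only.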

This works, but I advise you to use LogisticRegression instead of LinearRegression, because the predictions of LinearRegression are not bounded to the interval from 0 to 1.

LogisticRegression predicts only the labels 0 and 1, i.e. whether the text is in that specific language or not. In addition, predict_proba gives you the probability that the text is in that language.
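
As a rough illustration of the difference, the sketch below fits both estimators to a single "is it French?" target. The toy texts and the `is_french` column are made up for the example; with the real data you would use one dummy column from `y_train` instead.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy stand-in for the real dataset (hypothetical texts and labels).
toy = pd.DataFrame({
    "Text": ["the cat sat on the mat", "the dog ate the bone",
             "le chat est sur le tapis", "le chien mange un os",
             "el gato duerme en la casa", "el perro come en la casa"],
    "language": ["English", "English", "French", "French", "Spanish", "Spanish"],
})

X = TfidfVectorizer().fit_transform(toy["Text"])

# Binary target for one language: 1 if French, 0 otherwise.
is_french = (toy["language"] == "French").astype(int)

linear = LinearRegression().fit(X, is_french)
logistic = LogisticRegression().fit(X, is_french)

# Real-valued scores; nothing constrains them to [0, 1].
linear_scores = linear.predict(X)

# Hard class labels: every entry is either 0 or 1.
logistic_labels = logistic.predict(X)

# Probability of the positive class (French); always within [0, 1].
logistic_proba = logistic.predict_proba(X)[:, 1]

print(linear_scores)
print(logistic_labels)
print(logistic_proba)
```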

Instead of fitting the models yourself in a loop, you could wrap your classifier with OneVsRestClassifier, which fits one binary model per class for you.
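
A minimal sketch of that wrapper, again on made-up toy data (the texts, labels and the pipeline layout are assumptions for illustration). Note that OneVsRestClassifier takes the original label column directly, so no manual one-hot encoding is needed here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in for df['Text'] and df['language'] (hypothetical data).
texts = ["the cat sat on the mat", "the dog ate the bone",
         "le chat est sur le tapis", "le chien mange un os",
         "el gato duerme en la casa", "el perro come en la casa"]
labels = ["English", "English", "French", "French", "Spanish", "Spanish"]

# TF-IDF features followed by one binary logistic regression per language.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(texts, labels)

ovr = clf[-1]                      # the fitted OneVsRestClassifier step
n_models = len(ovr.estimators_)    # one underlying binary model per class
pred = clf.predict(["le chat mange"])

print(n_models, list(ovr.classes_))
print(pred)
```

With the real dataset, `n_models` would be 22, one per language.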

huangapple
  • Posted on May 17, 2023, 16:46:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76270164.html