Question

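I have a language classification dataset with 22,000 entries: 1,000 for each of 22 languages.
How can I write a classification model using simple linear regression so that, instead of one model picking among the values 0, 1, 2, …, 21, there are 22 models each choosing between 1 and 0 (correct and incorrect)? What is the best way to rewrite my target variable y?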

import numpy as np 
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')
label_encoder = preprocessing.LabelEncoder()
df['language'] = label_encoder.fit_transform(df['language'])
x = np.array(df['Text'])
y = np.array(df['language'])
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)
model = LinearRegression()
model.fit(X_train, y_train)

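Answer 1

Score: 1

You are describing one-vs-rest (OvR) multiclass classification. To do this, one-hot encode the "language" column, then iterate over the new columns, fitting one model per dummy column: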

import numpy as np 
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')

# One-hot encoding: one 0/1 column per language
y = pd.get_dummies(df['language'])

print(y.columns.tolist())

x = np.array(df['Text'])
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(x)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)

# You'll now have 22 models, one for each language
models = []
scores = []
for language in y.columns:
    model = LinearRegression()
    model.fit(X_train, y_train[language])
    score = model.score(X_test, y_test[language])
    models.append(model)
    scores.append(score)

# Now, 'scores' is a list of the models' R^2 scores
for language, score in zip(y.columns, scores):
    print(f"Model for language '{language}' R^2 score: {score}")

This works, but I advise using LogisticRegression instead of LinearRegression, because the predictions of LinearRegression are not bounded to the range between 0 and 1.
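To illustrate the point about unbounded outputs, here is a minimal sketch on made-up one-feature data (the arrays `X_toy`, `y_toy` and `X_new` are hypothetical and not part of the question's dataset). It fits LinearRegression to a 0/1 target and shows that the raw predictions can fall below 0 or above 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy one-feature data with a binary (0/1) target; values are made up for illustration.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

reg = LinearRegression().fit(X_toy, y_toy)

# Predict on two training points plus one point outside the training range.
X_new = np.array([[0.0], [3.0], [5.0]])
preds = reg.predict(X_new)
print(preds)  # raw regression outputs, not probabilities
```

The outputs can still be ranked across the per-language models, but they should not be read as probabilities.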

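LogisticRegression predicts only the class labels 0 and 1, i.e. whether the text is that specific language or not. You can also get the estimated probability that it is that language via predict_proba.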
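As a rough sketch of what a single per-language model might look like with LogisticRegression, the example below builds a binary target for one language and reads both the hard labels and the probabilities. The four-sentence corpus and the names `texts`, `languages` and `is_english` are assumptions made for illustration, not the question's data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus standing in for the real dataset.
texts = [
    "the cat sat on the mat",
    "a dog ran in the park",
    "le chat dort sur le lit",
    "un chien court dans le parc",
]
languages = np.array(["English", "English", "French", "French"])

X_small = TfidfVectorizer().fit_transform(texts)

# Binary target for one language: 1 if the row is English, 0 otherwise.
is_english = (languages == "English").astype(int)

clf = LogisticRegression().fit(X_small, is_english)

labels = clf.predict(X_small)       # hard 0/1 decisions
proba = clf.predict_proba(X_small)  # columns follow clf.classes_, here [0, 1]
print(labels)
print(proba[:, 1])                  # estimated probability of "English"
```

Repeating this for each language and picking the language whose model gives the highest probability would be one way to turn the binary models into a single prediction.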

Instead of fitting multiple models by hand, you can wrap your classifier in OneVsRestClassifier.
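A minimal sketch of the OneVsRestClassifier approach, assuming a small made-up corpus in place of the real dataset (the `texts` and `languages` lists are hypothetical). The wrapper takes the original string labels directly, so no manual one-hot encoding is needed, and it fits one binary LogisticRegression per language internally:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Tiny made-up corpus with three languages, two sentences each.
texts = [
    "the cat sat on the mat",
    "a dog ran in the park",
    "le chat dort sur le lit",
    "un chien court dans le parc",
    "el gato duerme en la cama",
    "un perro corre por el parque",
]
languages = ["English", "English", "French", "French", "Spanish", "Spanish"]

X_small = TfidfVectorizer().fit_transform(texts)

# One binary LogisticRegression is trained per class behind the scenes.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X_small, languages)

print(len(ovr.estimators_))  # number of underlying binary models
print(ovr.classes_)          # class labels, in sorted order
predicted = ovr.predict(X_small)
print(predicted)
```

With the question's data, `X_train` and the original (not one-hot encoded) language column would take the place of `X_small` and `languages`.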
