TFIDFVectorizer making concatenated word tokens

Question


I am using the Cranfield Dataset to make an indexer and query processor. For that purpose I am using TfidfVectorizer to tokenize the data. But when I check the vocabulary after fitting TfidfVectorizer, there are a lot of tokens formed by the concatenation of two words.

I am using the following code to achieve it:

import re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
#reading the data
with open('cran.all', 'r') as f:
    content_string = ""
    content = [line.replace('\n', '') for line in f.readlines()]
content = content_string.join(content)
doc = re.split('.I\s[0-9]{1,4}', content)
f.close()
#some data cleaning
del doc[0]
doc = [re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
X = vectorizer.fit_transform(doc)
print(vectorizer.vocabulary_)

Here are a few examples of what I get when I print the vocabulary:

'freevibration': 7222, 'slendersharp': 15197, 'frequentlyapproximated': 7249, 'notapplicable': 11347, 'rateof': 13727, 'itsvalue': 9443, 'speedflow': 15516, 'movingwith': 11001, 'speedsolution': 15531, 'centerof': 3314, 'hypersoniclow': 8230, 'neice': 11145, 'rutkowski': 14444, 'chann': 3381, 'layerapproximations': 9828, 'probsteinhave': 13353, 'thishypersonic': 17752

This does not happen when I use a small dataset. How can I prevent it?

Answer 1

Score: 1


My guess would be that the issue is caused by this line:

content = [line.replace('\n', '') for line in f.readlines()]

When replacing the line breaks, the last word of line 1 gets concatenated with the first word of line 2. And of course this happens for every line, so you get a lot of these. The solution is super simple: instead of replacing the line breaks with nothing (i.e. just removing them), replace them with a whitespace:

content = [line.replace('\n', ' ') for line in f.readlines()]

(note the space between the quotes)
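
Below is a minimal sketch of the effect, using a made-up two-line snippet instead of the Cranfield file (the toy strings and variable names are illustrative only):

from sklearn.feature_extraction.text import TfidfVectorizer

# Two physical lines as they might appear in the raw file: "free" ends the
# first line and "vibration" starts the second.
lines = ["the free\n", "vibration analysis\n"]

joined_bad = "".join(line.replace('\n', '') for line in lines)    # 'the freevibration analysis'
joined_good = "".join(line.replace('\n', ' ') for line in lines)  # 'the free vibration analysis '

vec = TfidfVectorizer()
print(sorted(vec.fit([joined_bad]).vocabulary_))   # ['analysis', 'freevibration', 'the']
print(sorted(vec.fit([joined_good]).vocabulary_))  # ['analysis', 'free', 'the', 'vibration']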

Answer 2

Score: -1


Use a custom tokenizer in order to avoid merging two words into one.

def custom_tokenizer(text):
    return text.split()

with open('cran.all', 'r') as f:
    content_string = ""
    content = [line.replace('\n', '') for line in f.readlines()]
content = content_string.join(content)
doc = re.split('.I\s[0-9]{1,4}', content)
f.close()
del doc[0]
doc = [re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]
vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
X = vectorizer.fit_transform(doc)
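
For reference, here is a minimal, self-contained sketch of how such a custom tokenizer plugs into TfidfVectorizer, using toy strings in place of the parsed cran.all documents (the example data is made up); recent scikit-learn versions may warn that token_pattern is not used when a tokenizer is supplied:

from sklearn.feature_extraction.text import TfidfVectorizer

def custom_tokenizer(text):
    # Plain whitespace split instead of the default regex token pattern.
    return text.split()

# Toy documents standing in for the parsed Cranfield entries.
docs = ["free vibration of slender sharp cones",
        "hypersonic low density flow"]

vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer, lowercase=True)
X = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))
# ['cones', 'density', 'flow', 'free', 'hypersonic', 'low', 'of', 'sharp', 'slender', 'vibration']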
