TFIDFVectorizer making concatenated word tokens
Question
I am using the Cranfield Dataset to make an Indexer and Query Processor. For that purpose I am using TfidfVectorizer to tokenize the data. But after using TfidfVectorizer, when I check the vocabulary, there are a lot of tokens formed by the concatenation of two words.
I am using the following code to achieve it:
import re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
# reading the data
with open('cran.all', 'r') as f:
    content_string = ""
    content = [line.replace('\n', '') for line in f]   # assumed reconstruction: the original right-hand side of this line was lost; it removed the newline from each line without adding a space
    content = content_string.join(content)             # join everything into one string with no separator
    doc = re.split(r'.I\s[0-9]{1,4}', content)         # split the corpus on the Cranfield document-ID markers
f.close()   # redundant: the with-block already closes the file
# some data cleaning
doc = [lines.strip() for lines in doc]                 # assumed reconstruction: the original cleaning step was lost
del doc[0]
doc = [re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]   # keep letters only
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
X = vectorizer.fit_transform(doc)
print(vectorizer.vocabulary_)
I have attached below a few examples of what I obtain when I print the vocabulary:
'freevibration': 7222, 'slendersharp': 15197, 'frequentlyapproximated': 7249, 'notapplicable': 11347, 'rateof': 13727, 'itsvalue': 9443, 'speedflow': 15516, 'movingwith': 11001, 'speedsolution': 15531, 'centerof': 3314, 'hypersoniclow': 8230, 'neice': 11145, 'rutkowski': 14444, 'chann': 3381, 'layerapproximations': 9828, 'probsteinhave': 13353, 'thishypersonic': 17752
When I use a small amount of data, this does not happen. How can I prevent it from happening?
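As a quick diagnostic (not part of the original post), vocabulary entries that are exactly the concatenation of two other vocabulary entries can be listed after fitting; this is only a heuristic and will also flag some legitimate compound words:

# heuristic check on the fitted vectorizer from the code above:
# flag vocabulary terms that split cleanly into two other vocabulary terms,
# a strong hint that words were glued together before tokenization
vocab = set(vectorizer.vocabulary_)
merged = [t for t in vocab
          if any(t[:i] in vocab and t[i:] in vocab for i in range(3, len(t) - 2))]
print(sorted(merged)[:20])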
Answer 1
Score: 1
My guess would be that the issue is caused by this line:
content = [line.replace('\n', '') for line in f]
When replacing line breaks, the last word of line 1 is concatenated with the first word of line 2. And of course this happens for every line, so you get a lot of these. The solution is super simple: instead of replacing the line breaks with nothing (i.e. just removing them), replace them with a whitespace:
content = [line.replace('\n', ' ') for line in f]
(note the space between the quotation marks: ' ' instead of '')
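A minimal, self-contained sketch of the effect described above (the two sample lines are only illustrative, not taken from the dataset):

# joining lines after stripping the newline glues the last word of one line
# to the first word of the next; replacing the newline with a space keeps
# the word boundary intact
lines = ["experimental investigation of the aerodynamics of a\n",
         "wing in a slipstream .\n"]

no_space = "".join(line.replace('\n', '') for line in lines)
print(no_space)    # ...aerodynamics of awing in a slipstream .   <- 'a' and 'wing' fused

with_space = "".join(line.replace('\n', ' ') for line in lines)
print(with_space)  # ...aerodynamics of a wing in a slipstream .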
Answer 2
Score: -1
Use a custom tokenizer in order to avoid merging two words into one.
import re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

def custom_tokenizer(text):
    return text.split()

with open('cran.all', 'r') as f:
    content_string = ""
    content = [line.replace('\n', '') for line in f]   # assumed reconstruction, as in the question
    content = content_string.join(content)
    doc = re.split(r'.I\s[0-9]{1,4}', content)
f.close()

doc = [lines.strip() for lines in doc]                 # assumed reconstruction, as in the question
del doc[0]
doc = [re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]
vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
X = vectorizer.fit_transform(doc)
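After refitting, the learned vocabulary can be inspected directly to see whether the merged tokens from the question are still present; a small check along these lines (the suspect list is taken from the question's output) makes that visible. Note that if the file-reading step still joins lines without a space, a whitespace tokenizer by itself cannot split the already merged words.

# look up a few of the concatenated tokens reported in the question;
# an empty list means they are gone, otherwise the merging happens
# earlier (while joining the lines), not inside the tokenizer
suspects = ['freevibration', 'rateof', 'centerof']
print([t for t in suspects if t in vectorizer.vocabulary_])

# browse part of the learned vocabulary
print(sorted(vectorizer.vocabulary_)[:20])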