TFIDFVectorizer making concatenated word tokens

Question


I am using the Cranfield Dataset to make an indexer and query processor. For that purpose I am using TfidfVectorizer to tokenize the data. But when I check the vocabulary after fitting TfidfVectorizer, there are a lot of tokens formed by the concatenation of two words.

I am using the following code to achieve it:

import re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
#reading the data
with open('cran.all', 'r') as f:
    content_string = ""
    content = [line.replace('\n', '') for line in f.readlines()]
content = content_string.join(content)
doc = re.split('.I\s[0-9]{1,4}', content)
f.close()
#some data cleaning
del doc[0]
doc = [re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
X = vectorizer.fit_transform(doc)
print(vectorizer.vocabulary_)

Here are a few examples of what I get when I print the vocabulary:

'freevibration': 7222, 'slendersharp': 15197, 'frequentlyapproximated': 7249, 'notapplicable': 11347, 'rateof': 13727, 'itsvalue': 9443, 'speedflow': 15516, 'movingwith': 11001, 'speedsolution': 15531, 'centerof': 3314, 'hypersoniclow': 8230, 'neice': 11145, 'rutkowski': 14444, 'chann': 3381, 'layerapproximations': 9828, 'probsteinhave': 13353, 'thishypersonic': 17752

This does not happen when I use a small dataset. How can I prevent it?

Answer 1

Score: 1


My guess would be that the issue is caused by this line:

content = [line.replace('\n', '') for line in f.readlines()]

When replacing the line breaks, the last word of line 1 gets concatenated with the first word of line 2. And of course this happens for every line, so you get a lot of these. The solution is super simple: instead of replacing the line breaks with nothing (i.e. just removing them), replace them with a whitespace:

content = [line.replace('\n', ' ') for line in f.readlines()]

(note the space between the quotes)
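
Below is a minimal sketch of the effect, using a made-up two-line snippet instead of the Cranfield file (the toy strings and variable names are illustrative only):

from sklearn.feature_extraction.text import TfidfVectorizer

# Two physical lines as they might appear in the raw file: "free" ends the
# first line and "vibration" starts the second.
lines = ["the free\n", "vibration analysis\n"]

joined_bad = "".join(line.replace('\n', '') for line in lines)    # 'the freevibration analysis'
joined_good = "".join(line.replace('\n', ' ') for line in lines)  # 'the free vibration analysis '

vec = TfidfVectorizer()
print(sorted(vec.fit([joined_bad]).vocabulary_))   # ['analysis', 'freevibration', 'the']
print(sorted(vec.fit([joined_good]).vocabulary_))  # ['analysis', 'free', 'the', 'vibration']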

Answer 2

Score: -1


Use a custom tokenizer in order to avoid merging two words into one.

def custom_tokenizer(text):
    return text.split()

with open('cran.all', 'r') as f:
    content_string = ""
    content = [line.replace('\n', '') for line in f.readlines()]
content = content_string.join(content)
doc = re.split('.I\s[0-9]{1,4}', content)
f.close()
del doc[0]
doc = [re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]
vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
X = vectorizer.fit_transform(doc)
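
For reference, here is a minimal, self-contained sketch of how such a custom tokenizer plugs into TfidfVectorizer, using toy strings in place of the parsed cran.all documents (the example data is made up); recent scikit-learn versions may warn that token_pattern is not used when a tokenizer is supplied:

from sklearn.feature_extraction.text import TfidfVectorizer

def custom_tokenizer(text):
    # Plain whitespace split instead of the default regex token pattern.
    return text.split()

# Toy documents standing in for the parsed Cranfield entries.
docs = ["free vibration of slender sharp cones",
        "hypersonic low density flow"]

vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer, lowercase=True)
X = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))
# ['cones', 'density', 'flow', 'free', 'hypersonic', 'low', 'of', 'sharp', 'slender', 'vibration']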
