TFIDFVectorizer making concatenated word tokens

Question

I am using the Cranfield Dataset to make an Indexer and Query Processor. For that purpose I am using TfidfVectorizer to tokenize the data. But after fitting the TfidfVectorizer, when I check the vocabulary, there are a lot of tokens formed by the concatenation of two words.

I am using the following code to achieve it:

    import re
    from sklearn.feature_extraction import text
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np
    from nltk import word_tokenize
    from nltk.stem import WordNetLemmatizer

    # reading the data
    with open('cran.all', 'r') as f:
        content_string = ""
        content =
    content = content_string.join(content)
    doc = re.split('.I\s[0-9]{1,4}', content)
    f.close()

    # some data cleaning
    doc =
    del doc[0]
    doc = [re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]

    vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
    X = vectorizer.fit_transform(doc)
    print(vectorizer.vocabulary_)

Below are a few examples of what I obtain when I print the vocabulary:

'freevibration': 7222, 'slendersharp': 15197, 'frequentlyapproximated': 7249, 'notapplicable': 11347, 'rateof': 13727, 'itsvalue': 9443, 'speedflow': 15516, 'movingwith': 11001, 'speedsolution': 15531, 'centerof': 3314, 'hypersoniclow': 8230, 'neice': 11145, 'rutkowski': 14444, 'chann': 3381, 'layerapproximations': 9828, 'probsteinhave': 13353, 'thishypersonic': 17752

When I use a small amount of data, this does not happen. How can I prevent it?
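
For reference, one quick way to surface these glued-together tokens is to look at the longest entries in the fitted vocabulary. A minimal sketch (get_feature_names_out requires scikit-learn >= 1.0; older releases expose get_feature_names instead):

    # Run after X = vectorizer.fit_transform(doc)
    terms = vectorizer.get_feature_names_out()
    # The longest terms are usually the concatenated ones:
    print(sorted(terms, key=len, reverse=True)[:20])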

Answer 1

Score: 1

My guess would be that the issue is caused by this line:

    content =

When replacing line breaks, the last word of one line gets concatenated with the first word of the next, and this happens for every line, so you get a lot of these. The solution is simple: instead of replacing each line break with nothing (i.e. just removing it), replace it with a whitespace:

    content =

(note the space between the quotes)
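
For illustration, a minimal sketch of the change being described, assuming the file is read with f.readlines() and the newline at the end of each line is stripped with str.replace (the exact line in the original code may differ):

    with open('cran.all', 'r') as f:
        # Replacing '\n' with '' glues the last word of one line onto the
        # first word of the next, e.g. 'free\nvibration' -> 'freevibration':
        # content = [line.replace('\n', '') for line in f.readlines()]
        # Replacing it with a space keeps the two words separate:
        content = [line.replace('\n', ' ') for line in f.readlines()]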

Answer 2

Score: -1

Use a custom tokenizer in order to avoid merging two words into one.

    def custom_tokenizer(text):
        return text.split()

    with open('cran.all', 'r') as f:
        content_string = ""
        content =
    content = content_string.join(content)
    doc = re.split('.I\s[0-9]{1,4}', content)
    f.close()

    doc =
    del doc[0]
    doc = [re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]
    vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
    X = vectorizer.fit_transform(doc)
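
As a quick sanity check, the tokenizer simply splits on whitespace; a small usage example (the sample string is made up):

    print(custom_tokenizer("free vibration of slender sharp cones"))
    # -> ['free', 'vibration', 'of', 'slender', 'sharp', 'cones']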
