How do I find relevant documents for a query?

Question

I am working on a project where I have to find the documents relevant to a query, one by one. First I calculated the TF and IDF for every word in every document. Then I multiplied the TF and IDF and stored each term and its corresponding TF-IDF score for a particular document in a List. Here is the class named Tfidf that calculates TF and IDF:

public double TF(String[] document, String term) {
    double value = 0;                 // calculate term frequency for all terms
    for (String s : document) {
        if (s.equalsIgnoreCase(term)) {
            tfmap.put(s, tfmap.getOrDefault(term, 0) + 1);
            for (Map.Entry entry : tfmap.entrySet()) {
                value = (int) entry.getValue();
            }
        }
    }
    return value / document.length;
}

public double idf(List alldocument, String term) {
    double b = alldocument.size();
    double count = 0;
    for (int i = 0; i < alldocument.size(); i++) {
        String[] f = alldocument.get(i).toString().replaceAll("[^a-zA-Z0-9 ]", " ").trim().replaceAll(" +", " ").toLowerCase().split(" ");
        for (String ss : f) {
            if (ss.equalsIgnoreCase(term)) {
                count++;
                break;
            }
        }
    }
    return 1 + Math.log(b / count);
}}

Here is the code where I multiplied the TF and IDF:

List<String> alldocument = new ArrayList<>();
List tfidfVector = new ArrayList<>();

public void TfIdf() {
    double tf;
    double idf;
    double tfidf = 0;
    for (int i = 0; i < alldocument.size(); i++) {
        double[] tfidfvector = new double[allterm.size()];  // allterm is all unique words in all documents
        for (String terms : allterm) {
            String[] file = alldocument.get(i).replaceAll("[^a-zA-Z0-9 ]", " ").trim().replaceAll(" +", " ").toLowerCase().split(" ");
            int count = 0;
            tf = new Tfidf().TF(file, terms);
            idf = new Tfidf().idf(alldocument, terms);
            tfidf = tf * idf;
            tfidfvector[count] = tfidf;
            count++;
        }
        tfidfVector.add(tfidfvector);
    }
}

Can anyone tell me how to compute the TF-IDF vector for the query if my query is "life and learning"? And how can I calculate the cosine similarity between the query and each of the documents to find out how similar they are?

Answer 1

Score: 0

The tf-idf score is used in conjunction with the cosine similarity between the query and a document. At its core, that means computing a dot product between two vectors (cosine similarity is that dot product divided by the lengths of the two vectors). One vector represents the query "life and learning". The other vector represents one of the documents. To find the most relevant document, you need to calculate the cosine similarity with all of the documents (or ideally, only the ones that contain at least one of the query words).
In the vector space model, each dimension of the vector represents one distinct word. So in this particular example, the only three relevant dimensions are those representing "life", "and", and "learning". Theoretically there are other dimensions corresponding to every other known word, but the query's weight for those is 0, so they contribute nothing to the dot product and can be skipped over.
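
As a rough illustration of the vector space idea, here is a minimal sketch of cosine similarity over sparse vectors, assuming each vector is stored as a Map from term to weight. The class name CosineSimilarity and the example weights are made up for illustration; they are not taken from the question's code.

```java
import java.util.Map;

class CosineSimilarity {
    // Dot product of two sparse vectors; a term missing from b counts as weight 0.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            sum += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        return Math.sqrt(dot(v, v));
    }

    // Cosine similarity; defined here as 0 when either vector is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double denom = norm(a) * norm(b);
        return denom == 0.0 ? 0.0 : dot(a, b) / denom;
    }

    public static void main(String[] args) {
        // Hypothetical weights, only to show the call.
        Map<String, Double> query = Map.of("life", 1.0, "and", 1.0, "learning", 1.0);
        Map<String, Double> doc = Map.of("life", 0.4, "learning", 0.3, "school", 0.2);
        System.out.println(cosine(query, doc));
    }
}
```

Only terms present in both maps contribute to the dot product, which is why the other dimensions can be skipped there. The document's length still depends on all of its terms.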

There are several possible variations of exactly how to apply the weighting. But if we stick to the simplest...
You can consider the query vector to be <life: 1, and: 1, learning: 1>, and the document vector to be <word1: tf_word1 * idf_word1, word2: tf_word2 * idf_word2, ..., wordN: tf_wordN * idf_wordN>.
Then you just compute the dot product between these two vectors. For words not appearing in the query, you would just end up multiplying by 0 and adding that to the score, which means you can ignore all those words. So you only need to consider the terms in the query. For each term in the query, multiply the tf in the query (or even just 1 if desired) by the tf-idf score of that term in the document, and sum the results.
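
The scoring described above could be sketched roughly like this, assuming each document's tf-idf scores are already available as a Map from term to score. QueryScorer, queryVector, score, and rank are hypothetical names; the tokenization is copied from the question so that query terms match document terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QueryScorer {
    // Same cleanup as in the question: strip punctuation, lowercase, split on spaces.
    static String[] tokenize(String text) {
        return text.replaceAll("[^a-zA-Z0-9 ]", " ").trim().replaceAll(" +", " ").toLowerCase().split(" ");
    }

    // Query vector with the raw count of each query term as its weight.
    static Map<String, Double> queryVector(String query) {
        Map<String, Double> vector = new HashMap<>();
        for (String term : tokenize(query)) {
            if (!term.isEmpty()) {
                vector.merge(term, 1.0, Double::sum);
            }
        }
        return vector;
    }

    // Dot product restricted to the query terms; all other dimensions contribute 0.
    static double score(Map<String, Double> query, Map<String, Double> docTfIdf) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            sum += e.getValue() * docTfIdf.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    // Document indices ordered from highest to lowest score.
    static List<Integer> rank(String query, List<Map<String, Double>> docs) {
        Map<String, Double> q = queryVector(query);
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            order.add(i);
        }
        order.sort((x, y) -> Double.compare(score(q, docs.get(y)), score(q, docs.get(x))));
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical tf-idf maps for two documents.
        List<Map<String, Double>> docs = List.of(
                Map.of("school", 1.0),
                Map.of("life", 0.5, "learning", 0.25));
        System.out.println(rank("life and learning", docs));
    }
}
```

This ranks by the plain dot product, the simplest variant. Dividing each score by the document vector's length would turn it into a cosine ranking, since the query's length is the same for every document.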

There are many possible weighting variations you can use. You've already talked about TF and IDF. I also see you've written some code to normalize the term frequency by the length of the document, which is also fine. To learn more, you can refer to this section of the Introduction to Information Retrieval textbook: https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html#sec:querydocweighting (although it might be too dense without reading some of the earlier sections first).
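
For reference, the particular weighting used in the question (term count divided by document length, and idf as 1 + ln(N / df)) might be written compactly as below. This is only a sketch with a hypothetical class name Weighting, and it assumes the documents have already been tokenized into String arrays. One assumption added here is returning 0 for a term that appears in no document, since dividing by a document frequency of 0 would otherwise produce an infinite idf.

```java
import java.util.List;

class Weighting {
    // Term frequency normalized by document length.
    static double tf(String[] document, String term) {
        if (document.length == 0) {
            return 0.0;
        }
        int count = 0;
        for (String s : document) {
            if (s.equalsIgnoreCase(term)) {
                count++;
            }
        }
        return (double) count / document.length;
    }

    // idf = 1 + ln(N / df); returns 0 when no document contains the term (an added guard).
    static double idf(List<String[]> documents, String term) {
        int df = 0;
        for (String[] doc : documents) {
            for (String s : doc) {
                if (s.equalsIgnoreCase(term)) {
                    df++;
                    break; // count each document at most once
                }
            }
        }
        return df == 0 ? 0.0 : 1.0 + Math.log((double) documents.size() / df);
    }

    static double tfIdf(List<String[]> documents, String[] document, String term) {
        return tf(document, term) * idf(documents, term);
    }

    public static void main(String[] args) {
        String[] d0 = {"life", "is", "life"};
        String[] d1 = {"learning", "never", "stops"};
        List<String[]> docs = List.of(d0, d1);
        System.out.println(tfIdf(docs, d0, "life"));
    }
}
```

The guard matters for queries: a query word that never occurs in the collection should simply add nothing to the score.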

FYI, specifically regarding the code you've posted: you're currently storing the tf-idf scores in an array inside an ArrayList, indexed only by position. You also have an issue where count is reset to 0 every time through the inner loop, so every score is written to index 0, which doesn't seem right. But ignoring that, you will want an easy way to look up the tf-idf information for a specific term. A hash table (for example a HashMap in Java) is more natural for that than an ArrayList.
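
One possible way to restructure the storage, as a sketch rather than a drop-in replacement: keep one HashMap per document, keyed by term, so no positional counter is needed at all. TfIdfIndex and build are hypothetical names, and the code assumes the documents were already tokenized and lowercased the way the question does it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TfIdfIndex {
    // Builds one term -> tf-idf map per document, using tf = count / length
    // and idf = 1 + ln(N / df) as in the question.
    static List<Map<String, Double>> build(List<String[]> documents) {
        Map<String, Integer> df = new HashMap<>();           // documents containing each term
        List<Map<String, Integer>> counts = new ArrayList<>(); // raw term counts per document
        for (String[] doc : documents) {
            Map<String, Integer> c = new HashMap<>();
            for (String term : doc) {
                c.merge(term, 1, Integer::sum);
            }
            for (String term : c.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
            counts.add(c);
        }

        int n = documents.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            int length = documents.get(i).length;
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.get(i).entrySet()) {
                double tf = (double) e.getValue() / length;
                double idf = 1.0 + Math.log((double) n / df.get(e.getKey()));
                vector.put(e.getKey(), tf * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        String[] d0 = {"life", "and", "learning"};
        String[] d1 = {"life", "life"};
        List<Map<String, Double>> vectors = build(List.of(d0, d1));
        // Look up a score by term instead of by array position.
        System.out.println(vectors.get(0).getOrDefault("learning", 0.0));
    }
}
```

With this layout, looking up a query term in a document is a single getOrDefault call, and terms the document does not contain take up no space.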

huangapple
  • Posted on 2020-09-11 01:29:55
  • Original link: https://go.coder-hub.com/63834863.html