高效的数据结构用于在文档中查找共现。

huangapple go评论56阅读模式
英文:

Efficient data structure to look up co-occurrence within a document

问题

要求1

给定一组标记,比如(t1, t2)(实际上可能会有更多的标记),其中t1, t2属于V,我想找到所有与t1, t2在至少一个文档中共同出现的标记。也就是说,找到所有的标记u,满足u在Vt1, t2, u构成至少一个文档的子集

要求2

给定一组标记,找到包含这些标记的文档

是否有高效的数据结构来满足我的要求? "高效" 意味着避免迭代所有文档。

英文:

I have 10 million documents. Each document is a set of tokens (about 100 unique tokens) regardless of their frequency (bag-of-words). All unique tokens in all documents form the vocabulary V. The size of V is roughly 50000.

Requirement 1:

Given a set of tokens, say (t1, t2) (there may be more than two tokens in practice) where t1,t2 in V, I want to find all tokens that co-occur with t1,t2 in at least one document. That is to say, find all tokens u, satisfying u in V and t1,t2,u constituting a subset of at least one document.

Requirement 2:

Given a set of tokens, find documents that contain these tokens.

Is there any efficient data structure to fulfill my requirements? "Efficienct" means avoid iterating all documents.

答案1

得分: 2

以下是代码部分的翻译:

You can use a map (usually implemented as either a hashmap or a binary tree) that maps each token to the set of documents that contain this token.

当给定两个标记 t1, t2 时,计算两个对应的文档集合的交集。这给出了包含 t1 和 t2 的文档集合。然后返回包含这些文档中所有标记的并集。

In python:

from collections import defaultdict

def build_map(documents):
    m = defaultdict(set)
    for i, document in enumerate(documents):
        for token in document:
            m[token].add(i)
    return m

def cooccuring_with_pair(m, documents, t1, t2):
    doc_ids = set.intersection(m[t1], m[t2])
    return set().union(*(documents[i] for i in doc_ids))

Testing the python code with a small example:

documents = [set(s.split()) for s in ('the quick brown fox jumps over the lazy dog', 'the grey cat jumps in fright when hearing the dog', 'two brown mice are chased by a dog', 'the quick brown fox grows old and slow')]

m = build_map(documents)
print(m)
# defaultdict(<class 'set'>, {'brown': {0, 2, 3}, 'lazy': {0}, 'fox': {0, 3}, 'the': {0, 1, 3}, 'over': {0}, 'jumps': {0, 1}, 'dog': {0, 1, 2}, 'quick': {0, 3}, 'cat': {1}, 'hearing': {1}, 'fright': {1}, 'grey': {1}, 'when': {1}, 'in': {1}, 'mice': {2}, 'by': {2}, 'a': {2}, 'chased': {2}, 'two': {2}, 'are': {2}, 'and': {3}, 'slow': {3}, 'grows': {3}, 'old': {3}})

for t1, t2 in [('jumps', 'dog'), ('brown', 'fox'), ('cat', 'fox')]:
    print(t1, t2)
    print(cooccuring_with_pair(m, documents, t1, t2))
    print()

希望这对你有帮助。如果有其他翻译需求,请随时告诉我。

英文:

You can use a map (usually implemented as either a hashmap or a binary tree) that maps each token to the set of documents that contain this token.

When given two tokens t1,t2, compute the intersection of the two corresponding sets of documents. This gives you the set of documents that contain both t1 and t2. Then return the union of all the tokens contained in these documents.

In python:

from collections import defaultdict

def build_map(documents):
    m = defaultdict(set)
    for i, document in enumerate(documents):
        for token in document:
            m[token].add(i)
    return m

def cooccuring_with_pair(m, documents, t1, t2):
    doc_ids = set.intersection(m[t1], m[t2])
    return set().union(*(documents[i] for i in doc_ids))

Testing the python code with a small example:

documents = [set(s.split()) for s in (&#39;the quick brown fox jumps over the lazy dog&#39;, &#39;the grey cat jumps in fright when hearing the dog&#39;, &#39;two brown mice are chased by a dog&#39;, &#39;the quick brown fox grows old and slow&#39;)]

m = build_map(documents)
print(m)
# defaultdict(&lt;class &#39;set&#39;&gt;, {&#39;brown&#39;: {0, 2, 3}, &#39;lazy&#39;: {0}, &#39;fox&#39;: {0, 3}, &#39;the&#39;: {0, 1, 3}, &#39;over&#39;: {0}, &#39;jumps&#39;: {0, 1}, &#39;dog&#39;: {0, 1, 2}, &#39;quick&#39;: {0, 3}, &#39;cat&#39;: {1}, &#39;hearing&#39;: {1}, &#39;fright&#39;: {1}, &#39;grey&#39;: {1}, &#39;when&#39;: {1}, &#39;in&#39;: {1}, &#39;mice&#39;: {2}, &#39;by&#39;: {2}, &#39;a&#39;: {2}, &#39;chased&#39;: {2}, &#39;two&#39;: {2}, &#39;are&#39;: {2}, &#39;and&#39;: {3}, &#39;slow&#39;: {3}, &#39;grows&#39;: {3}, &#39;old&#39;: {3}})

for t1, t2 in [(&#39;jumps&#39;, &#39;dog&#39;), (&#39;brown&#39;, &#39;fox&#39;), (&#39;cat&#39;, &#39;fox&#39;)]:
    print(t1, t2)
    print(cooccuring_with_pair(m, documents, t1, t2))
    print()
# jumps dog
# {&#39;brown&#39;, &#39;lazy&#39;, &#39;fox&#39;, &#39;cat&#39;, &#39;the&#39;, &#39;hearing&#39;, &#39;over&#39;, &#39;jumps&#39;, &#39;dog&#39;, &#39;fright&#39;, &#39;grey&#39;, &#39;when&#39;, &#39;quick&#39;, &#39;in&#39;}
#
# brown fox
# {&#39;brown&#39;, &#39;lazy&#39;, &#39;and&#39;, &#39;fox&#39;, &#39;the&#39;, &#39;slow&#39;, &#39;grows&#39;, &#39;old&#39;, &#39;over&#39;, &#39;jumps&#39;, &#39;dog&#39;, &#39;quick&#39;}
#
# cat fox
# set()

huangapple
  • 本文由 发表于 2023年2月16日 16:22:21
  • 转载请务必保留本文链接:https://go.coder-hub.com/75469512.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定