英文:
Efficient data structure to look up co-occurrence within a document
问题
要求1:
给定一组标记,比如(t1, t2)
(实际上可能会有更多的标记),其中t1, t2
属于V
,我想找到所有与t1, t2
在至少一个文档中共同出现的标记。也就是说,找到所有的标记u
,满足u在V
且t1, t2, u
构成至少一个文档的子集。
要求2:
给定一组标记,找到包含这些标记的文档。
是否有高效的数据结构来满足我的要求? "高效" 意味着避免迭代所有文档。
英文:
I have 10 million documents. Each document is a set of tokens (about 100 unique tokens) regardless of their frequency (bag-of-words). All unique tokens in all documents form the vocabulary V
. The size of V
is roughly 50000.
Requirement 1:
Given a set of tokens, say (t1, t2)
(there may be more than two tokens in practice) where t1,t2 in V
, I want to find all tokens that co-occur with t1,t2
in at least one document. That is to say, find all tokens u
, satisfying u in V
and t1,t2,u
constituting a subset of at least one document.
Requirement 2:
Given a set of tokens, find documents that contain these tokens.
Is there any efficient data structure to fulfill my requirements? "Efficienct" means avoid iterating all documents.
答案1
得分: 2
以下是代码部分的翻译:
You can use a map (usually implemented as either a hashmap or a binary tree) that maps each token to the set of documents that contain this token.
当给定两个标记 t1, t2 时,计算两个对应的文档集合的交集。这给出了包含 t1 和 t2 的文档集合。然后返回包含这些文档中所有标记的并集。
In python:
from collections import defaultdict
def build_map(documents):
m = defaultdict(set)
for i, document in enumerate(documents):
for token in document:
m[token].add(i)
return m
def cooccuring_with_pair(m, documents, t1, t2):
doc_ids = set.intersection(m[t1], m[t2])
return set().union(*(documents[i] for i in doc_ids))
Testing the python code with a small example:
documents = [set(s.split()) for s in ('the quick brown fox jumps over the lazy dog', 'the grey cat jumps in fright when hearing the dog', 'two brown mice are chased by a dog', 'the quick brown fox grows old and slow')]
m = build_map(documents)
print(m)
# defaultdict(<class 'set'>, {'brown': {0, 2, 3}, 'lazy': {0}, 'fox': {0, 3}, 'the': {0, 1, 3}, 'over': {0}, 'jumps': {0, 1}, 'dog': {0, 1, 2}, 'quick': {0, 3}, 'cat': {1}, 'hearing': {1}, 'fright': {1}, 'grey': {1}, 'when': {1}, 'in': {1}, 'mice': {2}, 'by': {2}, 'a': {2}, 'chased': {2}, 'two': {2}, 'are': {2}, 'and': {3}, 'slow': {3}, 'grows': {3}, 'old': {3}})
for t1, t2 in [('jumps', 'dog'), ('brown', 'fox'), ('cat', 'fox')]:
print(t1, t2)
print(cooccuring_with_pair(m, documents, t1, t2))
print()
希望这对你有帮助。如果有其他翻译需求,请随时告诉我。
英文:
You can use a map (usually implemented as either a hashmap or a binary tree) that maps each token to the set of documents that contain this token.
When given two tokens t1,t2, compute the intersection of the two corresponding sets of documents. This gives you the set of documents that contain both t1 and t2. Then return the union of all the tokens contained in these documents.
In python:
from collections import defaultdict
def build_map(documents):
m = defaultdict(set)
for i, document in enumerate(documents):
for token in document:
m[token].add(i)
return m
def cooccuring_with_pair(m, documents, t1, t2):
doc_ids = set.intersection(m[t1], m[t2])
return set().union(*(documents[i] for i in doc_ids))
Testing the python code with a small example:
documents = [set(s.split()) for s in ('the quick brown fox jumps over the lazy dog', 'the grey cat jumps in fright when hearing the dog', 'two brown mice are chased by a dog', 'the quick brown fox grows old and slow')]
m = build_map(documents)
print(m)
# defaultdict(<class 'set'>, {'brown': {0, 2, 3}, 'lazy': {0}, 'fox': {0, 3}, 'the': {0, 1, 3}, 'over': {0}, 'jumps': {0, 1}, 'dog': {0, 1, 2}, 'quick': {0, 3}, 'cat': {1}, 'hearing': {1}, 'fright': {1}, 'grey': {1}, 'when': {1}, 'in': {1}, 'mice': {2}, 'by': {2}, 'a': {2}, 'chased': {2}, 'two': {2}, 'are': {2}, 'and': {3}, 'slow': {3}, 'grows': {3}, 'old': {3}})
for t1, t2 in [('jumps', 'dog'), ('brown', 'fox'), ('cat', 'fox')]:
print(t1, t2)
print(cooccuring_with_pair(m, documents, t1, t2))
print()
# jumps dog
# {'brown', 'lazy', 'fox', 'cat', 'the', 'hearing', 'over', 'jumps', 'dog', 'fright', 'grey', 'when', 'quick', 'in'}
#
# brown fox
# {'brown', 'lazy', 'and', 'fox', 'the', 'slow', 'grows', 'old', 'over', 'jumps', 'dog', 'quick'}
#
# cat fox
# set()
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论