2023年2月16日 16:22:21go评论56阅读模式

英文:

Efficient data structure to look up co-occurrence within a document

问题

要求1：

给定一组标记，比如(t1, t2)（实际上可能会有更多的标记），其中t1, t2属于V，我想找到所有与t1, t2在至少一个文档中共同出现的标记。也就是说，找到所有的标记u，满足u在V且t1, t2, u构成至少一个文档的子集。

要求2：

给定一组标记，找到包含这些标记的文档。

是否有高效的数据结构来满足我的要求？ "高效" 意味着避免迭代所有文档。

英文:

I have 10 million documents. Each document is a set of tokens (about 100 unique tokens) regardless of their frequency (bag-of-words). All unique tokens in all documents form the vocabulary V. The size of V is roughly 50000.

Requirement 1:

Given a set of tokens, say (t1, t2) (there may be more than two tokens in practice) where t1,t2 in V, I want to find all tokens that co-occur with t1,t2 in at least one document. That is to say, find all tokens u, satisfying u in V and t1,t2,u constituting a subset of at least one document.

Requirement 2:

Given a set of tokens, find documents that contain these tokens.

Is there any efficient data structure to fulfill my requirements? "Efficienct" means avoid iterating all documents.

答案1

得分: 2

以下是代码部分的翻译：

You can use a map (usually implemented as either a hashmap or a binary tree) that maps each token to the set of documents that contain this token.

当给定两个标记 t1, t2 时，计算两个对应的文档集合的交集。这给出了包含 t1 和 t2 的文档集合。然后返回包含这些文档中所有标记的并集。

In python:

from collections import defaultdict

def build_map(documents):
    m = defaultdict(set)
    for i, document in enumerate(documents):
        for token in document:
            m[token].add(i)
    return m

def cooccuring_with_pair(m, documents, t1, t2):
    doc_ids = set.intersection(m[t1], m[t2])
    return set().union(*(documents[i] for i in doc_ids))

Testing the python code with a small example:

documents = [set(s.split()) for s in ('the quick brown fox jumps over the lazy dog', 'the grey cat jumps in fright when hearing the dog', 'two brown mice are chased by a dog', 'the quick brown fox grows old and slow')]

m = build_map(documents)
print(m)
# defaultdict(<class 'set'>, {'brown': {0, 2, 3}, 'lazy': {0}, 'fox': {0, 3}, 'the': {0, 1, 3}, 'over': {0}, 'jumps': {0, 1}, 'dog': {0, 1, 2}, 'quick': {0, 3}, 'cat': {1}, 'hearing': {1}, 'fright': {1}, 'grey': {1}, 'when': {1}, 'in': {1}, 'mice': {2}, 'by': {2}, 'a': {2}, 'chased': {2}, 'two': {2}, 'are': {2}, 'and': {3}, 'slow': {3}, 'grows': {3}, 'old': {3}})

for t1, t2 in [('jumps', 'dog'), ('brown', 'fox'), ('cat', 'fox')]:
    print(t1, t2)
    print(cooccuring_with_pair(m, documents, t1, t2))
    print()

希望这对你有帮助。如果有其他翻译需求，请随时告诉我。

英文:

You can use a map (usually implemented as either a hashmap or a binary tree) that maps each token to the set of documents that contain this token.

When given two tokens t1,t2, compute the intersection of the two corresponding sets of documents. This gives you the set of documents that contain both t1 and t2. Then return the union of all the tokens contained in these documents.

In python:

from collections import defaultdict

def build_map(documents):
    m = defaultdict(set)
    for i, document in enumerate(documents):
        for token in document:
            m[token].add(i)
    return m

def cooccuring_with_pair(m, documents, t1, t2):
    doc_ids = set.intersection(m[t1], m[t2])
    return set().union(*(documents[i] for i in doc_ids))

Testing the python code with a small example:

documents = [set(s.split()) for s in (&#39;the quick brown fox jumps over the lazy dog&#39;, &#39;the grey cat jumps in fright when hearing the dog&#39;, &#39;two brown mice are chased by a dog&#39;, &#39;the quick brown fox grows old and slow&#39;)]

m = build_map(documents)
print(m)
# defaultdict(&lt;class &#39;set&#39;&gt;, {&#39;brown&#39;: {0, 2, 3}, &#39;lazy&#39;: {0}, &#39;fox&#39;: {0, 3}, &#39;the&#39;: {0, 1, 3}, &#39;over&#39;: {0}, &#39;jumps&#39;: {0, 1}, &#39;dog&#39;: {0, 1, 2}, &#39;quick&#39;: {0, 3}, &#39;cat&#39;: {1}, &#39;hearing&#39;: {1}, &#39;fright&#39;: {1}, &#39;grey&#39;: {1}, &#39;when&#39;: {1}, &#39;in&#39;: {1}, &#39;mice&#39;: {2}, &#39;by&#39;: {2}, &#39;a&#39;: {2}, &#39;chased&#39;: {2}, &#39;two&#39;: {2}, &#39;are&#39;: {2}, &#39;and&#39;: {3}, &#39;slow&#39;: {3}, &#39;grows&#39;: {3}, &#39;old&#39;: {3}})

for t1, t2 in [(&#39;jumps&#39;, &#39;dog&#39;), (&#39;brown&#39;, &#39;fox&#39;), (&#39;cat&#39;, &#39;fox&#39;)]:
    print(t1, t2)
    print(cooccuring_with_pair(m, documents, t1, t2))
    print()
# jumps dog
# {&#39;brown&#39;, &#39;lazy&#39;, &#39;fox&#39;, &#39;cat&#39;, &#39;the&#39;, &#39;hearing&#39;, &#39;over&#39;, &#39;jumps&#39;, &#39;dog&#39;, &#39;fright&#39;, &#39;grey&#39;, &#39;when&#39;, &#39;quick&#39;, &#39;in&#39;}
#
# brown fox
# {&#39;brown&#39;, &#39;lazy&#39;, &#39;and&#39;, &#39;fox&#39;, &#39;the&#39;, &#39;slow&#39;, &#39;grows&#39;, &#39;old&#39;, &#39;over&#39;, &#39;jumps&#39;, &#39;dog&#39;, &#39;quick&#39;}
#
# cat fox
# set()

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

高效的数据结构用于在文档中查找共现。

问题

答案1

Enum vs Struct在Swift中的特定情况下。

完全二叉树中两个子树的节点数

A*算法在解决八数码谜题时如何解决平局。

如何根据一个列表对两个不同大小的列表进行重新排序？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论