Optimal way to check whether a given sentence (query) contains any of a set of predefined keywords
Question
This might look like a simple question that has already been answered countless times, but I could not find the optimal way to do it (using some database).
I have a list of a few thousand keywords (let's say abusive words). Whenever someone posts a message (a long sentence or a paragraph), I want to check whether it contains any of the keywords, so that I can block the user or take other action.
I am looking for a database/schema that can solve this problem and respond within a few milliseconds (<15 ms).
There are many databases that solve the reverse problem: given keywords, find the documents that contain them (text search).
Answer 1
Score: 2
Try ClickHouse for your workload.
According to the docs:
> multiMatchAny(...) returns 0 if none of the regular expressions are matched and 1 if any of the patterns matches. It uses hyperscan library. For patterns to search substrings in a string, it is better to use multiSearchAny since it works much faster.
The length of any haystack string must be less than 2^32 bytes.
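To make the quoted return values concrete, here is a minimal pure-Python sketch of the same "any keyword is a substring" check. It is only a stand-in that mirrors the documented 0/1 result of multiSearchAny, not ClickHouse's implementation; the function name `multi_search_any`, the `KEYWORDS` list, and the sample messages are all hypothetical.

```python
from typing import Iterable


def multi_search_any(haystack: str, needles: Iterable[str]) -> int:
    """Return 1 if any needle occurs as a substring of haystack, else 0.

    Hypothetical stand-in mirroring the documented return values of
    ClickHouse's multiSearchAny; matching is plain and case-sensitive.
    """
    return int(any(needle in haystack for needle in needles))


# Hypothetical keyword list and messages, for illustration only.
KEYWORDS = ["badword", "spam", "scam"]

print(multi_search_any("this is obvious spam, please ignore", KEYWORDS))  # 1
print(multi_search_any("a perfectly polite message", KEYWORDS))           # 0
```

Note that plain substring matching also fires inside longer words (for example "scam" inside "scamper"), so the keyword list may need word boundaries or normalization depending on the use case.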