Optimal way to check if a given sentence (query) contains any predefined keywords

Question

It might look like a simple question that has already been answered countless times, but I could not find the optimal way (using some database).

I have a list of a few thousand keywords (let's say abusive words). Whenever someone posts a message (a long sentence or a paragraph), I want to check whether it contains any of the keywords, so that I can block the user or take other action.

I am looking for a database/schema that can solve this problem and respond within a few milliseconds (<15 ms).

There are many databases that solve the reverse problem: given the keywords, find the documents that contain them (text search).

Answer 1

Score: 2

Try ClickHouse for your workload.

According to the docs:

> multiMatchAny(...) returns 0 if none of the regular expressions are matched and 1 if any of the patterns matches. It uses hyperscan library. For patterns to search substrings in a string, it is better to use multiSearchAny since it works much faster.
> The length of any of the haystack string must be less than 2^32 bytes.
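
To make the difference between the two functions concrete, here is a minimal sketch in plain Python that only imitates their return values; it is not ClickHouse code and says nothing about ClickHouse's performance. The helper names `multi_search_any` and `multi_match_any`, the keyword list, and the sample message are all made up for illustration. In ClickHouse itself the call would look roughly like `SELECT multiSearchAny(message, ['kw1', 'kw2'])`, where `message` is an assumed column name.

```python
import re
from typing import Iterable


def multi_search_any(haystack: str, needles: Iterable[str]) -> int:
    """Imitates multiSearchAny: 1 if any needle is a substring of haystack, else 0."""
    return int(any(needle in haystack for needle in needles))


def multi_match_any(haystack: str, patterns: Iterable[str]) -> int:
    """Imitates multiMatchAny: 1 if any regular expression matches in haystack, else 0."""
    return int(any(re.search(pattern, haystack) for pattern in patterns))


# Placeholder data, not taken from the question.
keywords = ["badword", "worseword"]
message = "this message contains a badword in the middle"

print(multi_search_any(message, keywords))      # plain substring check
print(multi_match_any(message, [r"\bbad\w*"]))  # regular-expression check
```

If the keywords are plain strings rather than patterns, the substring variant is the one the quoted docs recommend.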

huangapple
  • Posted on January 6, 2020, 15:46:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59608395.html