Is there a way to find the EXACT string of a word in a discord message?
# Question

I am currently working on a Discord bot that filters messages. My problem occurs when I filter words that are contained in other words, which triggers duplicate messages.

This is my filter.txt:

    sad
    sadness
    sadnesses

Since "sad" is also contained in "sadness", I get a false positive for "sad" whenever someone writes "sadness".

Is it possible to detect only the exact word in a message? For example: `I want to be happy, because sadness is bad` → *detect only "sadness"*.

I hope you understand what I mean.

Code:

    public void onGuildMessageReceived(GuildMessageReceivedEvent e) {
        File file = new File("src/filter.txt");
        try {
            BufferedReader br = new BufferedReader(new FileReader(file));
            String line;
            while ((line = br.readLine()) != null) {
                if (!line.startsWith("#")) {
                    if (e.getMessage().getContentRaw().contains(line)) {
                        User user = e.getJDA().getUserById(e.getAuthor().getIdLong());
                        e.getMessage().delete().queue();
                        user.openPrivateChannel().queue(privateChannel -> {
                            privateChannel.sendMessage("Please watch your language!").queue();
                        });
                    }
                }
            }
        } catch (IOException e1) {}
    }

# Answer 1
**Score**: 2
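As *Cardinal - Reinstate Monica* and *Hades* already said, you should take a look at regex.

'Regex' is short for 'regular expression' and describes search patterns for strings.
There is a lot you can do with regex, so if you want to know more about it, check out this [tutorial](https://www.vogella.com/tutorials/JavaRegularExpressions/article.html).
(It's the first one I found when googling; you can of course use any tutorial you like.)

For your use case I would suggest the following:

First off, don't use `String.contains()`, as it only does a literal substring search and does not understand regex.
Use `String.matches()` instead, with the following regex:

    "(?is).*\\bSTRING\\b.*"

That is the Java string literal, in which every backslash is escaped. Without the escaping, the regex looks like this:

    (?is).*\bSTRING\b.*

I will explain how it works.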
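**`\b`**

`\b` matches a word boundary. Word characters are `a`-`z`, `A`-`Z`, `0`-`9` and `_`. Any run of these characters is considered a word.
This has the advantage that you can match the word *sad* in the following cases:

* "I am sad." → The `.` at the end of the sentence doesn't affect the detection.
* "sad is my thing" → The word is matched even when it is the first one. (This also relies on the `.*`.)

When the message contains *sadness*, it won't match *sad*, because the word continues afterwards:

* "I am feeling the sadness!" → Because the word doesn't end after "sad", it's not a match. Matching *sadness* would work.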
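
The cases listed above can be checked with a minimal sketch. The class name `WordBoundaryDemo` and the helper `containsWord` are made up for illustration; only the regex and `String.matches()` come from the answer. The helper assumes the word consists of plain letters or digits, because it is inserted into the regex unescaped.

```java
public class WordBoundaryDemo {

    // Hypothetical helper: true if `word` occurs in `text` as a whole word.
    // Assumes `word` contains no regex metacharacters.
    public static boolean containsWord(String text, String word) {
        return text.matches("(?is).*\\b" + word + "\\b.*");
    }

    public static void main(String[] args) {
        System.out.println(containsWord("I am sad.", "sad"));                     // true
        System.out.println(containsWord("sad is my thing", "sad"));               // true
        System.out.println(containsWord("I am feeling the sadness!", "sad"));     // false
        System.out.println(containsWord("I am feeling the sadness!", "sadness")); // true
        // For comparison, the substring check from the question:
        System.out.println("I am feeling the sadness!".contains("sad"));          // true
    }
}
```

The last line shows the false positive from the question: `contains()` finds "sad" inside "sadness", while the `\b` version does not.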
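**`.*`**

`.` matches any character except line breaks. (`(?s)` lifts that restriction; see further down.)
`*` means that the part in front of it occurs zero or more times.
With a `.*` before and after the word, the regex accepts *any* character or combination of characters (including none) around it.
That's important, because `String.matches()` has to match the entire message. This way the word can appear anywhere in any sentence and will still match.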
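
Why the surrounding `.*` is needed can be seen in a small sketch: `String.matches()` only succeeds when the pattern covers the whole input. The class and method names are invented for illustration. `Matcher.find()` from `java.util.regex` is shown as an alternative that searches for the pattern anywhere in the input; it is not what the answer uses.

```java
import java.util.regex.Pattern;

public class MatchesVsFindDemo {

    // Whole-input match: the regex has to account for every character.
    public static boolean wholeMatch(String text, String regex) {
        return text.matches(regex);
    }

    // Search: the regex may match any part of the input.
    public static boolean foundAnywhere(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "I am feeling the sadness!";
        System.out.println(wholeMatch(text, "\\bsadness\\b"));     // false
        System.out.println(wholeMatch(text, ".*\\bsadness\\b.*")); // true
        System.out.println(foundAnywhere(text, "\\bsadness\\b"));  // true
        System.out.println(foundAnywhere(text, "\\bsad\\b"));      // false
    }
}
```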
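**`(?is)`**

`(?i)` and `(?s)` are inline flags that enable certain modes.
`(?i)` makes the regex case insensitive. This means *sadness*, *SADNESS* and *sAdNeSs* all match.
`(?s)` enables 'single-line mode' (`DOTALL` in Java), which simply means that `.` also matches line breaks.
The two can be combined into `(?is)` and placed at the front of the regex.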
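
The effect of each flag can be tried out with a short sketch. The class name `FlagsDemo` and the helper `matchesSadness` are hypothetical. `Pattern.CASE_INSENSITIVE` and `Pattern.DOTALL` are the `java.util.regex.Pattern` constants that correspond to `(?i)` and `(?s)`.

```java
import java.util.regex.Pattern;

public class FlagsDemo {

    // Hypothetical helper: prepends the given inline flags to the regex.
    public static boolean matchesSadness(String inlineFlags, String text) {
        return text.matches(inlineFlags + ".*\\bsadness\\b.*");
    }

    public static void main(String[] args) {
        String multiLine = "first line\nI feel SADNESS today\nlast line";

        // No flags: '.' stops at '\n' and the match is case sensitive.
        System.out.println(matchesSadness("", multiLine));      // false
        // (?i) alone: case is ignored, but '.' still cannot cross '\n'.
        System.out.println(matchesSadness("(?i)", multiLine));  // false
        // (?is): case is ignored and '.' also matches line breaks.
        System.out.println(matchesSadness("(?is)", multiLine)); // true

        // Same behaviour with flag constants instead of inline flags.
        Pattern p = Pattern.compile(".*\\bsadness\\b.*",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        System.out.println(p.matcher(multiLine).matches());     // true
    }
}
```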
Instead of `STRING`, you just insert your word like this:

    "(?is).*\\b" + line + "\\b.*"
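In the end, your code would look like this:

```java
public void onGuildMessageReceived(GuildMessageReceivedEvent e) {
    File file = new File("src/filter.txt");
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        String content = e.getMessage().getContentRaw();
        String line;
        while ((line = br.readLine()) != null) {
            // Skip blank lines and comment lines starting with '#'
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            // Match the filter word only as a whole word, ignoring case
            if (content.matches("(?is).*\\b" + line + "\\b.*")) {
                User user = e.getAuthor();
                e.getMessage().delete().queue();
                user.openPrivateChannel().queue(privateChannel -> {
                    privateChannel.sendMessage("Please mind your language!").queue();
                });
            }
        }
    } catch (IOException e1) {
        e1.printStackTrace();
    }
}
```

If you want only one notification per message (that is, stop after the first match), you can insert a `return;` after matching a word and sending the message to the user.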
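
One caveat with building the regex by concatenation, as in `"(?is).*\\b" + line + "\\b.*"`: the line from the file is interpreted as regex syntax, so an entry containing characters such as `.`, `+` or `(` changes the pattern or makes it invalid. A possible safeguard, which is not part of the answer above, is `Pattern.quote()`, which makes the entry match literally. The class and method names below are hypothetical.

```java
import java.util.regex.Pattern;

public class QuotedFilterDemo {

    // Builds the whole-word regex for one filter entry, treating the entry
    // as literal text rather than as regex syntax.
    public static String wordRegex(String entry) {
        return "(?is).*\\b" + Pattern.quote(entry) + "\\b.*";
    }

    public static void main(String[] args) {
        // Unquoted, the '.' in "s.d" is a wildcard and also matches "sad".
        System.out.println("I am sad".matches("(?is).*\\bs.d\\b.*"));     // true
        // Quoted, only the literal text "s.d" matches.
        System.out.println("I am sad".matches(wordRegex("s.d")));         // false
        System.out.println("I typed s.d here".matches(wordRegex("s.d"))); // true
        // The (?i) flag still applies to quoted text.
        System.out.println("I am SAD".matches(wordRegex("sad")));         // true
    }
}
```

Note that `\b` is only meaningful for entries that begin and end with a word character; entries that start or end with punctuation would need different handling.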
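# Answer 2
**Score**: 0

You could also try a string searching algorithm such as Aho-Corasick, but that would require implementing a proper signature table. An algorithm like this would scale much better to a larger word list.

Note that such algorithms are easily circumvented. Simply adding whitespace or using 1337 character replacement would outsmart a naive word filter.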
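
To illustrate the circumvention point, here is a very rough sketch of a normalisation step that undoes a few common character substitutions before the word check runs. The substitution table and all names are assumptions chosen for the example. It is far from complete, and it does nothing about inserted whitespace, as the last call shows.

```java
import java.util.Locale;
import java.util.Map;

public class NormalizeDemo {

    // Small, illustrative substitution table; not exhaustive.
    private static final Map<Character, Character> LEET = Map.of(
            '4', 'a', '@', 'a', '3', 'e', '1', 'i',
            '0', 'o', '5', 's', '$', 's', '7', 't');

    // Lower-cases the text and replaces a few look-alike characters.
    public static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            char mapped = LEET.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    // Whole-word check on the normalised text; assumes `word` is plain letters.
    public static boolean containsWord(String text, String word) {
        return normalize(text).matches("(?s).*\\b" + word + "\\b.*");
    }

    public static void main(String[] args) {
        System.out.println(normalize("S4DN3$$"));                            // sadness
        System.out.println(containsWord("so much S4DN3$$ here", "sadness")); // true
        System.out.println(containsWord("s a d n e s s", "sadness"));        // false
    }
}
```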