Java Regex vs. PHP, Dangling meta character '?'

Question

I'm tagging this with PHP even though it's a Java question. The regex is copied from a PHP source so I'm hoping some PHPers can help with the question.

I decided to build a simple spam filter, just for fun, and I copied the spam blocklist from MediaWiki: https://meta.wikimedia.org/wiki/Spam_blacklist

Mostly this seems to work, but a few of the patterns fail with a syntax error. I don't know if this is a typo or if PHP uses a different syntax than Java. Can anyone help me fix these regexes so that they compile?

Here are the problems:

java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 17
\bfacebo(?:o[ob]|?o)k\.com\b
                 ^
java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 5
\b????\.tk\b
     ^
java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 0
??\.xsl\.pt\b
^
java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 4
\b????\.shop\b
    ^
java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 4
\b???\.??\b
    ^

Here's the code that compiles them, in case you're interested. I don't think it makes a difference though.

   private static synchronized void init() throws IOException {
      
      if( blackListPatterns.get() != null ) return;
      InputStream blacklistfile = SpamBlackList.class.getResourceAsStream( "blacklist.txt" );
      BufferedReader buf = new BufferedReader( new InputStreamReader( blacklistfile, "UTF-8" ) );
      ArrayList<String> blacklist = new ArrayList<>( 12000 );
      for( String line; (line = buf.readLine()) != null; )
         if( !line.isBlank() && line.trim().charAt(0) != '#' )
            blacklist.add( line );
      ArrayList<Pattern> tempPatterns = new ArrayList<>( blacklist.size() );
      for( String pat : blacklist )
         try {
            tempPatterns.add( Pattern.compile( pat ) );
         } catch ( java.util.regex.PatternSyntaxException ex ) {
            System.err.println( ex );  // should log this, low level like FINER
         }
      blackListPatterns = new WeakReference<>( tempPatterns );
   }
   
   private static volatile WeakReference<List<Pattern>> 
           blackListPatterns = new WeakReference<>( null );

Answer 1

Score: 2

Your downloaded copy of https://meta.wikimedia.org/wiki/Spam_blacklist (blacklist.txt) is corrupt. Each dangling question mark stands in for a non-ASCII character that was replaced somewhere along the way. For example, \bfacebo(?:o[ob]|?o)k\.com\b is actually \bfacebo(?:o[ob]|ıo)k\.com\b. Note the dotless "ı" (U+0131).
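
To see the difference concretely, here is a minimal sketch that compiles both forms of the facebook pattern. The class name DanglingDemo and the helper compiles are made up for illustration; only java.util.regex calls from the JDK are used.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of the asker's code.
class DanglingDemo {

    /** Returns true if the regex compiles under java.util.regex, false on a syntax error. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The corrupted form: '?' directly after '|' has nothing to quantify.
        String corrupted = "\\bfacebo(?:o[ob]|?o)k\\.com\\b";
        // The intact form: U+0131 (dotless i) sits where the '?' was.
        String intact = "\\bfacebo(?:o[ob]|\u0131o)k\\.com\\b";

        System.out.println("corrupted compiles: " + compiles(corrupted));
        System.out.println("intact compiles:    " + compiles(intact));
        System.out.println("intact matches:     "
                + Pattern.compile(intact).matcher("facebo\u0131ok.com").find());
    }
}
```

The corrupted pattern is rejected because a bare ? is a quantifier with no preceding element, which is exactly what "Dangling meta character" reports.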

Download https://meta.wikimedia.org/wiki/Spam_blacklist?action=raw instead, and make sure you save and read it as UTF-8.
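
The question does not say how blacklist.txt was saved, so the following is only one plausible explanation: if the text passed through an ASCII-only encoding at some point, Java (like many tools) substitutes ? for every character it cannot encode. The sketch below shows that substitution and a reader that decodes the list explicitly as UTF-8. BlacklistReader and both method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the asker's code.
class BlacklistReader {

    /** Reads non-blank, non-comment lines from the stream, decoding it as UTF-8. */
    static List<String> readPatterns(InputStream in) {
        List<String> patterns = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            for (String line; (line = reader.readLine()) != null; ) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty() && trimmed.charAt(0) != '#') {
                    patterns.add(trimmed);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return patterns;
    }

    /** Simulates a lossy save: characters outside US-ASCII come back as '?'. */
    static String roundTripThroughAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }
}
```

Calling roundTripThroughAscii on a string containing ı returns the same string with ? in its place, which reproduces the shape of the broken patterns. Searching the downloaded file for ? in positions where no quantifier makes sense is a quick way to check whether a copy is damaged.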

You may also want to pass the Unicode flag when compiling the regular expressions. Also note that:
> What is referred to here as regular expressions are not proper regular expressions, but rather subpatterns that are inserted into a hard-coded regular expression. i.e. the subpattern Foo from above would create a regular expression like /^Foo$/usi.

(see https://www.mediawiki.org/wiki/Extension:TitleBlacklist#Block_list).
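
As a rough sketch of what the /usi modifiers could look like in Java: the flag mapping below is an approximation, not an exact PCRE equivalent, and the anchored ^...$ wrapper follows the TitleBlacklist quote above, which may not be how the spam list itself is applied. SubpatternCompiler and its method names are hypothetical, and the non-capturing group around the subpattern is an addition to keep alternations inside the anchors.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of MediaWiki or the asker's code.
class SubpatternCompiler {

    // Approximate counterparts of PCRE's i (case-insensitive) and s (dot matches newline).
    // Java strings are already Unicode, so UNICODE_CASE stands in for the u modifier here
    // by making the case-insensitive comparison Unicode-aware.
    static final int FLAGS =
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.DOTALL;

    /** Wraps the subpattern as ^(?:sub)$, mirroring the /^Foo$/usi shape from the quote. */
    static Pattern compileAnchored(String subpattern) {
        return Pattern.compile("^(?:" + subpattern + ")$", FLAGS);
    }

    /** Compiles the subpattern as-is, for searching inside longer text with find(). */
    static Pattern compileUnanchored(String subpattern) {
        return Pattern.compile(subpattern, FLAGS);
    }
}
```

For a spam filter that scans message bodies, the unanchored variant with Matcher.find() is probably the closer fit, since the anchored form only accepts input that matches the subpattern in full.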

Posted by huangapple on 2020-10-12 10:40:01
Original link: https://go.coder-hub.com/64311009.html