Is there a regular expression or similar simple approach for checking whether a MediaWiki page title is valid?

Question

https://www.mediawiki.org/wiki/Manual:Page_title lists many conditions for what a MediaWiki page title may not contain. Checking whether a string is a valid MediaWiki page title by working through that list does not look easy.

Is there a regular expression or a similarly simple approach to check whether a page title is valid?

The best I could find so far is some Java code (from https://github.com/MER-C/wiki-java/blob/master/src/org/wikipedia/Wiki.java). My target language is Python, though.

    /**
     *  Convenience method for normalizing MediaWiki titles. (Converts all
     *  underscores to spaces).
     *  @param s the string to normalize
     *  @return the normalized string
     *  @throws IllegalArgumentException if the title is invalid
     *  @throws IOException if a network error occurs (rare)
     *  @since 0.27
     */
    public String normalize(String s) throws IOException
    {
        // remove leading colon
        if (s.startsWith(":"))
            s = s.substring(1);
        if (s.isEmpty())
            return s;

        int ns = namespace(s);
        // localize namespace names
        if (ns != MAIN_NAMESPACE)
        {
            int colon = s.indexOf(":");
            s = namespaceIdentifier(ns) + s.substring(colon);
        }
        char[] temp = s.toCharArray();
        if (wgCapitalLinks)
        {
            // convert first character in the actual title to upper case
            if (ns == MAIN_NAMESPACE)
                temp[0] = Character.toUpperCase(temp[0]);
            else
            {
                int index = namespaceIdentifier(ns).length() + 1; // + 1 for colon
                temp[index] = Character.toUpperCase(temp[index]);
            }
        }

        for (int i = 0; i < temp.length; i++)
        {
            switch (temp[i])
            {
                // illegal characters
                case '{':
                case '}':
                case '<':
                case '>':
                case '[':
                case ']':
                case '|':
                    throw new IllegalArgumentException(s + " is an illegal title");
                case '_':
                    temp[i] = ' ';
                    break;
            }
        }
        // https://www.mediawiki.org/wiki/Unicode_normalization_considerations
        String temp2 = new String(temp).trim().replaceAll("\\s+", " ");
        return Normalizer.normalize(temp2, Normalizer.Form.NFC);
    }

Answer 1

Score: 1

If you can call the target wiki API to do the normalisation, then this is an example of an API call that normalises page titles:

The normalised title will be in /query/normalized/0/to. You can send several titles to normalise at once, separating them with |.
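Since the target language is Python, here is a minimal sketch of such a call using only the standard library. It rests on several assumptions that should be checked against the target wiki: the endpoint is the English Wikipedia `api.php`, the response is requested with `format=json`, and invalid titles are reported under `query.pages` with `invalid` and `invalidreason` fields. The function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: English Wikipedia endpoint; substitute the target wiki's api.php.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(titles, api_url=API_URL):
    """Build an action=query URL; several titles are joined with '|'."""
    params = {"action": "query", "format": "json", "titles": "|".join(titles)}
    return api_url + "?" + urllib.parse.urlencode(params)


def normalized_titles(response):
    """Map each submitted title to its normalised form, where the API changed it."""
    entries = response.get("query", {}).get("normalized", [])
    return {entry["from"]: entry["to"] for entry in entries}


def invalid_titles(response):
    """Collect titles flagged as invalid, together with the reported reason.

    Assumption: with format=json, "pages" is a dict keyed by page id and
    invalid entries carry "invalid" and "invalidreason" keys.
    """
    pages = response.get("query", {}).get("pages", {})
    return {
        page["title"]: page.get("invalidreason", "")
        for page in pages.values()
        if "invalid" in page
    }


def fetch(titles, api_url=API_URL):
    """Send the request and decode the JSON body (needs network access)."""
    request = urllib.request.Request(
        build_query_url(titles, api_url),
        headers={"User-Agent": "title-check-sketch/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Calling `fetch(["Project:articleA", "article_B"])` and passing the result to `normalized_titles` and `invalid_titles` would then give the normalised forms and any rejected titles, provided the response has the shape assumed above.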

The example is taken from https://www.mediawiki.org/wiki/API:Query#Example_2:_Title_normalization.
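If calling the API is not an option, the character check from the Java code in the question can be approximated locally in Python. This is only a rough sketch, not a faithful port: it skips namespace handling entirely, rejects only the characters that the Java snippet rejects, and ignores the other rules listed on Manual:Page_title. The name `normalize_title` and the `capitalize_first` flag (standing in for `wgCapitalLinks`) are made up for illustration.

```python
import re
import unicodedata

# Only the characters rejected by the Java snippet in the question.
ILLEGAL_CHARS = re.compile(r"[{}<>\[\]|]")


def normalize_title(title, capitalize_first=True):
    """Rough counterpart of the Java normalize(), without namespace handling."""
    # Remove a leading colon, as the Java code does.
    if title.startswith(":"):
        title = title[1:]
    if not title:
        return title
    if ILLEGAL_CHARS.search(title):
        raise ValueError(f"{title!r} is an illegal title")
    # Underscores become spaces, then whitespace runs collapse to one space.
    title = re.sub(r"\s+", " ", title.replace("_", " ")).strip()
    if capitalize_first and title:
        title = title[0].upper() + title[1:]
    # Unicode normalisation to NFC, matching the last step of the Java code.
    return unicodedata.normalize("NFC", title)
```

A title passes this check if the call returns without raising `ValueError`. Passing here does not guarantee that the wiki itself accepts the title, so the API remains the authoritative check.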

Posted by huangapple on July 28, 2020 at 01:15:56