July 28, 2020 01:15:56

Is there a regular expression or similar simple approach for checking whether a MediaWiki page title is valid?

Question

https://www.mediawiki.org/wiki/Manual:Page_title lists many conditions on what a MediaWiki page title may not contain. Checking a string against that list to decide whether it is a valid page title does not look easy.

Is there a regular expression or a similarly simple approach to check whether a page title is valid?

The best I could find so far is some Java code (from https://github.com/MER-C/wiki-java/blob/master/src/org/wikipedia/Wiki.java). My target language is Python, though.

    /**
     *  Convenience method for normalizing MediaWiki titles. (Converts all
     *  underscores to spaces).
     *  @param s the string to normalize
     *  @return the normalized string
     *  @throws IllegalArgumentException if the title is invalid
     *  @throws IOException if a network error occurs (rare)
     *  @since 0.27
     */
    public String normalize(String s) throws IOException
    {
        // remove leading colon
        if (s.startsWith(":"))
            s = s.substring(1);
        if (s.isEmpty())
            return s;
        int ns = namespace(s);
        // localize namespace names
        if (ns != MAIN_NAMESPACE)
        {
            int colon = s.indexOf(":");
            s = namespaceIdentifier(ns) + s.substring(colon);
        }
        char[] temp = s.toCharArray();
        if (wgCapitalLinks)
        {
            // convert first character in the actual title to upper case
            if (ns == MAIN_NAMESPACE)
                temp[0] = Character.toUpperCase(temp[0]);
            else
            {
                int index = namespaceIdentifier(ns).length() + 1; // + 1 for colon
                temp[index] = Character.toUpperCase(temp[index]);
            }
        }
        for (int i = 0; i < temp.length; i++)
        {
            switch (temp[i])
            {
                // illegal characters
                case '{':
                case '}':
                case '<':
                case '>':
                case '[':
                case ']':
                case '|':
                    throw new IllegalArgumentException(s + " is an illegal title");
                case '_':
                    temp[i] = ' ';
                    break;
            }
        }
        // https://www.mediawiki.org/wiki/Unicode_normalization_considerations
        String temp2 = new String(temp).trim().replaceAll("\\s+", " ");
        return Normalizer.normalize(temp2, Normalizer.Form.NFC);
    }
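
For reference, the character checks in the Java method above could be sketched in Python roughly as follows. This is only a partial port under stated assumptions: the names `normalize_title`, `is_valid_title`, and `ILLEGAL_CHARS` are made up for illustration, namespace localisation is left out because it needs the wiki's namespace list, and only the characters the Java code rejects are checked, so the other restrictions on the manual page (for example `#`) are not covered. Unlike the Java code, it trims whitespace before capitalising the first letter.

```python
import re
import unicodedata

# Characters rejected by the Java snippet. The MediaWiki manual lists further
# restrictions that this sketch does not attempt to cover.
ILLEGAL_CHARS = re.compile(r"[{}<>\[\]|]")


def normalize_title(title: str, capital_links: bool = True) -> str:
    """Normalise a title roughly like the Java normalize(), without namespaces."""
    # Remove a single leading colon.
    if title.startswith(":"):
        title = title[1:]
    if not title:
        return title
    if ILLEGAL_CHARS.search(title):
        raise ValueError(f"{title!r} is an illegal title")
    # Underscores are equivalent to spaces; collapse whitespace runs and trim.
    title = re.sub(r"\s+", " ", title.replace("_", " ")).strip()
    # Mirror wgCapitalLinks: upper-case the first character of the title.
    if capital_links and title:
        title = title[0].upper() + title[1:]
    # Titles are compared in Unicode normalisation form C.
    return unicodedata.normalize("NFC", title)


def is_valid_title(title: str) -> bool:
    """Return True if the title survives normalisation and is non-empty."""
    try:
        return bool(normalize_title(title))
    except ValueError:
        return False
```

A call such as `is_valid_title("Foo[bar]")` returns `False`, while `normalize_title(":article_B")` gives `"Article B"`. A pure regular expression can reject the forbidden characters, but the remaining rules (namespaces, length limits, relative paths) need extra logic or a round trip to the wiki.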

Answer 1

Score: 1

If you can call the target wiki's API to do the normalisation, here is an example of an API call that normalises page titles:

https://en.wikipedia.org/w/api.php?action=query&titles=Project:article_B&format=json&formatversion=2

The normalised title will be in /query/normalized/0/to. You can normalise several titles at once by separating them with |.

The example is taken from https://www.mediawiki.org/wiki/API:Query#Example_2:_Title_normalization.
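
Since the target language is Python, the same call could be made with the standard library along these lines. This is a minimal sketch, not a tested client: the helper names `build_query_url`, `extract_normalized`, and `fetch_normalized` and the User-Agent string are placeholders chosen for illustration, and the parsing assumes the response has the `query.normalized` list of `from`/`to` pairs described above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(titles, api_url=API_URL):
    """Build the action=query URL; several titles are joined with '|'."""
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": "2",
    }
    return f"{api_url}?{urlencode(params)}"


def extract_normalized(response):
    """Map each submitted title to its normalised form (query.normalized)."""
    entries = response.get("query", {}).get("normalized", [])
    return {entry["from"]: entry["to"] for entry in entries}


def fetch_normalized(titles):
    """Call the wiki API (network access required) and return the mapping."""
    # Placeholder User-Agent; replace it with one that identifies your tool.
    request = Request(
        build_query_url(titles),
        headers={"User-Agent": "title-check-sketch/0.1"},
    )
    with urlopen(request, timeout=10) as reply:
        return extract_normalized(json.load(reply))
```

Titles that are already in normalised form do not appear in the `normalized` list, so a title missing from the returned mapping was not changed by the wiki.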
