Question

I'm using SgmlReader to parse HTML files in C#, with the sample code provided on the project's website:

using (reader = File.OpenText(fileName))
{
    try
    {
        xmlDoc = fromHTML(reader);
    }
    catch(Exception ex)
    {
        return ReturnedCode.ErrorOpeningHTMLFile;
    }
}

private XmlDocument fromHTML(TextReader reader)
{
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;
    //  create document
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.Load(sgmlReader);
    return doc;
}

The code has been running for a long time without any issue. However, it recently started throwing the following exception at the doc.Load(sgmlReader) line:

> A valid UTF32 value is between 0x000000 and 0x10ffff, inclusive, and should not include surrogate codepoint values (0x00d800 ~ 0x00dfff).\r\nParameter name: utf32

I was able to narrow the problem down to the following content of the HTML file. If I try to parse a file containing this markup, the exception is thrown.

<html>
<br>&#121669935008
</html>

If I remove the ampersand in the second line, the code will work normally.

Any idea what's happening here and how I can fix it? I cannot simply remove all the ampersands in the files.

Answer 1

Score: 0

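The & character is an escape character in XML: it introduces an entity or character reference. In your file, &#121669935008 is therefore read as a numeric character reference, and 121669935008 is far above the largest valid code point (0x10ffff), which matches the range named in the exception message. To avoid the parsing error, every & that is meant as literal data has to be escaped. One way to do this is to replace each such & with its own character reference, &#038; (or the equivalent &amp;).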
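
Replacing every & would also rewrite references that are already valid (for example &nbsp; or &#233;), so a narrower pre-processing step may be preferable. The following is a minimal sketch and not part of SgmlReader: the class name HtmlAmpersandFixer and the method EscapeInvalidNumericReferences are hypothetical, and it assumes the only problematic input is numeric references whose value falls outside the range named in the exception message. It escapes the ampersand of those references and leaves everything else untouched.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of SgmlReader): escapes the '&' of numeric
// character references that do not name a valid Unicode code point.
public static class HtmlAmpersandFixer
{
    // "&#" followed by decimal digits, or "&#x" followed by hex digits.
    // A trailing ';' is not part of the match, so it is left in place.
    private static readonly Regex NumericReference =
        new Regex(@"&#(?:[xX](?<hex>[0-9a-fA-F]+)|(?<dec>[0-9]+))");

    public static string EscapeInvalidNumericReferences(string html)
    {
        return NumericReference.Replace(html, match =>
            IsValidCodePoint(match)
                ? match.Value                           // keep valid references such as &#233;
                : "&amp;" + match.Value.Substring(1));  // escape only the ampersand
    }

    private static bool IsValidCodePoint(Match match)
    {
        Group hex = match.Groups["hex"];
        string digits = hex.Success ? hex.Value : match.Groups["dec"].Value;
        NumberStyles style = hex.Success ? NumberStyles.AllowHexSpecifier : NumberStyles.None;

        long value;
        if (!long.TryParse(digits, style, CultureInfo.InvariantCulture, out value))
        {
            return false; // too many digits to fit in a long, so certainly out of range
        }

        // Same range as in the exception message: 0x000000..0x10ffff, excluding
        // surrogates (0xd800..0xdfff). The value >= 0 check guards against hex
        // input that parses to a negative number.
        return value >= 0 && value <= 0x10FFFF && (value < 0xD800 || value > 0xDFFF);
    }
}
```

With the code from the question, one option is to read the file with File.ReadAllText, pass the text through EscapeInvalidNumericReferences, and hand the result to fromHTML wrapped in a StringReader. For the sample file, &#121669935008 becomes &amp;#121669935008, which the parser should then treat as literal text.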
