Weird Exception from SgmlReader

Question

I'm using SgmlReader to parse HTML files in C#. I'm using the sample code provided on their website:

using (reader = File.OpenText(fileName))
{
    try
    {
        xmlDoc = fromHTML(reader);
    }
    catch(Exception ex)
    {
        return ReturnedCode.ErrorOpeningHTMLFile;
    }
}

private XmlDocument fromHTML(TextReader reader)
{
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;
    // create document
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.Load(sgmlReader);
    return doc;
}

The code had been running for a long time without any issue. However, it recently started throwing the following exception at the doc.Load(sgmlReader) line:

> A valid UTF32 value is between 0x000000 and 0x10ffff, inclusive, and should not include surrogate codepoint values (0x00d800 ~ 0x00dfff).\r\nParameter name: utf32

I was able to narrow down the problem to the below content of the HTML file. If I try to parse a file containing the below code, the exception will be thrown.

<html>
<br>&#121669935008
</html>

If I remove the ampersand in the second line, the code will work normally.

Any idea what's happening here and how I can fix it? I cannot simply remove all the ampersands in the files.

Answer 1

Score: 0

The & character starts an entity or character reference in XML, so a bare & in your data has to be escaped to avoid XML parsing errors. In the sample, &#121669935008 appears to be read as a numeric character reference, and 121669935008 is well above the 0x10ffff limit named in the exception. One way to do the escaping is to replace every & with its numeric reference, &#038;.
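
Replacing every & would also rewrite legitimate references such as &nbsp; or &#169;, so a narrower option is to escape only the numeric references whose value is out of range. The following is a minimal sketch under that assumption, not part of SgmlReader's API: NumericReferenceSanitizer and EscapeInvalid are hypothetical names, and the check covers only the range named in the exception message.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of SgmlReader.
public static class NumericReferenceSanitizer
{
    // Matches decimal (&#123) and hexadecimal (&#x1F) numeric character
    // references, with or without the trailing semicolon.
    private static readonly Regex NumericReference = new Regex(
        @"&#(?:[xX](?<hex>[0-9a-fA-F]+)|(?<dec>[0-9]+));?",
        RegexOptions.Compiled);

    // Escapes the leading '&' of every numeric reference that does not name
    // a valid code point, so the parser sees it as plain text.
    public static string EscapeInvalid(string html)
    {
        return NumericReference.Replace(html, match =>
            IsValidCodePoint(match)
                ? match.Value
                : "&amp;" + match.Value.Substring(1));
    }

    private static bool IsValidCodePoint(Match match)
    {
        long value;
        bool parsed = match.Groups["hex"].Success
            ? long.TryParse(match.Groups["hex"].Value,
                NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out value)
            : long.TryParse(match.Groups["dec"].Value,
                NumberStyles.None, CultureInfo.InvariantCulture, out value);

        // Values too large to parse are treated as invalid. The range check
        // mirrors the exception text: at most 0x10FFFF, no surrogates.
        return parsed
            && value > 0
            && value <= 0x10FFFF
            && (value < 0xD800 || value > 0xDFFF);
    }
}
```

With this in place, the file could be read as text, sanitized, and handed to fromHTML through a StringReader, for example fromHTML(new StringReader(NumericReferenceSanitizer.EscapeInvalid(File.ReadAllText(fileName)))). Valid references pass through unchanged, so only the offending sequence becomes literal text.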

Posted on February 24, 2023 (https://go.coder-hub.com/75548743.html)