Weird Exception from SgmlReader
Question
I'm using SgmlReader to parse HTML files in C#. I'm using the sample code provided on their website:
    using (reader = File.OpenText(fileName))
    {
        try
        {
            xmlDoc = fromHTML(reader);
        }
        catch(Exception ex)
        {
            return ReturnedCode.ErrorOpeningHTMLFile;
        }
    }

    private XmlDocument fromHTML(TextReader reader)
    {
        Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = reader;
        // create document
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.Load(sgmlReader);
        return doc;
    }
The code ran for a long time without any issues. Recently, however, it started throwing the following exception at the doc.Load(sgmlReader) line:
> A valid UTF32 value is between 0x000000 and 0x10ffff, inclusive, and should not include surrogate codepoint values (0x00d800 ~ 0x00dfff).\r\nParameter name: utf32
I was able to narrow down the problem to the below content of the HTML file. If I try to parse a file containing the below code, the exception will be thrown.
    <html>
    <br>&#121669935008
    </html>
If I remove the ampersand in the second line, the code will work normally.
Any idea what's happening here and how I can fix it? I cannot simply remove all the ampersands in the files.
Answer 1
Score: 0
The & character is an escape character in XML, so every time & appears in your data you need to append its Unicode value to ensure that there are no XML parsing errors. You can do this by replacing every & with &#038;.
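Replacing every & would also rewrite entities that are already valid, such as &nbsp;. The exception message suggests that the parser is trying to turn the numeric reference &#121669935008 into a character, and that value is far above 0x10ffff. A narrower option, sketched below, is to escape only the numeric references whose value falls outside the range named in the exception. The class name HtmlEntitySanitizer and the method EscapeInvalidNumericReferences are hypothetical helpers for illustration, not part of SgmlReader, and the approach has not been checked against SgmlReader's source.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of SgmlReader): escapes the ampersand of any
// numeric character reference that cannot be converted to a valid character.
public static class HtmlEntitySanitizer
{
    // Matches decimal (&#123) or hexadecimal (&#x7B) numeric character
    // references. The optional trailing semicolon is left outside the match.
    private static readonly Regex NumericReference = new Regex(
        @"&#(?:[xX](?<hex>[0-9a-fA-F]+)|(?<dec>[0-9]+))",
        RegexOptions.Compiled);

    public static string EscapeInvalidNumericReferences(string html)
    {
        return NumericReference.Replace(html, match =>
        {
            long value;
            bool parsed = match.Groups["hex"].Success
                ? long.TryParse(match.Groups["hex"].Value,
                      NumberStyles.AllowHexSpecifier,
                      CultureInfo.InvariantCulture, out value)
                : long.TryParse(match.Groups["dec"].Value,
                      NumberStyles.None,
                      CultureInfo.InvariantCulture, out value);

            // Range taken from the exception text: 0x000000 to 0x10ffff,
            // excluding surrogates 0xd800 to 0xdfff. A failed parse means the
            // number overflowed long; a negative value means a 16-digit hex
            // number wrapped around. Both are out of range.
            bool isSurrogate = value >= 0xD800 && value <= 0xDFFF;
            bool valid = parsed && value >= 0 && value <= 0x10FFFF && !isSurrogate;

            // Valid references are kept as-is; otherwise only the leading
            // ampersand is escaped so the text survives as literal characters.
            return valid ? match.Value : "&amp;" + match.Value.Substring(1);
        });
    }
}
```

To use it with the code from the question, one could read the file with File.ReadAllText(fileName), pass the result through EscapeInvalidNumericReferences, and hand a StringReader over the cleaned text to fromHTML instead of the reader returned by File.OpenText.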