Java InputStream reading the character \" instead of showing it as " for an HTML content file
Question
I want to read a file that contains HTML content stored in an escaped string format.
The file content looks like this:
<table class=\"relative-table\" style
But when I inspect the parsed result in Java, it shows up as:
<table class="\&quot;relative-table\&quot;" style=
What I expected was:
<table class="relative-table" style
Below is my Java code:
File file = new File("C:\\Users\\table.xml");
Document doc;
try {
    InputStream stream = new FileInputStream(file);
    doc = Jsoup.parse(stream, null, "UTF-8", Parser.xmlParser());
} catch (IOException e) {
    e.printStackTrace();
}
Sample source file:
<table class=\"relative-table\" style=\"width: 100.0%;\">
<colgroup>
<col style=\"width: 10%;\" />
<col style=\"width: 20%;\" />
<col style=\"width: 70%;\" />
</colgroup>
<tbody>
<tr>
........
Answer 1
Score: 1
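The problem seems to be that those backslashes do not belong in the file content. (In a Java string literal such as "... \" ...", backslash plus quote simply represents the quote character; in a file, the backslash is an ordinary character.) Hence the parser sees each quote as part of an unquoted HTML attribute value and "repairs" it on output as the HTML/XML entity &quot;.
A workaround is to strip the backslashes before parsing: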
Path file = Paths.get("C:\\Users\\table.xml");
Document doc = null;
try {
    // Read the whole file as UTF-8 text.
    String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    // Turn each escaped quote (\") back into a plain quote (").
    content = content.replace("\\\"", "\"");
    // Parse the cleaned text as XML; the second argument is the base URI.
    doc = Jsoup.parse(content, "", Parser.xmlParser());
} catch (IOException e) {
    e.printStackTrace();
}
This ugly patch has one flaw: one cannot be sure that the quotes are the only thing affected. Other characters in the file may have been escaped in the same way.
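If other escapes do turn up, a single left-to-right pass is safer than chaining several replace calls, because an escaped backslash followed by a quote would otherwise be handled wrongly. The following is only a sketch: the class name EscapedHtmlCleaner, the method unescape, and the set of escapes it resolves (\" \' \\ \/ \n \r \t) are assumptions for illustration, not something known about the actual file.

```java
public class EscapedHtmlCleaner {

    /**
     * Resolves backslash escapes in text that was stored like a string literal.
     * Assumed set of escapes: \" \' \\ \/ \n \r \t.
     * Any other backslash sequence is copied through unchanged.
     */
    public static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Ordinary character, or a trailing backslash with nothing after it.
            if (c != '\\' || i + 1 == s.length()) {
                out.append(c);
                continue;
            }
            char next = s.charAt(++i);
            switch (next) {
                case '"':  out.append('"');  break;
                case '\'': out.append('\''); break;
                case '\\': out.append('\\'); break;
                case '/':  out.append('/');  break;
                case 'n':  out.append('\n'); break;
                case 'r':  out.append('\r'); break;
                case 't':  out.append('\t'); break;
                // Unknown escape: keep both characters as they were.
                default:   out.append('\\').append(next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // In Java source, \\\" denotes the two characters backslash + quote,
        // which is what the file actually contains.
        String raw = "<table class=\\\"relative-table\\\" style=\\\"width: 100.0%;\\\">";
        System.out.println(unescape(raw));
    }
}
```

The cleaned string can then be handed to Jsoup.parse(content, "", Parser.xmlParser()) in place of the single replace call. Whether the extra escapes are needed depends on how the file was produced, so it is worth inspecting a few samples first.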