Java InputStream reading the character \" instead of showing it as " for an HTML content file

Question


I want to read a file that holds HTML content in escaped string form.

The file content looks like this:

<table class=\"relative-table\" style

But when I inspect the parsed result in Java, it shows up as:

<table class="\"relative-table\"" style=

What I expected was:

<table class="relative-table" style

Below is my Java code:



    File file = new File("C:\\Users\\table.xml");
    Document doc;
    try {
        InputStream stream = new FileInputStream(file);
        doc = Jsoup.parse(stream, null, "UTF-8", Parser.xmlParser());
    } catch (IOException e) {
        e.printStackTrace();
    }

Sample source file

<table class=\"relative-table\" style=\"width: 100.0%;\">
  <colgroup>
    <col style=\"width: 10%;\" />
    <col style=\"width: 20%;\" />
    <col style=\"width: 70%;\" />
  </colgroup>
  <tbody>
    <tr>
   ........

Answer 1

Score: 1


The problem seems to be that those backslashes do not belong in the file content. (In a Java string literal "... \" ..." the backslash-quote pair simply represents the quote character.) As a result, the backslash and quote are seen as part of an unquoted HTML attribute value, and the embedded quote is then "repaired" as the HTML/XML entity &quot;.
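To make the difference between source-code escaping and file content concrete, here is a minimal sketch using only the standard library. The class and method names (EscapeDemo, literalAsCompiled, textAsReadFromFile) are made up for illustration; the second string merely imitates what reading such a file would put in memory.

```java
class EscapeDemo {
    // A literal written as "class=\"a\"" in source code:
    // the compiler stores only the quote, no backslash.
    static String literalAsCompiled() {
        return "class=\"a\"";
    }

    // Imitates text read from a file that really contains backslashes.
    // Each doubled backslash below is Java notation for ONE backslash character.
    static String textAsReadFromFile() {
        return "class=\\\"a\\\"";
    }

    public static void main(String[] args) {
        System.out.println(literalAsCompiled());            // class="a"
        System.out.println(textAsReadFromFile());           // class=\"a\"
        System.out.println(literalAsCompiled().length());   // 9
        System.out.println(textAsReadFromFile().length());  // 11
    }
}
```

The two extra characters in the second string are the backslashes that the parser later treats as part of the attribute value.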

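    // Needs: java.io.ByteArrayInputStream, java.io.IOException, java.nio.charset.StandardCharsets,
    //        java.nio.file.*, org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.parser.Parser
    Path file = Paths.get("C:\\Users\\table.xml");
    String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    // Turn every backslash-quote pair (\") found in the file into a plain quote (").
    content = content.replace("\\\"", "\"");
    ByteArrayInputStream bais = new ByteArrayInputStream(
            content.getBytes(StandardCharsets.UTF_8));

    Document doc;
    try {
        // Argument order is (stream, charsetName, baseUri, parser).
        doc = Jsoup.parse(bais, "UTF-8", "", Parser.xmlParser());
    } catch (IOException e) {
        e.printStackTrace();
    }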

This ugly patch has one flaw: one cannot be sure that the escaped quotes are the only thing in the file that needs fixing.
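If the file turns out to be JSON-style escaped text (an assumption, since the question does not say how the file was produced), a slightly more careful variant could undo the common escapes in a single pass. The helper below is a hypothetical sketch, not part of jsoup or the JDK, and it leaves unknown escapes untouched rather than guessing.

```java
class Unescape {
    /**
     * Hypothetical helper: undoes JSON-style escapes (\" \\ \/ \n \r \t) in one pass.
     * Unknown escapes and a trailing lone backslash are kept verbatim.
     */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 == s.length()) {
                out.append(c);          // ordinary character, or backslash at the very end
                continue;
            }
            char next = s.charAt(++i);  // character following the backslash
            switch (next) {
                case '"':  out.append('"');  break;
                case '\\': out.append('\\'); break;
                case '/':  out.append('/');  break;
                case 'n':  out.append('\n'); break;
                case 'r':  out.append('\r'); break;
                case 't':  out.append('\t'); break;
                default:   out.append('\\').append(next); // unknown escape: keep as-is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // In memory this is: <col style=\"width: 10%;\" />
        String raw = "<col style=\\\"width: 10%;\\\" />";
        System.out.println(unescape(raw)); // <col style="width: 10%;" />
    }
}
```

Working in one pass avoids the ordering problem that chained replace calls can have, where an escaped backslash followed by a quote is unescaped twice. The result could then be handed to the parser in place of the content string above.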

huangapple
  • Posted on April 6, 2020 at 20:24:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/61059803.html