Skip BOM using BOMInputStream and retrieve byte[] without BOM
Question
I have an XML file with a BOM (UTF-8 encoding). The file comes as a byte[]. I need to skip the BOM and later convert these bytes into a String.
This is what my code looks like now:
BOMInputStream bomInputStream = new BOMInputStream(new ByteArrayInputStream(requestDTO.getFile())); // getFile() returns byte[]
bomInputStream.skip(bomInputStream.hasBOM() ? bomInputStream.getBOM().length() : 0);
validationService.validate(new String(/*BYTE[] WITHOUT BOM*/)); // throws NullPointerException
I'm using BOMInputStream and have a couple of issues. The first is that bomInputStream.hasBOM() returns false. The second is that I'm not sure how to retrieve the byte[] from bomInputStream later on, because bomInputStream.getBOM().getBytes() throws a NullPointerException. Thanks for any help!
BOMInputStream documentation link:
https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/input/BOMInputStream.html
Answer 1
Score: 1
The constructor without the boolean include parameter excludes the BOM, so no BOM is passed through and the resulting String will not contain one. If hasBOM() returns false, then getBOM() returns null, and calling a method on that result throws the NullPointerException.
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.BOMInputStream;

byte[] xml = requestDTO.getFile();
int bomLength = 0;
Charset charset = StandardCharsets.UTF_8;
// include = true keeps the BOM in the stream.
try (BOMInputStream bommedInputStream = new BOMInputStream(new ByteArrayInputStream(xml), true)) {
    if (bommedInputStream.hasBOM()) {
        bomLength = bommedInputStream.getBOM().length();
        charset = Charset.forName(bommedInputStream.getBOMCharsetName());
    } else {
        // Handle <?xml ... encoding="..." ... ?>.
        String t = new String(xml, StandardCharsets.ISO_8859_1);
        // replaceFirst takes a regex; String.replace would treat the pattern as literal text.
        String enc = t.replaceFirst("(?sm).*<\\?xml.*\\bencoding=\"([^\"]+)\".*\\?>.*$", "$1");
        // ... or such to fill charset ...
    }
}
String s = new String(xml, charset).replaceFirst("^\uFEFF", ""); // Remove BOM.
validationService.validate(s);
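The else branch above only hints at reading the encoding from the XML declaration. A minimal sketch of that step, using only the JDK, might look like the following. The class name XmlDeclaredEncoding, the method declaredCharset, and the 200-byte prefix limit are assumptions made for illustration, not part of the original answer, and the approach only works for ASCII-compatible encodings (not UTF-16 without a BOM).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads the encoding pseudo-attribute of a leading XML declaration.
final class XmlDeclaredEncoding {
    // Matches encoding="..." or encoding='...' inside a <?xml ... ?> declaration at the start.
    private static final Pattern ENCODING = Pattern.compile(
            "^\\s*<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']");

    /** Returns the declared charset, or the fallback if none is declared or it is unsupported. */
    static Charset declaredCharset(byte[] xml, Charset fallback) {
        // ISO-8859-1 maps each byte to exactly one char, so an ASCII declaration survives decoding.
        // Assumption: the declaration fits in the first 200 bytes.
        int prefixLength = Math.min(xml.length, 200);
        String head = new String(xml, 0, prefixLength, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback.
                return fallback;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] undeclared = "<a/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(declaredCharset(declared, StandardCharsets.UTF_8));   // prints ISO-8859-1
        System.out.println(declaredCharset(undeclared, StandardCharsets.UTF_8)); // prints UTF-8
    }
}
```

Using a Matcher with a capturing group avoids the pitfall of replaceFirst, which returns the whole input unchanged when the pattern does not match.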
Alternatively, the BOM can be removed using bomLength, by skipping that many bytes before decoding. BOMInputStream can give us the charset for the many UTF variants.
The String constructor without an encoding/charset argument (as you used) uses the default platform encoding. As the BOM is the Unicode code point U+FEFF, you can simply pass "\uFEFF".
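For the UTF-8 case in the question, the byte-offset approach can also be sketched without Commons IO. The example below is only an illustration under that assumption: the class Utf8BomStripper and its methods are hypothetical names, and it handles the UTF-8 BOM (bytes EF BB BF) only, not the UTF-16 or UTF-32 variants.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: detects a UTF-8 BOM in a byte[] and decodes the remainder.
final class Utf8BomStripper {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Number of leading bytes occupied by a UTF-8 BOM: 3 if present, otherwise 0. */
    static int utf8BomLength(byte[] data) {
        if (data.length < UTF8_BOM.length) {
            return 0;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (data[i] != UTF8_BOM[i]) {
                return 0;
            }
        }
        return UTF8_BOM.length;
    }

    /** Decodes the bytes as UTF-8, skipping a leading BOM if there is one. */
    static String decodeWithoutBom(byte[] data) {
        int bomLength = utf8BomLength(data);
        // Explicit charset: never rely on the platform default encoding here.
        return new String(data, bomLength, data.length - bomLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        byte[] withoutBom = {'<', 'a', '/', '>'};
        System.out.println(decodeWithoutBom(withBom));    // prints <a/>
        System.out.println(decodeWithoutBom(withoutBom)); // prints <a/>
        // Same result via the code point: decode everything, then drop a leading U+FEFF.
        String viaCodePoint = new String(withBom, StandardCharsets.UTF_8).replaceFirst("^\uFEFF", "");
        System.out.println(viaCodePoint.equals(decodeWithoutBom(withBom))); // prints true
    }
}
```

Skipping bytes before decoding avoids creating an intermediate String, while the replaceFirst variant works for any charset in which the BOM decodes to U+FEFF.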