Skip BOM using BOMInputStream and retrieve byte[] without BOM

Question

I have an XML file with a BOM (UTF-8 encoding). The file arrives as a byte[]. I need to skip the BOM and then convert the remaining bytes into a String.

This is what my code looks like now:

BOMInputStream bomInputStream = new BOMInputStream(new ByteArrayInputStream(requestDTO.getFile())); // getFile() returns byte[]

bomInputStream.skip(bomInputStream.hasBOM() ? bomInputStream.getBOM().length() : 0);

validationService.validate(new String(/*BYTE[] WITHOUT BOM*/)); // throws NullPointerException

I'm using BOMInputStream and have a couple of issues. The first is that bomInputStream.hasBOM() returns false. The second is that I'm not sure how to retrieve the byte[] from bomInputStream afterwards, because bomInputStream.getBOM().getBytes() throws a NullPointerException. Thanks for any help!

BOMInputStream documentation link:
https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/input/BOMInputStream.html

Answer 1

Score: 1

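The constructor without the boolean include parameter excludes the BOM, so the bytes read from the stream, and any String built from them, will not contain it.
When no BOM is detected, hasBOM() returns false and getBOM() returns null, so calling a method on the result of getBOM() throws a NullPointerException.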

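import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.input.BOMInputStream;

byte[] xml = requestDTO.getFile();
int bomLength = 0;
Charset charset = StandardCharsets.UTF_8;
// hasBOM(), getBOM() and close() can throw IOException; the caller must handle or declare it.
try (BOMInputStream bommedInputStream = new BOMInputStream(new ByteArrayInputStream(xml),
            true)) {
    if (bommedInputStream.hasBOM()) {
        bomLength = bommedInputStream.getBOM().length();
        charset = Charset.forName(bommedInputStream.getBOMCharsetName());
    } else {
        // Handle <?xml ... encoding="..." ... ?>.
        // ISO-8859-1 maps every byte to a char, so the ASCII declaration stays readable.
        String t = new String(xml, StandardCharsets.ISO_8859_1);
        Matcher m = Pattern.compile("<\\?xml[^>]*\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']").matcher(t);
        if (m.find()) {
            charset = Charset.forName(m.group(1));
        }
    }
}
String s = new String(xml, charset).replaceFirst("^\uFEFF", ""); // Remove BOM.
validationService.validate(s);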

Removing the BOM could be done using the bomLength. BOMInputStream can give us the charset for the many UTF variants.
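
To illustrate the byte-level route, here is a minimal sketch that uses only the JDK and assumes the data is UTF-8. BomStripper, utf8BomLength and stripUtf8Bom are hypothetical names made up for this example; they are not part of Commons IO. The length returned by utf8BomLength plays the same role as bomLength above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Commons IO.
final class BomStripper {
    // The three bytes that encode U+FEFF in UTF-8.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns the length of a leading UTF-8 BOM, or 0 if there is none. */
    static int utf8BomLength(byte[] data) {
        if (data.length < UTF8_BOM.length) {
            return 0;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (data[i] != UTF8_BOM[i]) {
                return 0;
            }
        }
        return UTF8_BOM.length;
    }

    /** Returns a copy of data without the leading UTF-8 BOM, if present. */
    static byte[] stripUtf8Bom(byte[] data) {
        return Arrays.copyOfRange(data, utf8BomLength(data), data.length);
    }

    public static void main(String[] args) {
        byte[] xml = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        byte[] clean = stripUtf8Bom(xml);
        System.out.println(new String(clean, StandardCharsets.UTF_8)); // prints <a/>
    }
}
```

This yields a byte[] without the BOM, which can then be decoded with an explicit charset. Other encodings (UTF-16, UTF-32) have different BOM byte sequences and would need their own checks.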

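The String constructor without an encoding/charset argument (as you used) falls back to the default platform encoding, so pass the charset explicitly. As the BOM is the Unicode code point U+FEFF, you can remove it from the decoded String by matching "\uFEFF" at the start, as in the replaceFirst call above.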
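
The String-level approach can be sketched as follows, again with only the JDK and assuming UTF-8 input. BomDecodeDemo and decodeWithoutBom are hypothetical names used for illustration. The sketch relies on the fact that Java's UTF-8 decoder keeps a leading BOM as the character U+FEFF instead of dropping it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
final class BomDecodeDemo {
    /** Decodes UTF-8 bytes and drops a leading U+FEFF if the decoder kept one. */
    static String decodeWithoutBom(byte[] data) {
        // Explicit charset: the result does not depend on the platform default.
        String s = new String(data, StandardCharsets.UTF_8);
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] xml = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        String raw = new String(xml, StandardCharsets.UTF_8);
        System.out.println(raw.length());                   // prints 5: BOM character plus "<a/>"
        System.out.println(decodeWithoutBom(xml).length()); // prints 4
    }
}
```

Using startsWith and substring avoids a regular expression; it should behave the same as replaceFirst("^\uFEFF", "") for this purpose.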

huangapple
  • Posted on 2020-10-07 21:11:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64244720.html