Java byte array replace all occurrences of byte-array/string
Question
Is there an already-implemented (not hand-written) way to replace all occurrences of a byte array/string inside a byte array? I need to create a byte array containing platform-dependent text (Linux: line feed; Windows: carriage return + line feed). I know such a task can be implemented manually, but I am looking for an out-of-the-box solution. Note that these byte arrays are large and that I process a large number of them, so the solution needs to perform well.
My current approach:
var byteArray = resourceLoader.getResource("classpath:File.txt").getInputStream().readAllBytes();
byteArray = new String(byteArray)
        .replaceAll((schemeModel.getOsType() == SystemTypes.LINUX) ? "\r\n" : "\n",
                    (schemeModel.getOsType() == SystemTypes.LINUX) ? "\n" : "\r\n")
        .getBytes(StandardCharsets.UTF_8);
This approach performs poorly because it creates new Strings and uses a regex to find occurrences. I know that a manual implementation would have to look at sequences of bytes, because the Windows line ending is two bytes long, and would therefore also have to reallocate the array whenever its length changes.
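For reference, the manual byte-level pass described above might look roughly like the sketch below. It is only an illustration, not a library method: the class name `LineEndingBytes` and the method `convert` are made up for this example, and it assumes an ASCII-compatible encoding such as UTF-8, where the bytes 0x0D and 0x0A never appear inside a multi-byte character. A `ByteArrayOutputStream` takes care of the reallocation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LineEndingBytes {

    /**
     * Rewrites every CRLF pair and every lone LF in input as the given newline bytes.
     * A lone CR that is not followed by LF is copied through unchanged.
     * Assumes an ASCII-compatible encoding such as UTF-8.
     */
    public static byte[] convert(byte[] input, byte[] newline) {
        // Reserve a little extra room in case LF grows to CRLF.
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length + input.length / 16);
        for (int i = 0; i < input.length; i++) {
            byte b = input[i];
            if (b == '\r' && i + 1 < input.length && input[i + 1] == '\n') {
                // CRLF: drop the CR; the LF handled on the next iteration emits the newline.
                continue;
            }
            if (b == '\n') {
                out.write(newline, 0, newline.length);
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] mixed = "a\r\nb\nc".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(convert(mixed, new byte[] {'\r', '\n'})));
        System.out.println(Arrays.toString(convert(mixed, new byte[] {'\n'})));
    }
}
```

The input is scanned once and no intermediate String is created, at the cost of one output buffer of roughly the input size.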
Apache Commons Lang contains ArrayUtils, which has the method byte[] removeAllOccurrences(byte[] array, byte element). Is there any third-party library with a similar method for replacing all occurrences of a byte array/string inside a byte array?
Edit: As @saka1029 mentioned in the comments, my approach doesn't work for the Windows OS type. Because of this bug I need to stick with regexes, as follows:
(schemeModel.getOsType() == SystemTypes.LINUX) ? "\\r\\n" : "[?:^\\r]\\n",
(schemeModel.getOsType() == SystemTypes.LINUX) ? "\n" : "\r\n")
This way, in the Windows case, only occurrences of '\n' without a preceding '\r' are found and replaced with '\r\n' (the regex is written to match a group at the '\n' rather than [^\r]\n directly, because otherwise the last character of each line would be consumed as well). Such a workflow cannot be implemented with conventional methods, which invalidates this question.
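One caveat about the pattern above: as far as I can tell, [?:^\r] is a character class (it matches a single '?', ':', '^' or '\r'), not a non-capturing group, so it would not match a lone '\n'. The "LF not preceded by CR" condition is more commonly written as a negative lookbehind. A minimal sketch follows; the class and method names are invented for illustration.

```java
public class LineEndingRegex {

    /** Converts each lone LF to CRLF; existing CRLF pairs are left untouched. */
    public static String toWindows(String text) {
        // (?<!\r) is a zero-width negative lookbehind: it inspects the previous
        // character without consuming it, so no letter is lost from the line.
        return text.replaceAll("(?<!\\r)\\n", "\r\n");
    }

    /** Converts CRLF to LF. String.replace is literal, so no regex is involved. */
    public static String toLinux(String text) {
        return text.replace("\r\n", "\n");
    }

    public static void main(String[] args) {
        String mixed = "one\ntwo\r\nthree\n";
        System.out.println(toWindows(mixed).equals("one\r\ntwo\r\nthree\r\n")); // true
        System.out.println(toLinux(mixed).equals("one\ntwo\nthree\n"));         // true
    }
}
```

This still goes through a String, so it addresses the correctness of the pattern, not the performance concern.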
Answer 1
Score: 1
If you’re reading text, you should treat it as text, not as bytes. Use a BufferedReader to read the lines one by one, and insert your own newline sequences.
String newline = schemeModel.getOsType() == SystemTypes.LINUX ? "\n" : "\r\n";
OutputStream out = /* ... */;
try (Writer writer = new BufferedWriter(
         new OutputStreamWriter(out, StandardCharsets.UTF_8));
     BufferedReader reader = new BufferedReader(
         new InputStreamReader(
             resourceLoader.getResource("classpath:File.txt").getInputStream(),
             StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        writer.write(line);
        writer.write(newline);
    }
}
No byte array is needed, and you use only a small amount of memory: just enough to hold the longest line encountered. (I rarely see text with a line longer than one kilobyte, but even one megabyte would be a pretty small memory requirement.)
If you are “fixing” zip entries, the OutputStream can be a ZipOutputStream pointing to a new ZipEntry:
String newline = schemeModel.getOsType() == SystemTypes.LINUX ? "\n" : "\r\n";
ZipInputStream oldZip = /* ... */;
ZipOutputStream newZip = /* ... */;
ZipEntry entry;
while ((entry = oldZip.getNextEntry()) != null) {
    // Start from a fresh entry: reusing the old one can carry over its size
    // and CRC, which no longer match once the line endings change.
    newZip.putNextEntry(new ZipEntry(entry.getName()));
    // We only want to fix line endings in text files.
    if (!entry.getName().matches(".*\\." +
            "(?i:txt|x?html?|xml|json|[ch]|cpp|cs|py|java|properties|jsp)")) {
        oldZip.transferTo(newZip);
        continue;
    }
    Writer writer = new BufferedWriter(
        new OutputStreamWriter(newZip, StandardCharsets.UTF_8));
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(oldZip, StandardCharsets.UTF_8));
    String line;
    while ((line = reader.readLine()) != null) {
        writer.write(line);
        writer.write(newline);
    }
    // Flush but do not close: closing the writer would also close newZip.
    writer.flush();
}
Some notes:
- Are you deliberately ignoring Macs (and other operating systems which are neither Windows nor Linux)? You should assume \n for everything except Windows. That is, schemeModel.getOsType() == SystemTypes.WINDOWS ? "\r\n" : "\n"
- Your code contains new String(byteArray), which assumes the bytes of your resource use the default Charset of the system on which your program is running. I suspect this is not what you intended; I have added StandardCharsets.UTF_8 to the construction of the InputStreamReader to address this. If you really meant to read the bytes using the default Charset, you can remove that second constructor argument.
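Putting the two notes above together, here is a small self-contained sketch of the line-by-line approach that can be run without Spring or a real file. The class `LineEndingStreams`, the method `normalize` and its boolean `windows` parameter are stand-ins invented for this example; in the real code the flag would come from something like schemeModel.getOsType() == SystemTypes.WINDOWS.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class LineEndingStreams {

    /** Copies UTF-8 text from in to out, ending every line with the chosen newline. */
    public static void normalize(InputStream in, OutputStream out, boolean windows)
            throws IOException {
        // Everything except Windows gets a plain line feed.
        String newline = windows ? "\r\n" : "\n";
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(in, StandardCharsets.UTF_8));
        Writer writer = new BufferedWriter(
            new OutputStreamWriter(out, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.write(newline);
        }
        // The caller owns both streams, so flush here instead of closing.
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        byte[] input = "first\nsecond\r\nthird".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        normalize(new ByteArrayInputStream(input), out, true);
        String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
        // Make the control characters visible when printing.
        System.out.println(result.replace("\r", "\\r").replace("\n", "\\n"));
        // prints: first\r\nsecond\r\nthird\r\n
    }
}
```

Two properties of readLine() are worth knowing here: it also treats a lone '\r' as a line terminator, and because the loop writes a newline after every line, the output always ends with one even when the input did not.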