In Java, how to copy data from String to char[]/byte[] efficiently?
Question
I need to copy the contents of many large, distinct Strings into a static char array and use that array frequently in a performance-critical job, so it is important to avoid allocating much new memory.
For that reason, str.toCharArray() is ruled out, since it allocates a new array for every String.
As we all know, charAt(i) is slower and more cumbersome than indexing with square brackets [i], so I want to use a byte[] or char[].
The good news is that there is a str.getBytes(srcBegin, srcEnd, dst, dstBegin) method. The bad news is that it was (or is about to be?) deprecated.
So how can we accomplish this performance-critical task?
Answer 1
Score: 6
I believe you want getChars(int, int, char[], int). It copies the characters into the specified array, and I'd expect it to do so "as efficiently as reasonably possible".
You should avoid converting between text and binary representations unless you really need to. Aside from anything else, that conversion itself is likely to be time-consuming.
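A minimal sketch of that approach, assuming a single preallocated buffer that is reused for every string. The class name ReusableCharBuffer, the buffer capacity, and the count helper are hypothetical and only serve as illustration; the one real API call is String.getChars(srcBegin, srcEnd, dst, dstBegin).

```java
final class ReusableCharBuffer {
    // Hypothetical capacity: size it for the longest string you expect.
    private static final char[] BUFFER = new char[1 << 16];

    /** Copies s into BUFFER without allocating; returns the number of chars copied. */
    static int load(String s) {
        int len = s.length();
        if (len > BUFFER.length) {
            throw new IllegalArgumentException("string longer than buffer: " + len);
        }
        s.getChars(0, len, BUFFER, 0); // srcBegin, srcEnd, dst, dstBegin
        return len;
    }

    /** Example consumer: counts occurrences of c in the first len chars of BUFFER. */
    static int count(char c, int len) {
        int n = 0;
        for (int i = 0; i < len; i++) {
            if (BUFFER[i] == c) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int len = load("hello world");
        System.out.println(count('o', len)); // prints 2
    }
}
```

Because the buffer keeps stale characters from earlier, longer strings, callers have to carry the returned length around instead of relying on BUFFER.length.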
Answer 2
Score: 2
A small stocktaking:
String holds Unicode text; it can be normalized (java.text.Normalizer).
int[] of code points holds Unicode symbols.
char[] holds UTF-16 code units (2 bytes per char); a code point sometimes needs 2 chars: a surrogate pair.
byte[] is for binary data. Holding Unicode text as UTF-8 is relatively compact when much of it is ASCII or Latin-1.
Processing can be done on a ByteBuffer, CharBuffer, or IntBuffer.
When dealing with Asian scripts, int code points are probably the most practical choice. Otherwise bytes seem best.
Code points (or chars) also make sense when the Character class is used to classify Unicode blocks and scripts, digits in several scripts, emoji, and so on.
For performance, bytes are probably best, since they are often the most compact representation; probably UTF-8.
One cannot avoid the memory allocation efficiently. getBytes should be used with a Charset, and almost always some kind of conversion happens. Since newer Java versions can store a String internally as a byte array instead of a char array when its content fits an encoding like Latin-1 (ISO-8859-1), even getting at an internal char array would not help. And new arrays are created.
What one can do is use fast ByteBuffers.
Alternatively, for linguistic analysis one can use databases, maybe graph databases, or at least something that can exploit parallelism.
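One way to read the ByteBuffer suggestion is to encode each string into a single reused ByteBuffer with a CharsetEncoder, instead of calling getBytes, which returns a fresh array every time. The sketch below assumes UTF-8 and a fixed, hypothetical capacity; the class name ReusableUtf8Encoder is made up for illustration, and the encoder itself may still allocate small internal objects.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class ReusableUtf8Encoder {
    // Hypothetical capacity: must hold the longest encoded string.
    private final ByteBuffer out = ByteBuffer.allocate(1 << 16);
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

    /** Encodes s as UTF-8 into the reused buffer and returns it, flipped for reading. */
    ByteBuffer encode(String s) {
        out.clear();
        encoder.reset();
        // Wraps the string as a read-only view rather than copying it into a char[].
        CharBuffer in = CharBuffer.wrap(s);
        CoderResult result = encoder.encode(in, out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            // Overflow (buffer too small) or malformed input such as a lone surrogate.
            throw new IllegalStateException("encoding failed: " + result);
        }
        out.flip();
        return out;
    }

    public static void main(String[] args) {
        ReusableUtf8Encoder enc = new ReusableUtf8Encoder();
        System.out.println(enc.encode("abc").remaining());    // prints 3
        System.out.println(enc.encode("\u00e9").remaining()); // prints 2
    }
}
```

The returned buffer is overwritten by the next encode call, so each result has to be consumed before the next string is encoded.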
Answer 3
Score: 1
You are pretty much restricted to the APIs offered by the String class, and that deprecated method is supposed to be replaced with getBytes() (or an overload that lets you specify a charset).
In other words: the problem you describe, "having many large strings that need to go into arrays", can't be solved easily.
Thus a distinct non-answer: look into your design. If performance is really critical, then do not create that many large strings upfront!
In other words: if your measurements convince you that you do have a real performance issue, then adapt your design as needed. Maybe, at the place where your strings are "coming in", you can avoid String objects altogether and use something that works better for you later on, performance-wise.
But of course, that will lead to a complex, error-prone solution in which you do a lot of "memory management" yourself. Thus, as said: measure first. Make sure that you have a real problem, and that it actually sits where you think it sits.
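As one possible illustration of not creating String objects at the input side, the sketch below reads characters from a Reader straight into a reused char[]. It assumes the text arrives through a java.io.Reader, which the question does not state; the class name DirectCharReader and the buffer size are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class DirectCharReader {
    /** Fills buf from in until the buffer is full or the input ends; returns chars read. */
    static int fill(Reader in, char[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break; // end of input
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        char[] buf = new char[8192]; // hypothetical size, allocated once and reused
        // StringReader stands in for a real source such as a file or socket reader.
        try (Reader in = new StringReader("some input text")) {
            int len = fill(in, buf);
            System.out.println(len); // prints 15
        }
    }
}
```

Whether this beats String plus getChars depends on the workload, so it is worth measuring both before committing to the more manual design.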
Answer 4
Score: 0
str.getBytes(srcBegin, srcEnd, dst, dstBegin) is indeed deprecated. The relevant documentation recommends getBytes() instead. If you needed str.getBytes(srcBegin, srcEnd, dst, dstBegin) because you sometimes don't have to convert the entire string, I suppose you could call substring() first, but I'm not sure how badly that would affect your code's efficiency, if at all. Or, if storing the data in a char[] works just as well for you, you can use getChars(int, int, char[], int), which is not deprecated.
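The two options can be sketched side by side. The class and method names (RangeCopy, copyBytes, copyChars) are hypothetical, and UTF-8 is an assumed charset; the byte variant allocates a temporary substring and byte[] on every call, while the char variant writes directly into the destination.

```java
import java.nio.charset.StandardCharsets;

final class RangeCopy {
    /** Copies s[begin, end) into dst as UTF-8 bytes; allocates a substring and a byte[]. */
    static int copyBytes(String s, int begin, int end, byte[] dst, int dstBegin) {
        byte[] tmp = s.substring(begin, end).getBytes(StandardCharsets.UTF_8);
        System.arraycopy(tmp, 0, dst, dstBegin, tmp.length);
        return tmp.length;
    }

    /** Copies s[begin, end) into dst as chars with no intermediate array. */
    static int copyChars(String s, int begin, int end, char[] dst, int dstBegin) {
        s.getChars(begin, end, dst, dstBegin);
        return end - begin;
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[16];
        char[] chars = new char[16];
        System.out.println(copyBytes("hello world", 6, 11, bytes, 0)); // prints 5
        System.out.println(copyChars("hello world", 0, 5, chars, 0));  // prints 5
    }
}
```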