Java 8, memory wasted by duplicate strings

Question

I'm investigating a memory leak in my Grails 3.3.10 server, which runs on a Java 8 JVM. I took a heap dump from a production server that was running low on memory and analysed it with JXRay. The HTML report says that some memory is wasted on duplicate strings, with 19.6% overhead. Most of it is wasted on duplicates of the empty string "", and it mostly comes from database reads. I have two questions about this.

  1. Should I start interning strings, or is that too costly an operation to be worth it?

  2. Quite a bit of my code deals with deeply nested JSON structures from Elasticsearch. I didn't like the fragility of that code, so I made a small helper class to avoid typos when accessing data from the JSON.

public static final class S {
    public static final String author      = "author";
    public static final String aspectRatio = "aspectRatio";
    public static final String userId      = "userId";
    // ... etc etc
}

That helps me avoid typos like so:

    Integer userId = json.get("userid"); // Notice the lower case i. This returns null and fails silently
    Integer userId = json.get(S.userId); // If I make a typo here the compiler will tell me.

I was reasonably happy with this, but now I'm second-guessing myself. Is this a bad idea for some reason? I haven't seen anyone else do this. It shouldn't cause any duplicate strings to be created, because the strings are created once and then referenced in my parsing code, right?

Answer 1

Score: 4

The problem with a String-holding class is that you are using the language against its design.

Classes are supposed to introduce types. A type that provides no utility, because it is an "everything that can be said with a string" type, is rarely useful. While this pattern does occur in many programs, those uses typically introduce more behavior than "all the stuff is here." For example, locale databases provide replacement strings for different languages.

I'd start by carving out the sensible enumerations. Error messages might easily be converted into enums, which have easy auto-convert string representations. That way you get your "typo detection" and a classification built-in.

 DiskErrors.DISK_NOT_FOUND
 Prompts.ASK_USER_NAME
 Prompts.ASK_USER_PASSWORD
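
Applied to the JSON keys from the question, the enum suggestion might look roughly like the sketch below. The enum name `JsonKey`, its constants, and the `get` helper are hypothetical, and the parsed JSON is assumed to be available as a plain `Map<String, Object>`; the real Elasticsearch client types may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical enum of JSON field names; each constant carries the exact key text.
enum JsonKey {
    AUTHOR("author"),
    ASPECT_RATIO("aspectRatio"),
    USER_ID("userId");

    private final String key;

    JsonKey(String key) {
        this.key = key;
    }

    /** The key exactly as it appears in the JSON document. */
    String key() {
        return key;
    }
}

final class EnumKeysDemo {
    /** Looks up a field in an already-parsed JSON object represented as a Map. */
    static Object get(Map<String, Object> json, JsonKey field) {
        return json.get(field.key());
    }

    public static void main(String[] args) {
        Map<String, Object> json = new HashMap<>();
        json.put("userId", 42);
        // Misspelling JsonKey.USER_ID here would be a compile error, not a silent null.
        System.out.println(get(json, JsonKey.USER_ID));
    }
}
```

The key text still lives in exactly one place, so a typo inside the enum itself is not caught by the compiler; only misspelled references are.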

The side-effect of changes like this can hit your desired goal; but beware, these kinds of changes often signal the loss of readability.

Readability isn't what you think is easy to read; it's what a person who has never used the code would think is easy to read.

If I were to see a problem with "Your selected hard drive was not found", then I'd look through the code base for the string "Your selected hard drive was not found". That could land me in one of three places:

  1. In the block of code where the error message was raised.
  2. In a table mapping that string to a name.
  3. In many blocks of code where the same error message is raised.

With the table mapping, I can then do a second search, searching for where the name is used. That can land me with a few scenarios:

  1. It is used in one place.
  2. It is used in many places.

With one place, a code maintenance problem arises. You now have a constant that no other part of the code uses, maintained in a place that is not near where it is used. This means that to make any change that requires a full understanding of the impact, someone has to keep the remote constant's value in mind to know whether the logical change should be combined with an updated error message. It's not the updating of the error message that causes the extra chance for error; it's the fact that the message is removed from the code being worked on.

With multiple places, I have to cycle through all of the matches, which is basically the same effort as the multiple string matches in the first step. So the table doesn't help me find the source of the error; it just adds extra steps that are not relevant to fixing the issue.

Now the table does have a distinct benefit in one scenario: when all the messages for a specific kind of issue should be updated at the same time. The problem is that such a scenario is rare. What is more likely is that an error message is not specific enough for one scenario but, after another "scan of all the places it is used", turns out to be correct for the others. So the error message is split instead of updated in place, because the coupling enforced by the lookup table means one cannot modify some of the error messages without creating a new error message.

Problems like this come from developers slipping in features that appeal to developers.
In your case, you're building in an anti-typo system. Let me offer a better solution; because typos are real, and a real problem too.

Write a unit test to capture the expected output. It is rare that you will write the same typo twice, exactly the same way. Yes, it is possible, but coordinated typos will impact both systems the same. If you introduce a spelling error in your lookup table, and introduce it in the usage, the benefit would be a working program, but it would be hard to call it a quality solution (because the typos weren't protected against and are there in duplicate).
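
A framework-free sketch of such a test is shown below. The class `UserIdExtractor`, its method, and the fixture values are hypothetical stand-ins for the real parsing code; in a real project this would more likely be a JUnit or Spock test. The point is that the fixture key is typed independently of the production code, so a typo on either side makes the check fail.

```java
import java.util.HashMap;
import java.util.Map;

final class UserIdExtractor {
    /** Code under test: reads the user id from a parsed JSON object. */
    static Integer extractUserId(Map<String, Object> json) {
        Object value = json.get("userId");
        return (value instanceof Integer) ? (Integer) value : null;
    }
}

final class UserIdExtractorTest {
    /** Fails loudly if the key used by the extractor does not match the fixture. */
    static void extractsUserIdFromFixture() {
        Map<String, Object> fixture = new HashMap<>();
        fixture.put("userId", 42); // written by hand, independently of the production code
        Integer actual = UserIdExtractor.extractUserId(fixture);
        if (!Integer.valueOf(42).equals(actual)) {
            throw new AssertionError("expected 42 but got " + actual);
        }
    }

    public static void main(String[] args) {
        extractsUserIdFromFixture();
        System.out.println("all checks passed");
    }
}
```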

Have your code reviewed before submitting it to a build system. Reviews can get out of hand, especially with inflexible reviewers, but a good review should comment on "you spelled this wrong." If possible, review the code as a team, so you can explain your ideas as the others make their comments. If you have difficulty working with people (or they have difficulty working with people), you will find peer review hard. I'm sorry if that happens, but if you can get a good peer review, it's the second "best" defense against these issues.

Sorry for the length of this reply, but I hope this gives you a chance to remember to "step back" from a solution and see how it impacts your future actions with the code.

And as for the "" String, focusing on why it is being set would probably be more effective in building a better product than patching the issue with interning (but I don't have access to your code base, so I might be wrong!)

Good luck

Answer 2

Score: 2

> Q1: Should I start interning strings or is it too costly of an operation to be worth it?

It is hard to say without more information about how the strings are being created and their typical lifetime, but the general answer is No. It is generally not worth it.

(And interning won't fix your memory leak.)

Here are some of the reasons (a bit hand-wavy, I'm afraid):

  • Interning a String doesn't prevent the string you are interning from being created. Your code still needs to create it and the GC still needs to collect it.

  • There is a hidden data structure that organizes the interned strings. That uses memory. It also costs CPU to check to see if a string is in the interning data structure and add it if needed.

  • The GC needs to do special (weak reference like) things with the interning data structure to prevent it from leaking. That is an overhead.

  • An interned string tends to live longer than a non-interned string. It is more likely to be tenured to the "old" heap, which extends its lifetime even further ... because the "old" heap is GC'ed less often.
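
The first point in the list above can be seen in a small sketch. The class and method names are made up for illustration; `String.intern()` itself is the standard JDK method.

```java
final class InternDemo {
    /**
     * Returns true when intern() hands back the pooled object rather than the
     * freshly allocated one, which shows the fresh object was still created.
     */
    static boolean internReturnsPooledObject() {
        String literal = "author";                        // literals are already in the string pool
        String fresh = new String(literal.toCharArray()); // a separate String object on the heap
        String canonical = fresh.intern();                // pool lookup; 'fresh' now becomes garbage
        return canonical == literal && canonical != fresh;
    }

    public static void main(String[] args) {
        System.out.println(internReturnsPooledObject());
    }
}
```

Interning only changes which object you keep a reference to afterwards. The allocation and the later collection of the duplicate still happen.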

If you are using the G1 collector AND the duplicate strings are typically long lived, you might want to try enabling G1GC string deduplication (see here). Otherwise, you are probably better off just letting the GC deal with the strings. The Java GCs are designed to deal efficiently with lots of objects (such as strings) being created and thrown away soon after.
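
On Java 8 (update 20 or later), deduplication is requested by starting the JVM with `-XX:+UseG1GC -XX:+UseStringDeduplication`. The sketch below only checks whether those flags were passed to the running JVM; the class and helper names are hypothetical, and it assumes the flags were given directly as JVM arguments.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class DedupFlagCheck {
    static final String G1_FLAG = "-XX:+UseG1GC";
    static final String DEDUP_FLAG = "-XX:+UseStringDeduplication";

    /** True if the exact flag text appears among the given JVM arguments. */
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments the JVM was started with, excluding the arguments to main().
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("G1 requested: " + hasFlag(jvmArgs, G1_FLAG));
        System.out.println("String deduplication requested: " + hasFlag(jvmArgs, DEDUP_FLAG));
    }
}
```

Deduplication shares the backing character arrays of equal strings; the String objects themselves remain distinct, so a heap analyser may still report them separately.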

If it is your code that is creating the Java strings, then it might be worth tweaking it to avoid creating new zero length strings. Manually interning the zero length strings as per @ControlAltDel's comment is probably not worth the effort.
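
If the empty strings are built by your own code, avoiding the allocation could be as small as the sketch below. The helper `fromChars` is hypothetical; whether anything like it applies depends on where the database layer actually constructs its strings, which the question does not show.

```java
final class EmptyStrings {
    /**
     * Builds a String from part of a char buffer, but returns the shared ""
     * literal instead of allocating a new zero-length String.
     */
    static String fromChars(char[] buffer, int offset, int length) {
        return length == 0 ? "" : new String(buffer, offset, length);
    }

    public static void main(String[] args) {
        char[] buffer = "author".toCharArray();
        System.out.println(fromChars(buffer, 0, 0) == ""); // same object as the literal
        System.out.println(fromChars(buffer, 0, 6));
    }
}
```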

Finally, if you are going to try to reduce the duplication one way or another, I would strongly advise that you set things up so that you can measure the effects of the optimization:

  • Do you actually save memory?
  • Does this affect the rate of GC runs?
  • Does this affect GC pauses?
  • Does it affect request times / throughput?

If the measurements say that the optimization hasn't helped, you need to back it out.
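
One possible way to collect the first three numbers from inside the JVM is the standard `java.lang.management` API, as in the sketch below. The `GcSnapshot` class is hypothetical; taking a snapshot before and after a load test, with and without the optimization, gives rough figures to compare. GC logs or an external monitoring tool would serve the same purpose.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcSnapshot {
    final long usedHeapBytes; // heap in use at the moment of the snapshot
    final long gcCount;       // collections so far, summed over all collectors
    final long gcTimeMillis;  // approximate accumulated collection time

    private GcSnapshot(long usedHeapBytes, long gcCount, long gcTimeMillis) {
        this.usedHeapBytes = usedHeapBytes;
        this.gcCount = gcCount;
        this.gcTimeMillis = gcTimeMillis;
    }

    static GcSnapshot take() {
        long used = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        long count = 0;
        long time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            count += Math.max(0, gc.getCollectionCount());
            time += Math.max(0, gc.getCollectionTime());
        }
        return new GcSnapshot(used, count, time);
    }

    public static void main(String[] args) {
        GcSnapshot s = take();
        System.out.println("used heap bytes: " + s.usedHeapBytes);
        System.out.println("GC runs: " + s.gcCount + ", GC time ms: " + s.gcTimeMillis);
    }
}
```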


> Q2: Is this a bad idea for some reason? That shouldn't cause any duplicate strings to be created because they are created once and then referenced in my parsing code, right?

I can't think of any reason not to do that. It certainly doesn't lead directly to the creation of duplicate strings.

On the other hand, you won't reduce string duplication simply by doing that. String objects that represent literals get interned automatically.
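
A short sketch of what that means for the helper class from the question. The trimmed-down `S` and the demo class are illustrative, and the "parsed" key is simulated with `new String(...)` because the real JSON parser is not shown.

```java
final class S {
    static final String author = "author";
    static final String userId = "userId";
}

final class LiteralPoolDemo {
    /** The constant and an equal literal elsewhere are the same pooled object. */
    static boolean constantIsPooledLiteral() {
        return S.userId == "userId";
    }

    /** A key built at runtime is equal in content but is a separate object. */
    static boolean runtimeKeyIsSeparateObject() {
        String parsed = new String("userId".toCharArray()); // stands in for a parser-created key
        return parsed.equals(S.userId) && parsed != S.userId;
    }

    public static void main(String[] args) {
        System.out.println(constantIsPooledLiteral());
        System.out.println(runtimeKeyIsSeparateObject());
    }
}
```

So the constants neither add duplicates nor remove the ones that the JSON and database layers create at runtime.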

huangapple
  • Posted on 2020-10-24 20:05:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64513142.html