July 20, 2023, 15:55:16

Java heap space error when trying to create an NGram model

Question

As part of a larger project I need to create an NGram model in Java, which is neither optimal nor optional. I am using JDK 20 and run the code from VS Code. When I run it, I get:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.HashMap.resize(HashMap.java:710)
    at java.base/java.util.HashMap.putVal(HashMap.java:635)
    at java.base/java.util.HashMap.put(HashMap.java:618)
    at Ngram.NGramNode.addNGram(NGramNode.java:277)
    at Ngram.NGramNode.addNGram(NGramNode.java:280)
    at Ngram.NGram.addNGramSentence(NGram.java:157)
    at com.glmadu.editdistance.TRspellChecker.getCorpus(TRspellChecker.java:68)
    at com.glmadu.editdistance.TRspellChecker.checkFileSpell(TRspellChecker.java:22)
    at com.glmadu.App.main(App.java:21)

I increased the heap size to 8 GB in launch.json, and the corpus file is around 750 MB. Here is the relevant code:

private static void getCorpus(String output) {
    ArrayList<ArrayList<String>> corpus = new ArrayList<>();
    try (BufferedReader br = new BufferedReader(new FileReader("path/to/corpus"))) {
        String line;
        while ((line = br.readLine()) != null) {
            String[] tokens = line.split(" "); // line 63
            ArrayList<String> sentence = new ArrayList<>();
            for (String token : tokens) {
                sentence.add(token); // line 68
            }
            corpus.add(sentence);
        }
        NGram<String> nGram = new NGram<>(corpus, 2);
        nGram.saveAsText(output);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

I do not understand how I can still run out of heap space after raising the limit to 8 GB. I also tried 10 GB and 12 GB, but then I get this error instead:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
    at java.base/java.lang.String.split(String.java:3138)
    at java.base/java.lang.String.split(String.java:3212)
    at com.glmadu.editdistance.TRspellChecker.getCorpus(TRspellChecker.java:63)
    at com.glmadu.editdistance.TRspellChecker.checkFileSpell(TRspellChecker.java:22)
    at com.glmadu.App.main(App.java:21)

I am running this from VS Code. Thanks in advance.

I tried increasing the heap size and reading fewer lines, but I still got the error even when reading only the first 1000 lines. I also tried not saving the NGram model, and from that I conclude the problem is not the NGram modeling itself but rather that structures like the arrays take too much space in memory. Also, when I check memory usage in Task Manager, it sits at 4-5 GB and never gets close to the 8 GB I allocated.
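
Since Task Manager shows usage well below the configured limit, one thing worth checking is whether the heap setting from launch.json actually reaches the JVM. The following is a minimal sketch using only the standard library; the class name HeapCheck and the helper maxHeapMiB are made up for illustration and are not part of the project above.

```java
// Hypothetical helper, not part of the original project.
class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mib = 1024L * 1024L;
        // maxMemory reflects -Xmx (or the JVM default if no limit was passed).
        System.out.println("max heap   : " + maxHeapMiB() + " MiB");
        // totalMemory is the heap currently reserved; freeMemory is the unused part of it.
        System.out.println("total heap : " + rt.totalMemory() / mib + " MiB");
        System.out.println("free heap  : " + rt.freeMemory() / mib + " MiB");
    }
}
```

If the printed maximum is far below 8 GB, the -Xmx option is probably not being applied to this launch. In the VS Code Java launch configuration it is normally passed through the vmArgs field, for example "vmArgs": "-Xmx8g".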

Answer 1

Score: 0

Alright, here is how I "solved" the problem. I set an initial array size as suggested in @Sascha's response, but it still had problems, so I split the work into chunks and merged the results afterwards.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import Ngram.NGram;

// corpus and ARRAY_SIZE are not shown in the post; they are assumed here to be
// static fields of the enclosing class (ARRAY_SIZE = lines per chunk).
public static void getCorpus(String output) {
    int count = 0;     // lines read into the current chunk
    int countMain = 0; // number of part files written so far
    try (BufferedReader br = new BufferedReader(
            new FileReader("path/to/corpus"))) {
        String line;
        while ((line = br.readLine()) != null) {
            count++;
            if (count > ARRAY_SIZE) {
                // Chunk is full: build its model, write it to a part file, reset.
                NGram<String> nGram = new NGram<>(corpus, 2);
                saveNgram(output, nGram, countMain);
                countMain++;
                System.out.println("Clearing ArrayList " + countMain);
                corpus.clear();
                count = 0;
            }
            String[] tokens = line.split(" ");
            ArrayList<String> sentence = new ArrayList<>();
            for (String token : tokens) {
                sentence.add(token);
            }
            corpus.add(sentence);
        }
        // Save the leftover lines as the last part so they are merged as well.
        NGram<String> nGram = new NGram<>(corpus, 2);
        saveNgram(output, nGram, countMain);
        countMain++;
        corpus.clear();
        // Concatenate every part file into the final file.
        String outFile = output + "final" + ".txt";
        try (BufferedWriter bfWrite = new BufferedWriter(
                new FileWriter(outFile, StandardCharsets.UTF_8))) {
            for (int i = 0; i < countMain; i++) {
                String path = output + "_part" + i + ".txt";
                mergeNGram(path, outFile, bfWrite);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

getCorpus takes the path of the output file as a String. To save each chunk it calls saveNgram, which is quite basic: it appends _partX to the output path and saves the NGram there.

private static void saveNgram(String outputPath, NGram nGram, int countMain) {
    String finalPath = outputPath + "_part" + countMain + ".txt";
    File nFile = new File(finalPath);
    if (!nFile.exists()) {
        nGram.saveAsText(finalPath);
    }
}

At the end, if there are any leftover lines, it saves them again and calls mergeNGram, which is just a BufferedReader/BufferedWriter loop that copies each part into the final file.

private static void mergeNGram(String path, String output, BufferedWriter bfWrite) {
    File inFile = new File(path);
    // try-with-resources closes the reader (and the underlying FileReader) automatically.
    try (BufferedReader bfRead = new BufferedReader(
            new FileReader(inFile, StandardCharsets.UTF_8))) {
        String line;
        while ((line = bfRead.readLine()) != null) {
            bfWrite.write(line);
            bfWrite.write(System.lineSeparator());
        }
        bfWrite.flush();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

It is nowhere near perfect, but it solves my current problem, and that is all I can do at the moment. Special thanks to @Sascha for the help. I am leaving this here so anyone with a similar problem can find and adapt it.
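
The merge step above copies the part files one after another, so a bigram that occurs in several chunks will appear several times in the final file. As a rough illustration of what summing the counts across chunks could look like, here is a sketch that uses a plain HashMap instead of the NGram class. Everything in it (BigramCounter, countBigrams, mergeInto) is hypothetical and not part of the NGram library or of the code above, and whether the combined counts fit in memory still depends on the corpus.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the chunk-and-merge idea; not the NGram library's API.
class BigramCounter {
    /** Counts adjacent token pairs line by line, without storing the lines themselves. */
    static Map<String, Integer> countBigrams(BufferedReader reader) {
        Map<String, Integer> counts = new HashMap<>();
        // lines() reads lazily, so only one line is held at a time.
        reader.lines().forEach(line -> {
            String[] tokens = line.trim().split("\\s+");
            for (int i = 0; i + 1 < tokens.length; i++) {
                counts.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
            }
        });
        return counts;
    }

    /** Adds the counts of one chunk into a running total. */
    static void mergeInto(Map<String, Integer> total, Map<String, Integer> chunk) {
        chunk.forEach((bigram, n) -> total.merge(bigram, n, Integer::sum));
    }

    public static void main(String[] args) {
        Map<String, Integer> total = new HashMap<>();
        // Two in-memory "chunks" stand in for two slices of the corpus file.
        mergeInto(total, countBigrams(new BufferedReader(new StringReader("a b c\na b"))));
        mergeInto(total, countBigrams(new BufferedReader(new StringReader("b c"))));
        System.out.println(total);
    }
}
```

In this toy run the pair "a b" is counted twice in the first chunk, and "b c" once in each chunk, so both end up with a total of 2 after merging.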
