September 18, 2020, 20:42:26

Extract GZIP content to a String for large byte data in Java

Question

I have a large String that is GZIP-compressed and stored as a BLOB in a database. After fetching it from the DB, I can recover the string like this:

try (
     ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
     BufferedInputStream bufis = new BufferedInputStream(new GZIPInputStream(bis));
     ByteArrayOutputStream bos = new ByteArrayOutputStream()
) {
    byte[] buf = new byte[4096];
    int len;
    while ((len = bufis.read(buf)) > 0) {
        bos.write(buf, 0, len);
    }
    retval = bos.toString();
}

My problem is that for some records the BLOB is very large, yet I only need to grep 5-6 lines out of it. I also have to process these records in bulk, which drives up the memory footprint.

Is there a way to extract the GZIP content in chunks, so that I can discard the remaining chunks once I have found those lines in the initial part?

Thanks for the help in advance.

Answer 1

Score: 1

Don’t read all the bytes from the BLOB into memory at once. Read your BLOB as an InputStream.

Use a BufferedReader to read and check one line at a time.
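
As a rough illustration of why streaming helps, the sketch below wraps an in-memory GZIP payload in a counting stream and reads only the first line. CountingInputStream, sampleGzip, and bytesConsumedForFirstLine are hypothetical names made up for this example, and the byte array merely stands in for the BLOB stream. The exact counts depend on the JDK's internal buffer sizes, so no specific numbers are claimed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class LazyGzipReadDemo {

    // Hypothetical helper: counts how many bytes are pulled from the wrapped stream.
    static class CountingInputStream extends FilterInputStream {
        long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) count++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    // Builds a GZIP payload of pseudo-random hex lines (stand-in for a large BLOB).
    static byte[] sampleGzip(int lineCount) {
        Random random = new Random(42);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            for (int i = 0; i < lineCount; i++) {
                String line = Long.toHexString(random.nextLong()) + "\n";
                out.write(line.getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Reads a single line and reports how many compressed bytes were consumed to get it.
    static long bytesConsumedForFirstLine(byte[] gzipped) {
        CountingInputStream counter =
                new CountingInputStream(new ByteArrayInputStream(gzipped));
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(counter), StandardCharsets.UTF_8))) {
            reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.count;
    }

    public static void main(String[] args) {
        byte[] gzipped = sampleGzip(100_000);
        System.out.println("compressed size: " + gzipped.length);
        System.out.println("consumed for first line: " + bytesConsumedForFirstLine(gzipped));
    }
}
```

On a payload of this size, only a small fraction of the compressed bytes should be pulled before the first line is available, which is what makes stopping early worthwhile.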

A BufferedReader wraps another Reader. To translate your decompressing InputStream into a Reader, use InputStreamReader. It is very important that you specify the charset of the text you’re decompressing; you do not want to rely on the default charset of whatever computer you happen to be running on, since it could be different depending on where you run it.
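
To make the charset point concrete, here is a small sketch that decodes the same UTF-8 bytes twice, once with the right charset and once with ISO-8859-1 as a stand-in for a mismatched platform default. The class and method names are made up for the example, and GZIP is left out to keep the focus on decoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {

    // Decodes the first line of the given bytes using the given charset.
    static String firstLine(byte[] bytes, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The escape below is an e with an acute accent; it keeps this source file pure ASCII.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = firstLine(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = firstLine(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // prints 4
        System.out.println(wrong.length()); // prints 5: the two-byte character became two chars
    }
}
```

The wrongly decoded string no longer contains the original character, so a later String.contains or regex check against it could silently miss lines.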

Putting that together, the streaming read would look something like this:

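import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// blob is assumed to be the java.sql.Blob obtained from the query result
List<String> matchingLines = new ArrayList<>();
String targetToMatch = "pankaj";

try (BufferedReader lines = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(
                blob.getBinaryStream()),
            StandardCharsets.UTF_8))) {

    // Decompress and inspect one line at a time; nothing else is held in memory
    String line;
    while ((line = lines.readLine()) != null) {
        if (line.contains(targetToMatch)) {
            matchingLines.add(line);
        }
    }
}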
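
Since the question asks about discarding the rest of the BLOB once the needed lines are found, one possible variation is to stop reading after a fixed number of matches. The following is only a sketch under assumptions: firstMatches, gzip, and the maxMatches parameter are hypothetical names, and an in-memory byte array stands in for blob.getBinaryStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class EarlyExitGzipGrep {

    // Collects at most maxMatches lines containing target, then stops reading.
    // Closing the reader early abandons the rest of the compressed stream.
    static List<String> firstMatches(InputStream gzipped, String target, int maxMatches) {
        List<String> matches = new ArrayList<>();
        try (BufferedReader lines = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while (matches.size() < maxMatches && (line = lines.readLine()) != null) {
                if (line.contains(target)) {
                    matches.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }

    // Demo helper: compresses text so the example runs without a database.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = gzip("alpha\npankaj one\nbeta\npankaj two\npankaj three\n");
        // prints [pankaj one, pankaj two]
        System.out.println(firstMatches(new ByteArrayInputStream(data), "pankaj", 2));
    }
}
```

With a real BLOB, the same method could be called with blob.getBinaryStream() as the first argument. The third matching line in the demo is never reached because the loop exits as soon as the limit is hit.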

Since you mention grep, you can also use a regular expression to match lines, though I would prefer String.contains over a regular expression for performance reasons, unless you really need a regular expression.

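import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

List<String> matchingLines = new ArrayList<>();
// Compile once and reuse the same Matcher for every line
Matcher matcher = Pattern.compile("(?i)pankaj.*ar").matcher("");

try (BufferedReader lines = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(
                blob.getBinaryStream()),
            StandardCharsets.UTF_8))) {

    String line;
    while ((line = lines.readLine()) != null) {
        if (matcher.reset(line).find()) {
            matchingLines.add(line);
        }
    }
}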
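
The regex variant can also be written with BufferedReader.lines(), which reads lazily, so limit() stops pulling lines once enough matches are collected. This is a sketch rather than part of the original answer: firstRegexMatches, gzip, and maxMatches are hypothetical names, and the byte array again stands in for the BLOB stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class RegexGzipGrep {

    // Returns up to maxMatches lines in which the pattern is found.
    static List<String> firstRegexMatches(InputStream gzipped, Pattern pattern, int maxMatches) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            return reader.lines()
                    .filter(pattern.asPredicate()) // asPredicate() uses find(), not matches()
                    .limit(maxMatches)             // short-circuits once enough lines are found
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: compresses text so the example runs without a database.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = gzip("Pankaj Kumar\npankaj only\nPANKAJ sharma\npankaj bar\n");
        Pattern pattern = Pattern.compile("(?i)pankaj.*ar");
        // prints [Pankaj Kumar, PANKAJ sharma]
        System.out.println(firstRegexMatches(new ByteArrayInputStream(data), pattern, 2));
    }
}
```

The explicit while loop with a reused Matcher avoids the stream overhead, so which form to prefer is mostly a matter of taste and of how hot this code path is.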
