Extracting GZIP content to a String for large byte data in Java

Question

I have a large String, compressed with GZIP and stored as a BLOB in a database. After fetching it from the DB, I can recover the string like this:

        try (
             ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
             BufferedInputStream bufis = new BufferedInputStream(new GZIPInputStream(bis));
             ByteArrayOutputStream bos = new ByteArrayOutputStream()
        ) {
            byte[] buf = new byte[4096];
            int len;
            while ((len = bufis.read(buf)) > 0) {
                bos.write(buf, 0, len);
            }
            retval = bos.toString();
        }

My problem is that for some input records this BLOB is very large, while I only need to grep 5-6 lines out of it. I also have to process these records in bulk, which drives up the memory footprint.

Is there a way to extract the content from the GZIP data in chunks, so that I can discard the remaining chunks if I find those lines in the initial part?

Thanks for the help in advance.

Answer 1

Score: 1

Don’t read all the bytes from the BLOB into memory at once. Read your BLOB as an InputStream.
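
As a minimal sketch of that first step, assuming the column is exposed as a java.sql.Blob: the helper below (BlobStreams and openDecompressed are hypothetical names, not part of the original answer) wraps the Blob's binary stream in a GZIPInputStream without first copying it into a byte[]. Whether the bytes are actually fetched lazily depends on the JDBC driver.

```java
import java.io.IOException;
import java.io.InputStream;
import java.sql.Blob;
import java.sql.SQLException;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of the original answer.
final class BlobStreams {
    // Returns a stream of decompressed bytes. The caller must close it.
    static InputStream openDecompressed(Blob blob) throws SQLException, IOException {
        InputStream raw = blob.getBinaryStream();
        try {
            // The constructor reads the GZIP header and fails on non-GZIP data.
            return new GZIPInputStream(raw);
        } catch (IOException e) {
            raw.close(); // do not leak the raw stream if the header is invalid
            throw e;
        }
    }
}
```

With JDBC, the Blob would typically come from something like resultSet.getBlob("payload"), where "payload" is a made-up column name.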

Use a BufferedReader to read and check one line at a time.

A BufferedReader wraps another Reader. To translate your decompressing InputStream into a Reader, use InputStreamReader. It is very important that you specify the charset of the text you’re decompressing; you do not want to rely on the default charset of whatever computer you happen to be running on, since it could be different depending on where you run it.
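
To illustrate why the charset matters, here is a small sketch (CharsetDemo and firstLine are names invented for this example) that decodes the same bytes through an InputStreamReader with two different charsets. It assumes the text was originally encoded as UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Illustrative only; the class and method names are made up for this sketch.
final class CharsetDemo {
    // Decodes the bytes with the given charset and returns the first line.
    static String firstLine(byte[] data, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return reader.readLine();
        }
    }
}
```

Given the UTF-8 bytes of "héllo", decoding with StandardCharsets.UTF_8 returns "héllo", while decoding with StandardCharsets.ISO_8859_1 returns "hÃ©llo", because the two bytes of "é" are read as two separate characters.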

So it would look something like this:

List<String> matchingLines = new ArrayList<>();
String targetToMatch = "pankaj";

try (BufferedReader lines = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(
                blob.getBinaryStream()),
            StandardCharsets.UTF_8))) {

    String line;
    while ((line = lines.readLine()) != null) {
        if (line.contains(targetToMatch)) {
            matchingLines.add(line);
        }
    }
}
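
The question also asks about discarding the rest of the data once the wanted lines are found. One way to sketch that, assuming you know how many matching lines you need, is to stop reading as soon as that count is reached; the rest of the compressed data is then never decompressed. GzipGrep, firstMatches and the maxMatches parameter are hypothetical names, not part of the original answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of the original answer.
final class GzipGrep {
    // Returns at most maxMatches lines containing target, reading no further than needed.
    // Assumes the compressed text is UTF-8. Closes the given stream when done.
    static List<String> firstMatches(InputStream compressed, String target, int maxMatches)
            throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader lines = new BufferedReader(
                new InputStreamReader(
                    new GZIPInputStream(compressed),
                    StandardCharsets.UTF_8))) {
            String line;
            // Stop as soon as enough lines were found, or at end of stream.
            while (matches.size() < maxMatches && (line = lines.readLine()) != null) {
                if (line.contains(target)) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }
}
```

A call might look like GzipGrep.firstMatches(blob.getBinaryStream(), "pankaj", 6). Only one line plus the reader's internal buffers are held in memory at a time, regardless of the BLOB size.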

Since you mention grep, you can also use a regular expression to match lines, though I would prefer String.contains over a regular expression for performance reasons, unless you really need a regular expression.

List<String> matchingLines = new ArrayList<>();
Matcher matcher = Pattern.compile("(?i)pankaj.*ar").matcher("");

try (BufferedReader lines = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(
                blob.getBinaryStream()),
            StandardCharsets.UTF_8))) {

    String line;
    while ((line = lines.readLine()) != null) {
        if (matcher.reset(line).find()) {
            matchingLines.add(line);
        }
    }
}
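
To check what the pattern above accepts before running it against real data, a small sketch like the following can help. PatternCheck and check are names invented for this example; the pattern and the reuse of one Matcher via reset() are taken from the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are made up for this sketch.
final class PatternCheck {
    // Reports, for each line, whether the case-insensitive pattern is found in it.
    static List<Boolean> check(String... lines) {
        // One Matcher is created once and reset for every line.
        Matcher matcher = Pattern.compile("(?i)pankaj.*ar").matcher("");
        List<Boolean> results = new ArrayList<>();
        for (String line : lines) {
            results.add(matcher.reset(line).find());
        }
        return results;
    }
}
```

For the inputs "Pankaj Kumar", "pankaj", "PANKAJ SHARMA" and "kumar pankaj", this returns [true, false, true, false]: the pattern needs "pankaj" followed later on the same line by "ar", ignoring case.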

huangapple
  • Posted on September 18, 2020, 20:42:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63956011.html