读取非常大的包含日期和非日期数字的Excel文件

huangapple go评论81阅读模式
英文:

Read very large Excel file with date and non-date numbers

问题

package com.greshje.example.poi.streaming;

import java.io.InputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.monitorjbl.xlsx.StreamingReader;

public class PoiStreamingExample {

    private static final Logger log = LoggerFactory.getLogger(PoiStreamingExample.class);

    private static final String FILE_NAME = "/com/greshje/example/poi/streaming/test-file.xlsx";

    public static void main(String[] args) {
        log.info("Starting test...");
        log.info("Getting file");
        InputStream in = PoiStreamingExample.class.getResourceAsStream(FILE_NAME);
        log.info("Got file");
        StreamingReader reader = getReader(in, 0);
        log.info("File contents:");
        for (Row row : reader) {
            String rowString = "";
            for (Cell cell : row) {
                if (rowString != "") {
                    rowString += ",";
                }
                // NEED A WAY TO GET A DATE WHERE APPROPRIATE HERE
                rowString += cell.getStringCellValue();
            }
            log.info(rowString);
        }
        log.info("Done.");
    }

    public static StreamingReader getReader(InputStream in, int sheetIndex) {
        try {
            StreamingReader reader = StreamingReader.builder()
                    .rowCacheSize(100) // number of rows to keep in memory (defaults to 10)
                    .bufferSize(4096) // buffer size to use when reading InputStream to file (defaults to 1024)
                    .sheetIndex(sheetIndex) // index of sheet to use
                    .read(in); // read the file
            return reader;
        } catch (Exception exp) {
            throw new RuntimeException(exp);
        }
    }
}

--- EDIT ------------------------

I updated the xls-stream version and DO NOT still have the issue UPDATE TO NEW VERSION FIXED THIS!!!

<!-- XLSX-STREAMER https://mvnrepository.com/artifact/com.monitorjbl/xlsx-streamer -->
<dependency>
    <groupId>com.monitorjbl</groupId>
    <artifactId>xlsx-streamer</artifactId>
    <version>2.1.0</version>
</dependency>

THIS WAS WITH THE OLD VERSION: There just doesn't seem to be any information that would allow the cell type to be determined:

读取非常大的包含日期和非日期数字的Excel文件

Also, code to get cell type doesn't seem to be supported (in the old version):

读取非常大的包含日期和非日期数字的Excel文件

The new version has much more cell information and works with dates and numbers giving the results shown in the accepted answer.

英文:

I need to read a very large Excel file that has both date and not date numbers. All of the examples I find seem to be able to do one or the other (identify a cell as a date value or read the file in constant memory).

The only solution that seems to work for the very large file is the StreamingReader approach as described here (the other examples described here either do not work for the file format I have or give out of memory heap errors).

https://stackoverflow.com/questions/42380765/read-huge-excel-file500k-rows-in-java

http://poi.apache.org/components/spreadsheet/how-to.html#event_api

What I'm doing to read the file is shown below. The entire example with test-excel.xmls (a small test file) is available in github here:

https://github.com/greshje/example-poi-streaming

POM.XML:

<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd&quot;>

&lt;properties&gt;
&lt;project.build.sourceEncoding&gt;UTF-8&lt;/project.build.sourceEncoding&gt;
&lt;build.version&gt;1.0.4-001&lt;/build.version&gt;
&lt;/properties&gt;
&lt;modelVersion&gt;4.0.0&lt;/modelVersion&gt;
&lt;groupId&gt;com.greshje.examples&lt;/groupId&gt;
&lt;artifactId&gt;poi-streaming-example&lt;/artifactId&gt;
&lt;version&gt;1.0.4-SNAPSHOT&lt;/version&gt;
&lt;packaging&gt;jar&lt;/packaging&gt;
&lt;!-- 
*
* dependencies
*
--&gt;
&lt;dependencies&gt;
&lt;!-- JUNIT https://mvnrepository.com/artifact/junit/junit --&gt;
&lt;dependency&gt;
&lt;groupId&gt;junit&lt;/groupId&gt;
&lt;artifactId&gt;junit&lt;/artifactId&gt;
&lt;version&gt;4.12&lt;/version&gt;
&lt;scope&gt;test&lt;/scope&gt;
&lt;/dependency&gt;
&lt;!-- JUNIT-TOOLBOX https://mvnrepository.com/artifact/com.googlecode.junit-toolbox/junit-toolbox --&gt;
&lt;dependency&gt;
&lt;groupId&gt;com.googlecode.junit-toolbox&lt;/groupId&gt;
&lt;artifactId&gt;junit-toolbox&lt;/artifactId&gt;
&lt;version&gt;2.4&lt;/version&gt;
&lt;scope&gt;test&lt;/scope&gt;
&lt;/dependency&gt;
&lt;!-- SLF4J LOGBACK CLASSIC https://mvnrepository.com/artifact/ch.qos.logback/logback-classic --&gt;
&lt;dependency&gt;
&lt;groupId&gt;ch.qos.logback&lt;/groupId&gt;
&lt;artifactId&gt;logback-classic&lt;/artifactId&gt;
&lt;version&gt;1.2.3&lt;/version&gt;
&lt;/dependency&gt;
&lt;!-- POI https://mvnrepository.com/artifact/org.apache.poi/poi --&gt;
&lt;dependency&gt;
&lt;groupId&gt;org.apache.poi&lt;/groupId&gt;
&lt;artifactId&gt;poi&lt;/artifactId&gt;
&lt;version&gt;4.1.2&lt;/version&gt;
&lt;/dependency&gt;
&lt;!-- POI-OOXML https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml --&gt;
&lt;dependency&gt;
&lt;groupId&gt;org.apache.poi&lt;/groupId&gt;
&lt;artifactId&gt;poi-ooxml&lt;/artifactId&gt;
&lt;version&gt;4.1.2&lt;/version&gt;
&lt;/dependency&gt;
&lt;!-- XERCES https://mvnrepository.com/artifact/xerces/xerces --&gt;
&lt;dependency&gt;
&lt;groupId&gt;xerces&lt;/groupId&gt;
&lt;artifactId&gt;xerces&lt;/artifactId&gt;
&lt;version&gt;2.4.0&lt;/version&gt;
&lt;/dependency&gt;
&lt;!-- XERCES-IMPL https://mvnrepository.com/artifact/xerces/xercesImpl --&gt;
&lt;dependency&gt;
&lt;groupId&gt;xerces&lt;/groupId&gt;
&lt;artifactId&gt;xercesImpl&lt;/artifactId&gt;
&lt;version&gt;2.12.0&lt;/version&gt;
&lt;/dependency&gt;
&lt;!-- XLSX-STREAMER https://mvnrepository.com/artifact/com.monitorjbl/xlsx-streamer --&gt;
&lt;dependency&gt;
&lt;groupId&gt;com.monitorjbl&lt;/groupId&gt;
&lt;artifactId&gt;xlsx-streamer&lt;/artifactId&gt;
&lt;version&gt;0.2.3&lt;/version&gt;
&lt;/dependency&gt;
&lt;/dependencies&gt;
&lt;!-- 
*
* build
*
--&gt;
&lt;build&gt;
&lt;plugins&gt;
&lt;plugin&gt;
&lt;artifactId&gt;maven-compiler-plugin&lt;/artifactId&gt;
&lt;version&gt;3.7.0&lt;/version&gt;
&lt;configuration&gt;
&lt;source&gt;1.8&lt;/source&gt;
&lt;target&gt;1.8&lt;/target&gt;
&lt;/configuration&gt;
&lt;/plugin&gt;
&lt;plugin&gt;
&lt;groupId&gt;org.apache.maven.plugins&lt;/groupId&gt;
&lt;artifactId&gt;maven-jar-plugin&lt;/artifactId&gt;
&lt;version&gt;2.3.2&lt;/version&gt;
&lt;!--  
&lt;configuration&gt; 
&lt;finalName&gt;&lt;/finalName&gt;                   
&lt;/configuration&gt;
--&gt;
&lt;/plugin&gt;      
&lt;/plugins&gt;
&lt;/build&gt;

</project>

Java Code:

package com.greshje.example.poi.streaming;
import java.io.InputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.monitorjbl.xlsx.StreamingReader;
public class PoiStreamingExample {
private static final Logger log = LoggerFactory.getLogger(PoiStreamingExample.class);
private static final String FILE_NAME = &quot;/com/greshje/example/poi/streaming/test-file.xlsx&quot;;
public static void main(String[] args) {
log.info(&quot;Starting test...&quot;);
log.info(&quot;Getting file&quot;);
InputStream in = PoiStreamingExample.class.getResourceAsStream(FILE_NAME);
log.info(&quot;Got file&quot;);
StreamingReader reader = getReader(in, 0);
log.info(&quot;File contents:&quot;);
for (Row row : reader) {
String rowString = &quot;&quot;;
for (Cell cell : row) {
if (rowString != &quot;&quot;) {
rowString += &quot;,&quot;;
}
// NEED A WAY TO GET A DATE WHERE APPROPRIATE HERE
rowString += cell.getStringCellValue();
}
log.info(rowString);
}
log.info(&quot;Done.&quot;);
}
public static StreamingReader getReader(InputStream in, int sheetIndex) {
try {
StreamingReader reader = StreamingReader.builder()
.rowCacheSize(100) // number of rows to keep in memory (defaults to 10)
.bufferSize(4096) // buffer size to use when reading InputStream to file (defaults to 1024)
.sheetIndex(sheetIndex) // index of sheet to use
.read(in); // read the file
return reader;
} catch (Exception exp) {
throw new RuntimeException(exp);
}
}
}

My test data looks like this:

读取非常大的包含日期和非日期数字的Excel文件

The output looks like this (date and not date values are represented as numbers).

2020-09-06 10:47:13,814 10:47:13.814 [main] INFO  (PoiStreamingExample.java:19) - Starting test...
2020-09-06 10:47:13,822 10:47:13.822 [main] INFO  (PoiStreamingExample.java:20) - Getting file
2020-09-06 10:47:13,823 10:47:13.823 [main] INFO  (PoiStreamingExample.java:22) - Got file
2020-09-06 10:47:15,117 10:47:15.117 [main] INFO  (PoiStreamingExample.java:24) - File contents:
2020-09-06 10:47:15,149 10:47:15.149 [main] INFO  (PoiStreamingExample.java:33) - Number,Date (mostly),Date (mostly)
2020-09-06 10:47:15,150 10:47:15.150 [main] INFO  (PoiStreamingExample.java:33) - 123456,43550
2020-09-06 10:47:15,150 10:47:15.150 [main] INFO  (PoiStreamingExample.java:33) - 123456,43685,44019
2020-09-06 10:47:15,150 10:47:15.150 [main] INFO  (PoiStreamingExample.java:33) - 123456,43522,43535
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 123456,43503,43538
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 123456,43535,43564
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 123456,43536,43574
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43553,43700
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,44041
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43521,43550
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43558,43580
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43567,43599
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43633,43641
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43573,43615
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43577,43606
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43719,43754
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43634,43641
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - 123,43550
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - smith,43550
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - jones,43550
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - 43550,43550
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:35) - Done.

--- EDIT ------------------------

I updated the xls-stream version and DO NOT still have the issue UPDATE TO NEW VERSION FIXED THIS!!!

&lt;!-- XLSX-STREAMER https://mvnrepository.com/artifact/com.monitorjbl/xlsx-streamer --&gt;
&lt;dependency&gt;
&lt;groupId&gt;com.monitorjbl&lt;/groupId&gt;
&lt;artifactId&gt;xlsx-streamer&lt;/artifactId&gt;
&lt;version&gt;2.1.0&lt;/version&gt;
&lt;/dependency&gt;

THIS WAS WITH THE OLD VERSION: There just doesn't seem to be any information that would allow the cell type to be determined:

读取非常大的包含日期和非日期数字的Excel文件

Also, code to get cell type doesn't seem to be supported (in the old version)

读取非常大的包含日期和非日期数字的Excel文件

The new version has much more cell information and works with dates and numbers giving the results shown in the accepted answer.

答案1

得分: 2

使用最新版本的Excel Streaming Reader,版本为2.1.0,这个问题已经解决。

使用您的test-file.xlsx文件和以下代码:

import java.io.InputStream;

import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Cell;

import com.monitorjbl.xlsx.StreamingReader;

public class PoiStreamingExample {

 private static final String FILE_NAME = "./test-file.xlsx";

 public static void main(String[] args) {

  try (
   InputStream is = PoiStreamingExample.class.getResourceAsStream(FILE_NAME);
   Workbook workbook = StreamingReader.builder()
    .rowCacheSize(100)
    .bufferSize(4096)
    .open(is)) {
   Sheet sheet =  workbook.getSheetAt(0);
   for (Row r : sheet) {
    String rowString = "";
    for (Cell c : r) {
     if (rowString != "") {
      rowString += ",";
     } 
     rowString += c.getStringCellValue();
    }
    System.out.println(rowString);
   }
  } catch (Exception ex) {
   ex.printStackTrace();
  }

 }
}

它打印出:

Number,Date (mostly),Date (mostly)
123456,3/26/19
123456,8/8/19,7/7/20
123456,2/26/19,3/11/19
123456,2/7/19,3/14/19
123456,3/11/19,4/9/19
123456,3/12/19,4/19/19
7890123,3/29/19,8/23/19
7890123,7/29/20
7890123,2/25/19,3/26/19
7890123,4/3/19,4/25/19
7890123,4/12/19,5/14/19
7890123,6/17/19,6/25/19
7890123,4/18/19,5/30/19
7890123,4/22/19,5/21/19
7890123,9/11/19,10/16/19
7890123,6/18/19,6/25/19
123,43550
smith,43550
jones,43550
43550,43550
英文:

Using latest version of Excel Streaming Reader, which is 2.1.0, this problem is gone.

Using your test-file.xlsx and following code:

import java.io.InputStream;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Cell;
import com.monitorjbl.xlsx.StreamingReader;
public class PoiStreamingExample {
private static final String FILE_NAME = &quot;./test-file.xlsx&quot;;
public static void main(String[] args) {
try (
InputStream is = PoiStreamingExample.class.getResourceAsStream(FILE_NAME);
Workbook workbook = StreamingReader.builder()
.rowCacheSize(100)
.bufferSize(4096)
.open(is)) {
Sheet sheet =  workbook.getSheetAt(0);
for (Row r : sheet) {
String rowString = &quot;&quot;;
for (Cell c : r) {
if (rowString != &quot;&quot;) {
rowString += &quot;,&quot;;
} 
rowString += c.getStringCellValue();
}
System.out.println(rowString);
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}

It prints:

Number,Date (mostly),Date (mostly)
123456,3/26/19
123456,8/8/19,7/7/20
123456,2/26/19,3/11/19
123456,2/7/19,3/14/19
123456,3/11/19,4/9/19
123456,3/12/19,4/19/19
7890123,3/29/19,8/23/19
7890123,7/29/20
7890123,2/25/19,3/26/19
7890123,4/3/19,4/25/19
7890123,4/12/19,5/14/19
7890123,6/17/19,6/25/19
7890123,4/18/19,5/30/19
7890123,4/22/19,5/21/19
7890123,9/11/19,10/16/19
7890123,6/18/19,6/25/19
123,43550
smith,43550
jones,43550
43550,43550

huangapple
  • 本文由 发表于 2020年9月6日 23:01:41
  • 转载请务必保留本文链接:https://go.coder-hub.com/63765556.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定