英文:
java.lang.NullPointerException while reading specific sheet from xlsx using org.zuinnote.spark.office.excel
问题
我们正在尝试使用org.zuinnote.spark.office.excel从Excel(包含3个工作表的.xlsx文件)中读取一个特定的工作表到Spark DataFrame。
我们正在使用MSExcelLowFootprintParser解析器。
使用的代码如下:
val hadoopConf = new Configuration()
val spark = SparkSession.builder()
.appName("ExcelReadingExample")
.master("local[*]")
.getOrCreate()
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.lowFootprint", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.header.read", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.sheets", "Most Runs Over - 2010")
spark.sparkContext.setLogLevel("INFO")
val schema = StructType(Seq(
StructField("col", IntegerType, nullable = true),
StructField("col1", StringType, nullable = true),
StructField("Runs", IntegerType, nullable = true),
StructField("BF", IntegerType, nullable = true),
StructField("SR", DoubleType, nullable = true),
StructField("s4", IntegerType, nullable = true),
StructField("s6", IntegerType, nullable = true),
StructField("Against", StringType, nullable = true),
StructField("Venue", StringType, nullable = true),
StructField("Match Date", StringType, nullable = true),
StructField("Match Time", StringType, nullable = true),
StructField("Match Partition Time", StringType, nullable = true)
))
val df: Dataset[Row] = spark.read
.format("org.zuinnote.spark.office.excel")
.option("hadoopoffice.read.sheets", "Most Runs Over - 2010")
.option("read.spark.simpleMode", "true")
.option("read.lowFootprint", "true")
.option("multiLine", "true")
.option("read.spark.simpleMode.maxInferRows", "1000")
.option("read.header.read", "true")
.schema(schema)
.load("D:\\excel\\spark-hadoopoffice-ds-s2-ho-1.3.9\\spark-hadoopoffice-ds-s2-ho-1.3.9\\src\\resources\\MostRuns_Over2008.xlsx")
df.show()
我们遇到了以下异常:
java.lang.NullPointerException
...
如果我读取所有工作表而不是一个工作表,它可以正常工作。如果我不使用低内存占用解析器,它也可以正常工作。
- org.apache.poi 版本 4.1.2
- spark-hadoopoffice-ds 版本 1.3.9
- Spark 版本 3.1.2
英文:
We are trying to read one specific sheet from Excel (.xlsx with 3 sheets) using org.zuinnote.spark.office.excel into spark dataframe.
We are using MSExcelLowFootprintParser parser.
code used is
val hadoopConf = new Configuration()
val spark = SparkSession.builder()
.appName("ExcelReadingExample")
.master("local[*]")
.getOrCreate()
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.lowFootprint", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.header.read","true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.sheets", "Most Runs Over - 2010")
spark.sparkContext.setLogLevel("INFO")
val schema = StructType(Seq(
StructField("col", IntegerType, nullable = true),
StructField("col1", StringType, nullable = true),
StructField("Runs", IntegerType, nullable = true),
StructField("BF", IntegerType, nullable = true),
StructField("SR", DoubleType, nullable = true),
StructField("s4", IntegerType, nullable = true),
StructField("s6", IntegerType, nullable = true),
StructField("Against", StringType, nullable = true),
StructField("Venue", StringType, nullable = true),
StructField("Match Date", StringType, nullable = true),
StructField("Match Time", StringType, nullable = true),
StructField("Match Partition Time", StringType, nullable = true)
))
val df: Dataset[Row] = spark.read
.format("org.zuinnote.spark.office.excel")
.option("hadoopoffice.read.sheets", "Most Runs Over - 2010")
.option("read.spark.simpleMode","true")
.option("read.lowFootprint", "true")
.option("multiLine", "true")
.option("read.spark.simpleMode.maxInferRows","1000")
.option("read.header.read","true")
.schema(schema)
.load("D:\\excel\\spark-hadoopoffice-ds-s2-ho-1.3.9\\spark-hadoopoffice-ds-s2-ho-1.3.9\\src\\resources\\MostRuns_Over2008.xlsx")
df.show();
We are facing exception
java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.parse(MSExcelLowFootprintParser.java:267)
at org.zuinnote.hadoop.office.format.common.OfficeReader.parse(OfficeReader.java:92)
at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordReader.initialize(AbstractSpreadSheetDocumentRecordReader.java:138)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.<init>(HadoopFileExcelReader.scala:55)
at org.zuinnote.spark.office.excel.DefaultSource.$anonfun$buildReader$4(DefaultSource.scala:322)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:185)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:240)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:832)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2281)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Its working if i read all sheets instead of one sheet. Its also working if i dont use low foot print parser.
- org.apache.poi version 4.1.2
- spark-hadoopoffice-ds version 1.3.9
- spark version 3.1.2
答案1
得分: 1
"如果我读取所有工作表而不是一个工作表,它就能正常工作。如果我不使用低内存占用的解析器,它也能正常工作。
建议进行跟踪。在 https://github.com/ZuInnoTe/hadoopoffice/issues 上报告此问题。
空指针异常的原因
java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)
MSExcelLowFootprintParser 在程序行 374 调用 iter.getSheetName()
,其中 iter
是 XSSFReader.SheetIterator。然后 XSSFReader.SheetIterator.getSheetName 调用 xssfSheetRef.getName()
,但 XSSFSheetRef xssfSheetRef
甚至不会被设置为任何值,将为 null,除非调用 XSSFReader.SheetIterator.next。
如果读取所有工作表而不是一个工作表,它就能正常工作,因为然后 MSExcelLowFootprintParser.java
代码行 374 永远不会被执行,因为没有提供工作表名称。然后 while (iter.hasNext()) {...
之后的第一个调用将是 InputStream rawSheetInputStream = iter.next();
,并且该调用会调用 XSSFReader.SheetIterator.next
,从而设置 xssfSheetRef
。
如果不使用低内存占用的解析器,它也能正常工作,因为它根本不使用 org.apache.poi.xssf.eventusermodel.XSSFReader
。
解决方案
在 org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage
中,在 while (iter.hasNext()) {
之后需要立即调用 InputStream rawSheetInputStream = iter.next();
。这将设置 XSSFReader.SheetIterator.xssfSheetRef
,从而避免空指针异常。"
英文:
Its working if i read all sheets instead of one sheet. It is also working if i don't use low footprint parser.
Good observed. File a bug about this on https://github.com/ZuInnoTe/hadoopoffice/issues.
Reason for the NullPointerException
java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)
MSExcelLowFootprintParser in program line 374 calls iter.getSheetName()
where iter
is XSSFReader.SheetIterator. Then XSSFReader.SheetIterator.getSheetName calls xssfSheetRef.getName()
but the XSSFSheetRef xssfSheetRef
will not even be set to something and will be null unless XSSFReader.SheetIterator.next gets called.
Its working if you read all sheets instead of one sheet because then MSExcelLowFootprintParser.java
code line 374 never gets reached because no sheet names are given. Then the first call after while (iter.hasNext()) {...
will be InputStream rawSheetInputStream = iter.next();
and that calls XSSFReader.SheetIterator.next
which sets xssfSheetRef
.
It is also working if you don't use low footprint parser because that not uses org.apache.poi.xssf.eventusermodel.XSSFReader
at all.
Solution
InputStream rawSheetInputStream = iter.next();
needs get called immidatelly after while (iter.hasNext()) {
in org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage
. That will set XSSFReader.SheetIterator.xssfSheetRef
and so avoid the NullPointerException.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论