java.lang.NullPointerException while reading specific sheet from xlsx using org.zuinnote.spark.office.excel

huangapple go评论45阅读模式
英文:

java.lang.NullPointerException while reading specific sheet from xlsx using org.zuinnote.spark.office.excel

问题

我们正在尝试使用org.zuinnote.spark.office.excel从Excel(包含3个工作表的.xlsx文件)中读取一个特定的工作表到Spark DataFrame。

我们正在使用MSExcelLowFootprintParser解析器。

使用的代码如下:

val hadoopConf = new Configuration()
val spark = SparkSession.builder()
  .appName("ExcelReadingExample")
  .master("local[*]")
  .getOrCreate()

spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.lowFootprint", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.header.read", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.sheets", "Most Runs Over - 2010")

spark.sparkContext.setLogLevel("INFO")
val schema = StructType(Seq(
    StructField("col", IntegerType, nullable = true),
    StructField("col1", StringType, nullable = true),
    StructField("Runs", IntegerType, nullable = true),
    StructField("BF", IntegerType, nullable = true),
    StructField("SR", DoubleType, nullable = true),
    StructField("s4", IntegerType, nullable = true),
    StructField("s6", IntegerType, nullable = true),
    StructField("Against", StringType, nullable = true),
    StructField("Venue", StringType, nullable = true),
    StructField("Match Date", StringType, nullable = true),
    StructField("Match Time", StringType, nullable = true),
    StructField("Match Partition Time", StringType, nullable = true)
))

val df: Dataset[Row] = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("hadoopoffice.read.sheets", "Most Runs Over - 2010")
  .option("read.spark.simpleMode", "true")
  .option("read.lowFootprint", "true")
  .option("multiLine", "true")
  .option("read.spark.simpleMode.maxInferRows", "1000")
  .option("read.header.read", "true")
  .schema(schema)
  .load("D:\\excel\\spark-hadoopoffice-ds-s2-ho-1.3.9\\spark-hadoopoffice-ds-s2-ho-1.3.9\\src\\resources\\MostRuns_Over2008.xlsx")

df.show()

我们遇到了以下异常:

java.lang.NullPointerException
...

如果我读取所有工作表而不是一个工作表,它可以正常工作。如果我不使用低内存占用解析器,它也可以正常工作。

  • org.apache.poi 版本 4.1.2
  • spark-hadoopoffice-ds 版本 1.3.9
  • Spark 版本 3.1.2
英文:

We are trying to read one specific sheet from Excel (.xlsx with 3 sheets) using org.zuinnote.spark.office.excel into spark dataframe.

We are using MSExcelLowFootprintParser parser.

code used is

    val hadoopConf = new Configuration()
val spark = SparkSession.builder()
.appName("ExcelReadingExample")
.master("local[*]")
.getOrCreate()
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.lowFootprint", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.header.read","true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.sheets", "Most Runs Over - 2010")
spark.sparkContext.setLogLevel("INFO")
val schema = StructType(Seq(
StructField("col", IntegerType, nullable = true),
StructField("col1", StringType, nullable = true),
StructField("Runs", IntegerType, nullable = true),
StructField("BF", IntegerType, nullable = true),
StructField("SR", DoubleType, nullable = true),
StructField("s4", IntegerType, nullable = true),
StructField("s6", IntegerType, nullable = true),
StructField("Against", StringType, nullable = true),
StructField("Venue", StringType, nullable = true),
StructField("Match Date", StringType, nullable = true),
StructField("Match Time", StringType, nullable = true),
StructField("Match Partition Time", StringType, nullable = true)
))
val df: Dataset[Row] = spark.read
.format("org.zuinnote.spark.office.excel")
.option("hadoopoffice.read.sheets", "Most Runs Over - 2010")
.option("read.spark.simpleMode","true")
.option("read.lowFootprint", "true")
.option("multiLine", "true")
.option("read.spark.simpleMode.maxInferRows","1000")
.option("read.header.read","true")
.schema(schema)
.load("D:\\excel\\spark-hadoopoffice-ds-s2-ho-1.3.9\\spark-hadoopoffice-ds-s2-ho-1.3.9\\src\\resources\\MostRuns_Over2008.xlsx")
df.show();

We are facing exception

java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.parse(MSExcelLowFootprintParser.java:267)
at org.zuinnote.hadoop.office.format.common.OfficeReader.parse(OfficeReader.java:92)
at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordReader.initialize(AbstractSpreadSheetDocumentRecordReader.java:138)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.<init>(HadoopFileExcelReader.scala:55)
at org.zuinnote.spark.office.excel.DefaultSource.$anonfun$buildReader$4(DefaultSource.scala:322)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:185)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:240)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:832)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2281)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

Its working if i read all sheets instead of one sheet. Its also working if i dont use low foot print parser.

  • org.apache.poi version 4.1.2
  • spark-hadoopoffice-ds version 1.3.9
  • spark version 3.1.2

答案1

得分: 1

"如果我读取所有工作表而不是一个工作表,它就能正常工作。如果我不使用低内存占用的解析器,它也能正常工作。

建议进行跟踪。在 https://github.com/ZuInnoTe/hadoopoffice/issues 上报告此问题。

空指针异常的原因

java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)

MSExcelLowFootprintParser 在程序行 374 调用 iter.getSheetName(),其中 iterXSSFReader.SheetIterator。然后 XSSFReader.SheetIterator.getSheetName 调用 xssfSheetRef.getName(),但 XSSFSheetRef xssfSheetRef 甚至不会被设置为任何值,将为 null,除非调用 XSSFReader.SheetIterator.next

如果读取所有工作表而不是一个工作表,它就能正常工作,因为然后 MSExcelLowFootprintParser.java 代码行 374 永远不会被执行,因为没有提供工作表名称。然后 while (iter.hasNext()) {... 之后的第一个调用将是 InputStream rawSheetInputStream = iter.next();,并且该调用会调用 XSSFReader.SheetIterator.next,从而设置 xssfSheetRef

如果不使用低内存占用的解析器,它也能正常工作,因为它根本不使用 org.apache.poi.xssf.eventusermodel.XSSFReader

解决方案

org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage 中,在 while (iter.hasNext()) { 之后需要立即调用 InputStream rawSheetInputStream = iter.next();。这将设置 XSSFReader.SheetIterator.xssfSheetRef,从而避免空指针异常。"

英文:

Its working if i read all sheets instead of one sheet. It is also working if i don't use low footprint parser.

Good observed. File a bug about this on https://github.com/ZuInnoTe/hadoopoffice/issues.

Reason for the NullPointerException

java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)

MSExcelLowFootprintParser in program line 374 calls iter.getSheetName() where iter is XSSFReader.SheetIterator. Then XSSFReader.SheetIterator.getSheetName calls xssfSheetRef.getName() but the XSSFSheetRef xssfSheetRef will not even be set to something and will be null unless XSSFReader.SheetIterator.next gets called.

Its working if you read all sheets instead of one sheet because then MSExcelLowFootprintParser.java code line 374 never gets reached because no sheet names are given. Then the first call after while (iter.hasNext()) {... will be InputStream rawSheetInputStream = iter.next(); and that calls XSSFReader.SheetIterator.next which sets xssfSheetRef.

It is also working if you don't use low footprint parser because that not uses org.apache.poi.xssf.eventusermodel.XSSFReader at all.

Solution

InputStream rawSheetInputStream = iter.next(); needs get called immidatelly after while (iter.hasNext()) { in org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage. That will set XSSFReader.SheetIterator.xssfSheetRef and so avoid the NullPointerException.

huangapple
  • 本文由 发表于 2023年6月15日 20:42:57
  • 转载请务必保留本文链接:https://go.coder-hub.com/76482603.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定