2023年6月15日 20:42:57go评论58阅读模式

英文:

java.lang.NullPointerException while reading specific sheet from xlsx using org.zuinnote.spark.office.excel

问题

我们正在尝试使用org.zuinnote.spark.office.excel从Excel（包含3个工作表的.xlsx文件）中读取一个特定的工作表到Spark DataFrame。

我们正在使用MSExcelLowFootprintParser解析器。

使用的代码如下：

val hadoopConf = new Configuration()
val spark = SparkSession.builder()
  .appName("ExcelReadingExample")
  .master("local[*]")
  .getOrCreate()

spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.lowFootprint", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.header.read", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.sheets", "Most Runs Over - 2010")

spark.sparkContext.setLogLevel("INFO")
val schema = StructType(Seq(
    StructField("col", IntegerType, nullable = true),
    StructField("col1", StringType, nullable = true),
    StructField("Runs", IntegerType, nullable = true),
    StructField("BF", IntegerType, nullable = true),
    StructField("SR", DoubleType, nullable = true),
    StructField("s4", IntegerType, nullable = true),
    StructField("s6", IntegerType, nullable = true),
    StructField("Against", StringType, nullable = true),
    StructField("Venue", StringType, nullable = true),
    StructField("Match Date", StringType, nullable = true),
    StructField("Match Time", StringType, nullable = true),
    StructField("Match Partition Time", StringType, nullable = true)
))

val df: Dataset[Row] = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("hadoopoffice.read.sheets", "Most Runs Over - 2010")
  .option("read.spark.simpleMode", "true")
  .option("read.lowFootprint", "true")
  .option("multiLine", "true")
  .option("read.spark.simpleMode.maxInferRows", "1000")
  .option("read.header.read", "true")
  .schema(schema)
  .load("D:\\excel\\spark-hadoopoffice-ds-s2-ho-1.3.9\\spark-hadoopoffice-ds-s2-ho-1.3.9\\src\\resources\\MostRuns_Over2008.xlsx")

df.show()

我们遇到了以下异常：

java.lang.NullPointerException
...

如果我读取所有工作表而不是一个工作表，它可以正常工作。如果我不使用低内存占用解析器，它也可以正常工作。

org.apache.poi 版本 4.1.2
spark-hadoopoffice-ds 版本 1.3.9
Spark 版本 3.1.2

英文:

We are trying to read one specific sheet from Excel (.xlsx with 3 sheets) using org.zuinnote.spark.office.excel into spark dataframe.

We are using MSExcelLowFootprintParser parser.

code used is

    val hadoopConf = new Configuration()
val spark = SparkSession.builder()
.appName(&quot;ExcelReadingExample&quot;)
.master(&quot;local[*]&quot;)
.getOrCreate()
spark.sparkContext.hadoopConfiguration.set(&quot;hadoopoffice.read.lowFootprint&quot;, &quot;true&quot;)
spark.sparkContext.hadoopConfiguration.set(&quot;hadoopoffice.read.header.read&quot;,&quot;true&quot;)
spark.sparkContext.hadoopConfiguration.set(&quot;hadoopoffice.read.sheets&quot;, &quot;Most Runs Over - 2010&quot;)
spark.sparkContext.setLogLevel(&quot;INFO&quot;)
val schema = StructType(Seq(
StructField(&quot;col&quot;, IntegerType, nullable = true),
StructField(&quot;col1&quot;, StringType, nullable = true),
StructField(&quot;Runs&quot;, IntegerType, nullable = true),
StructField(&quot;BF&quot;, IntegerType, nullable = true),
StructField(&quot;SR&quot;, DoubleType, nullable = true),
StructField(&quot;s4&quot;, IntegerType, nullable = true),
StructField(&quot;s6&quot;, IntegerType, nullable = true),
StructField(&quot;Against&quot;, StringType, nullable = true),
StructField(&quot;Venue&quot;, StringType, nullable = true),
StructField(&quot;Match Date&quot;, StringType, nullable = true),
StructField(&quot;Match Time&quot;, StringType, nullable = true),
StructField(&quot;Match Partition Time&quot;, StringType, nullable = true)
))
val df: Dataset[Row] = spark.read
.format(&quot;org.zuinnote.spark.office.excel&quot;)
.option(&quot;hadoopoffice.read.sheets&quot;, &quot;Most Runs Over - 2010&quot;)
.option(&quot;read.spark.simpleMode&quot;,&quot;true&quot;)
.option(&quot;read.lowFootprint&quot;, &quot;true&quot;)
.option(&quot;multiLine&quot;, &quot;true&quot;)
.option(&quot;read.spark.simpleMode.maxInferRows&quot;,&quot;1000&quot;)
.option(&quot;read.header.read&quot;,&quot;true&quot;)
.schema(schema)
.load(&quot;D:\\excel\\spark-hadoopoffice-ds-s2-ho-1.3.9\\spark-hadoopoffice-ds-s2-ho-1.3.9\\src\\resources\\MostRuns_Over2008.xlsx&quot;)
df.show();

We are facing exception

java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.parse(MSExcelLowFootprintParser.java:267)
at org.zuinnote.hadoop.office.format.common.OfficeReader.parse(OfficeReader.java:92)
at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordReader.initialize(AbstractSpreadSheetDocumentRecordReader.java:138)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.&lt;init&gt;(HadoopFileExcelReader.scala:55)
at org.zuinnote.spark.office.excel.DefaultSource.$anonfun$buildReader$4(DefaultSource.scala:322)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:185)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:240)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:832)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2281)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

Its working if i read all sheets instead of one sheet. Its also working if i dont use low foot print parser.

org.apache.poi version 4.1.2
spark-hadoopoffice-ds version 1.3.9
spark version 3.1.2

答案1

得分: 1

"如果我读取所有工作表而不是一个工作表，它就能正常工作。如果我不使用低内存占用的解析器，它也能正常工作。

建议进行跟踪。在 https://github.com/ZuInnoTe/hadoopoffice/issues 上报告此问题。

空指针异常的原因

java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)

MSExcelLowFootprintParser 在程序行 374 调用 iter.getSheetName()，其中 iter 是 XSSFReader.SheetIterator。然后 XSSFReader.SheetIterator.getSheetName 调用 xssfSheetRef.getName()，但 XSSFSheetRef xssfSheetRef 甚至不会被设置为任何值，将为 null，除非调用 XSSFReader.SheetIterator.next。

如果读取所有工作表而不是一个工作表，它就能正常工作，因为然后 MSExcelLowFootprintParser.java 代码行 374 永远不会被执行，因为没有提供工作表名称。然后 while (iter.hasNext()) {... 之后的第一个调用将是 InputStream rawSheetInputStream = iter.next();，并且该调用会调用 XSSFReader.SheetIterator.next，从而设置 xssfSheetRef。

如果不使用低内存占用的解析器，它也能正常工作，因为它根本不使用 org.apache.poi.xssf.eventusermodel.XSSFReader。

解决方案

在 org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage 中，在 while (iter.hasNext()) { 之后需要立即调用 InputStream rawSheetInputStream = iter.next();。这将设置 XSSFReader.SheetIterator.xssfSheetRef，从而避免空指针异常。"

英文:

Its working if i read all sheets instead of one sheet. It is also working if i don't use low footprint parser.

Good observed. File a bug about this on https://github.com/ZuInnoTe/hadoopoffice/issues.

Reason for the NullPointerException

java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)

MSExcelLowFootprintParser in program line 374 calls iter.getSheetName() where iter is XSSFReader.SheetIterator. Then XSSFReader.SheetIterator.getSheetName calls xssfSheetRef.getName() but the XSSFSheetRef xssfSheetRef will not even be set to something and will be null unless XSSFReader.SheetIterator.next gets called.

Its working if you read all sheets instead of one sheet because then MSExcelLowFootprintParser.java code line 374 never gets reached because no sheet names are given. Then the first call after while (iter.hasNext()) {... will be InputStream rawSheetInputStream = iter.next(); and that calls XSSFReader.SheetIterator.next which sets xssfSheetRef.

It is also working if you don't use low footprint parser because that not uses org.apache.poi.xssf.eventusermodel.XSSFReader at all.

Solution

InputStream rawSheetInputStream = iter.next(); needs get called immidatelly after while (iter.hasNext()) { in org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage. That will set XSSFReader.SheetIterator.xssfSheetRef and so avoid the NullPointerException.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

java.lang.NullPointerException while reading specific sheet from xlsx using org.zuinnote.spark.office.excel

问题

答案1

使用Apache POI库的Java代码将多个Excel文件合并成一个Excel文件。

如何创建具有标准（荷兰语）标题样式的Word文件？

如何在Vert.x文件系统中读取和解析 .xlsx 文件

如何使用JAVA为Excel标题文本设置字体颜色

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论