Error when parsing an embedded .xlsx file from a .ppt using apache-poi. The supplied POIFSFileSystem does not contain a BIFF8 'Workbook' entry
Question
I am having trouble using Apache POI to extract an embedded .xlsx file from a .ppt file. It would be really great if somebody could help me out.
The subject of the problem:
Problem I am trying to solve: extracting a ".xlsx" file embedded inside a ".ppt".
I am currently using apache-poi.
When I use hslfSlideShow.getEmbeddedObjects(), I get the xlsx object just fine, but when I try to convert it to an XSSFWorkbook object using, say, WorkbookFactory.create(inputStream), it throws this error:
java.lang.IllegalArgumentException: The supplied POIFSFileSystem does not contain a BIFF8 'Workbook' entry. Is it really an excel file? Had: [OlePres000, Ole, CompObj, Package]
at org.apache.poi.hssf.usermodel.HSSFWorkbook.getWorkbookDirEntryName(HSSFWorkbook.java:286)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:326)
at org.apache.poi.hssf.usermodel.HSSFWorkbookFactory.createWorkbook(HSSFWorkbookFactory.java:64)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:167)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:112)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:253)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:221)
Interestingly, it is calling HSSFWorkbookFactory even though it's an xlsx file.
And no, the xlsx file is not corrupted or password-protected. I can open it just fine.
Also, parsing works fine if I read the .xlsx file directly, without embedding it in the .ppt.
Parsing also works fine when I embed it in a .pptx file and call methods such as xmlSlideShow.getAllEmbeddedParts() to get the embedded objects from the .pptx.
Answer 1
Score: 1
Promoting some comments and investigation to an answer...
This was a limitation in older versions of Apache POI, but it was fixed in July in r1880164.
For backwards-compatibility reasons, PowerPoint will often (but not always...) write embedded OOXML resources wrapped in an intermediate OLE2 layer. This has the advantage that tools/programs which expect embedded office documents to be something like an xls / doc can cope, but at the expense of another layer of wrapping.
Newer versions of Apache POI (5.0 should be the first released one with the fix in) have support in WorkbookFactory for receiving an OLE2 wrapper like this, pulling out the underlying xlsx stream and handing that off to XSSFWorkbook. (Older versions did this for OLE2-based password-protected xlsx files, but not their unencrypted cousins.)
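To make the wrapping concrete, here is a minimal sketch that uses only the JDK to tell the two container types apart by their leading bytes. The class and method names (ContainerSniffer, sniff) are hypothetical and not part of Apache POI. The working assumption is that WorkbookFactory picks its code path in a similar way, from the header of the stream it is given, which would explain why an OLE2-wrapped xlsx ends up in HSSFWorkbookFactory.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Apache POI.
final class ContainerSniffer {
    // Signature at the start of every OLE2 compound document (xls, doc, ppt, and OLE2 wrappers).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature; a bare xlsx/docx/pptx is a ZIP archive.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Returns "OLE2", "ZIP", or "UNKNOWN" based on the leading bytes of the data. */
    static String sniff(byte[] data) {
        if (startsWith(data, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(data, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        // Stand-in for the first bytes of an embedded object pulled out of a .ppt.
        byte[] embedded = Arrays.copyOf(OLE2_MAGIC, 16);
        System.out.println(sniff(embedded));
    }
}
```

If the bytes of the embedded object report OLE2 rather than ZIP, the xlsx is most likely sitting inside the wrapper's Package entry, as the [OlePres000, Ole, CompObj, Package] listing in the error message suggests.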
For now, if you're stuck on an affected POI version, the code you'll want is something like this (largely taken from the unit test verifying support!):
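import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// "data" is one of the embedded objects; handleWorkbook(...) is your own processing method.
POIFSFileSystem fs = new POIFSFileSystem(data.getInputStream());
if (fs.getRoot().hasEntry("Package")) {
    // OLE2 wrapper around an OOXML file: the xlsx bytes live in the "Package" entry.
    DocumentInputStream dis = new DocumentInputStream((DocumentEntry) fs.getRoot().getEntry("Package"));
    try (OPCPackage pkg = OPCPackage.open(dis);
         XSSFWorkbook wb = new XSSFWorkbook(pkg)) {
        handleWorkbook(wb);
    }
} else {
    // No "Package" entry: treat it as a regular OLE2-based xls workbook.
    try (HSSFWorkbook wb = new HSSFWorkbook(fs)) {
        handleWorkbook(wb);
    }
}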