How to automatically detect corrupted Excel files generated with ArcGIS 10.6/xlswriter/SpatiaLite using Python 2.7

huangapple go评论67阅读模式
英文:

How to automatically detect corrupted Excel files generated with ArcGIS 10.6/xlswriter/SpatiaLite using Python 2.7

问题

我在ArcGIS 10.6中创建了一个Python工具箱,用于从SpatiaLite数据库中生成Excel报告,使用了xlsxwriter库。我使用的是Excel 10和Python 2.7。

Excel文件包括几个查询结果和照片。大多数情况下都正常工作,但对于某些要素,会出现以下打开错误消息:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<logFileName>error012280_01.xml</logFileName>
<summary>Errors were detected in file 'C:\Test\Data\ExcelReports\R90GGH.xlsx'</summary>
<removedRecords>
<removedRecord>Removed Records: Named range from /xl/workbook.xml part (Workbook)</removedRecord>
</removedRecords>
</recoveryLog>

Excel会提示恢复数据,确实在接受了一些数据丢失后打开文件,保存修复后的文件后,一切都正常。

我没有发现工作正常和不正常的报告数据之间有任何区别。

我尝试了将文件保存为zip然后解压缩并检查XML文件的方法,但我在那里没有看到错误(尽管很难检测,因为XML文件的密度很高)。

我曾认为这与打印区域的定义或重复文件或分页符有关,所以我尝试注释掉这些行,但错误仍然存在。

我可以接受保存错误的报告,但问题是找出哪些是错误的,因为有300多个报告。

我使用了在这个网站上找到的脚本,使用xlrd自动打开并检查错误,但根据脚本,所有Excel文件都没有问题。

def test_book(filename):
    try:
        book = open_workbook(filename)
        try:
            sheet = book.sheet_by_index(0)
            b6 = sheet.cell_value(rowx=5, colx=1) #b6是内部代码
            print("'" + str(sheet.name) + "',")
            return True
        
        except XLRDError:
            return False
            print("error")
    except Exception as e:
        return False
        print("error")

我想找到错误,但我的问题是:如何自动检测坏的Excel文件,因为手动检查我的300个文件花费了一个多小时,而这些报告是周期性的。

我之前在GIS Stack Exchange上提出过这个问题,但他们认为我的问题与GIS没有直接关系。

编辑:这是workbook.xml的内容。

英文:

I created a Python toolbox in ArcGIS 10.6 to produce Excel reports out of a SpatiaLite DB, using the library xlsxwriter. I am using Excel 10 and Python 2.7.

The Excel file includes the results of several queries and photos too. It works fine most of the times but for certain features produces the following opening error message:

&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;yes&quot;?&gt;
&lt;recoveryLog xmlns=&quot;http://schemas.openxmlformats.org/spreadsheetml/2006/main&quot;&gt;&lt;logFileName&gt;error012280_01.xml&lt;/logFileName&gt;&lt;summary&gt;Errors were detected in file &#39;C:\Test\Data\ExcelReports\R90GGH.xlsx&#39;&lt;/summary&gt;&lt;removedRecords&gt;&lt;removedRecord&gt;Removed Records: Named range from /xl/workbook.xml part (Workbook)&lt;/removedRecord&gt;&lt;/removedRecords&gt;&lt;/recoveryLog&gt;

Excel prompts to recover data, and indeed it opens the file after accepting the loss of some data, and after I saved the repaired file, apparently all is in order.

I have not found any difference among the data of the reports that work and of the reports that don't.

I used the trick of saving as zip and then unzip and check the XMLs but I do not see errors there (although it would be hard to detect because the density of the XML)

I thought it was something to do with the definition of the print area or repeating files or the page breaks so I tested commenting those lines, but the error persisted.

I can live with saving the bad reports but the problem is to find which one are the bad ones, since they are over 300.

I used a script which I found in this site to automatically opening and checking for errors using xlrd, but according to the script all the Excel files are OK.

def test_book(filename):
    try:
        book = open_workbook(filename)
        try:
            sheet = book.sheet_by_index(0)
            b6 = sheet.cell_value(rowx=5, colx=1) #b6 an internal code
            print &quot;&#39;&quot;+ str(sheet.name) + &quot;&#39;,&quot;
            return True
        
        except XLRDError:
            return False
            print &quot;error&quot;
    except Exception as e:
        return False
        print &quot;error&quot;

I would like to find the error, but my question is: how can I detect the bad Excel files automatically, since to check my 300 files manually I spent more than an hour and these reports are periodic.

I had previously asked in GIS Stack Exchange but they considered my question not was in direct relation with GIS.

EDIT:
This is workbook.xml

&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;yes&quot;?&gt;
&lt;workbook xmlns=&quot;http://schemas.openxmlformats.org/spreadsheetml/2006/main&quot; xmlns:r=&quot;http://schemas.openxmlformats.org/officeDocument/2006/relationships&quot;&gt;&lt;fileVersion appName=&quot;xl&quot; lastEdited=&quot;4&quot; lowestEdited=&quot;4&quot; rupBuild=&quot;4505&quot;/&gt;&lt;workbookPr defaultThemeVersion=&quot;124226&quot;/&gt;&lt;bookViews&gt;&lt;workbookView xWindow=&quot;240&quot; yWindow=&quot;15&quot; windowWidth=&quot;16095&quot; windowHeight=&quot;9660&quot;/&gt;&lt;/bookViews&gt;&lt;sheets&gt;&lt;sheet name=&quot;R7MB22&quot; sheetId=&quot;1&quot; r:id=&quot;rId1&quot;/&gt;&lt;/sheets&gt;&lt;definedNames&gt;&lt;definedName name=&quot;_xlnm.Print_Area&quot; localSheetId=&quot;0&quot;&gt;R7MB22!$A$1:$H$137&lt;/definedName&gt;&lt;definedName name=&quot;_xlnm.Print_Titles&quot; localSheetId=&quot;0&quot;&gt;R7MB22!$1:$2&lt;/definedName&gt;&lt;/definedNames&gt;&lt;calcPr calcId=&quot;124519&quot; fullCalcOnLoad=&quot;1&quot;/&gt;&lt;/workbook&gt;

答案1

得分: 2

这个问题会很难确定一个文件是否损坏,除非使用Excel读取文件,因为XlsxWriter生成的文件很少包含像XML错误那样可以被第三方工具捕获的内容。一般来说,这些问题通常是由于类似非法公式之类的原因引起的。在这种情况下,看起来是与命名区域有关,但这可能与实际的命名区域、自动筛选范围或打印区域有关。

更好的方法是找出是什么导致了文件损坏,并避免执行这样的操作。XlsxWriter会努力避免创建Excel认为有错误的文件,但这种情况有时仍会发生。

如果你在XlsxWriter的bug报告中附上一个损坏的文件,我会查看它。

更新:bug报告在这里:#963。这个问题是由一个工作表名称看起来像RC范围并且应该用引号引起来引起的。

英文:

It is going to be difficult to figure out if a file is corrupted without using Excel to read the file since the files produces by XlsxWriter rarely include something like an XML error that could be caught be a third party tool. In general these types of issues occur due to something like a very invalid formula. It this case it looks like an issue with names ranges but that could related to actual names ranges or autofilter ranges or print areas.

A better approach would be to figure out what is causing the corruption and avoid doing that. XlsxWriter works hard to avoid creating a file that Excel considers has an error but it can happen.

If you open a bug report against XlsxWriter with one of the corrupt files I will have a look at it.

Update: The bug report is here: #963. The issue is caused by a worksheet name that looks like a RC range and should be quoted.

huangapple
  • 本文由 发表于 2023年3月7日 21:34:24
  • 转载请务必保留本文链接:https://go.coder-hub.com/75662658.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定