从Excel电子表格中手动读取插入的文本

huangapple go评论87阅读模式
英文:

Read manually inserted text from an Excel spreadsheet

问题

我有一个包含我大学课程表的 .xlsx 文件。我正在开发一个利用这个课程表的应用程序。但我不想把课程表的内容从这个 Excel 电子表格复制到更“适合程序员”的格式中,相反,我想编写一个程序/脚本,可以解析这个 .xlsx 表格,并自动将其转换为我需要的格式(例如,转换为代码中的某些对象)。

对于阅读电子表格中的“常规”单元格,我没有问题。然而,与其简单地在每个单元格中放入一个文本条目,创建这个课程表文件的人手动地将某些单元格“分割”成了“子单元格”,并在每个子单元格中手动插入了一些文本。看起来像这样:从Excel电子表格中手动读取插入的文本

> 这应该如何解释: 学生被分成了4组。在 15.20-16.50 时间段,只有1号和2号组将上特定课程。在 17.00-18.30 时间段,只有1号、3号和4号组将上那门课程。

正如大家所看到的,这些“单元格”并不是真正的单元格 - 它们似乎是手动创建(“分割”)的,就像图片中选择的文本一样。

问题是:我如何找到并读取这些“单元格”(手动插入的文本部分),就像图片中的情况一样(最好还知道它们的位置,这样我不仅可以读取存在的课程,还可以知道它们何时开始(时间在电子表格的最左边显示))

我尝试过使用 Pythonxlrd 模块,但无法实现我所需的功能。我在 JavaApache POI 也没有成功 - 我就是找不到如何读取这些文本条目。无论使用什么语言,使用什么库和方法,只要能解决这个问题都可以。

英文:

I have a .xlsx file containing my university's timetable. I'm working on an application that makes use of the timetable. But I don't want to "copy" the timetable contents from this Excel spreadsheet into a more "programmer-friendly" format, instead, I'd like to write a program/script that would parse this .xlsx table and automatically convert it in the format I need (e.g. in some objects in code).

There's no trouble for me in reading "normal" cells of the spreadsheet. However, instead of simply putting 1 text entry in each cell, the person who created this timetable file manually "divided" some cells into "subcells" and manually inserted some text in each of them. This looks like: 从Excel电子表格中手动读取插入的文本

> How should this be interpreted: students are divided into 4 groups. At 15.20-16.50 only groups number 1 and 2 will have a specific class. At 17.00-18.30 only groups 1, 3, and 4 will have that class.

As one can see, these "cells" are not real cells — they seem to have been created ("divided") manually, just like the text that is selected in the picture.

The question is: how do I find and read such "cells" (manually inserted text components) like in the picture (preferably also knowing their position so that I can not only read what classes exist, but also when they start (time is stated in the very left of the spreadsheet))?

I tried using Python's xlrd module but haven't been able to achieve what I need. Neither have I had any success with Java's Apache POI — I just can't find how to read such text entries. Solutions on both languages, no matter what libraries and approaches are used, will be fine for me.

答案1

得分: 2

Both xls and xslx are proprietary formats. Microsoft went out of their way to explain in court that xslx is open, but unfortunately not one of the judges involved knew anything significant about computer science and the lawyers knew it, so don't get distracted by their misleading case. XSLX has the option for the 'vendor' to add a block of 'custom binary blobs' and the vast majority of the excel features that aren't the most common, lowest level stuff imaginable are in these binary blobs.

Microsoft has never released any documentation on these binary blobs, nor any library that can parse them.

Therefore, Apache POI, xlrd, and all other libraries to read XLS files that do not explicitly require Excel to be installed and running on the computer that's running the 'library' (kind of a tricky thing to pull if you have e.g. a linux-based server!) are based on reverse engineering it, and it's a horrible format. Literally - look up what Apache POI's 'HSSF' stands for. Officially nothing, but etymologically, that H is for Horrible. (Horrible Spread Sheet Format - HSSF).

That's the long way around of saying: Sorry - you probably can't. And it's not the fault of POI or xlrd, it's on microsoft. It is not appropriate to use such a closed, proprietary and undocumented format to transfer anything meaningful.

The error lies in whatever process led to the situation that you're now stuck trying to write software to parse a weird excel file.

If you must, most likely a script running within excel can untangle this mess and write out a CSV file or JSON or something in a documented format. Alternatively, you can write something in C#, but it would just be farming out the work to excel, so, you still would not be able to port this code to other platforms.

Apache POI does give you the option of a more low-level approach where you can read the binary blobs. You can attempt to reverse engineer whatever's going on in that 'cell-with-a-table-in-it' yourself, but as neither the xlrd team nor the Apache POI team has bothered, and at least the POI team is on record as saying the format seems to be designed to be obfuscated - that sounds like a job that will take you many, many weeks.

That gets me back to the solution I advised earlier: Unless spending many weeks building an incredibly fragile stack that requires a full-blown windows and an excel license is the lesser evil compared to a simple change in human behaviour (unlikely), the fix lies in addressing the process (as in, address that excel is used to transfer this info, or at least make the excel sheet much simpler than this thing), and not by finding out how to read this mess in java or python.

英文:

Both xls and xslx are proprietary formats. Microsoft went out of their way to explain in court that xslx is open, but unfortunately not one of the judges involved knew anything significant about computer science and the lawyers knew it, so don't get distracted by their misleading case. XSLX has the option for the 'vendor' to add a block of 'custom binary blobs' and the vast majority of the excel features that aren't the most common, lowest level stuff imaginable are in these binary blobs. No doubt this 'stick a text table object into a single cell' thing that's going on here is exactly like that.

Microsoft has never released any documentation on these binary blobs, nor any library that can parse them.

Therefore, Apache POI, xlrd, and all other libraries to read XLS files that do not explicitly require Excel to be installed and running on the computer that's running the 'library' (kind of a tricky thing to pull if you have e.g. a linux-based server!) are based on reverse engineering it, and it's a horrible format. Literally - look up what Apache POI's 'HSSF' stands for. Officially nothing, but etymologically, that H is for Horrible. (Horrible Spread Sheet Format - HSSF).

That's the long way around of saying: Sorry - you probably can't. And it's not the fault of POI or xlrd, it's on microsoft. It is not appropriate to use such a closed, proprietary and undocumented format to transfer anything meaningful. The error lies in whatever process led to the situation that you're now stuck trying to write software to parse a weird excel file.

If you must, most likely a script running within excel can untangle this mess and write out a csv file or json or something in a documented format. Alternatively, you can write something in C#, but it would just be farming out the work to excel, so, you still would not be able to port this code to other platforms.

Apache POI does give you the option of a more low-level approach where you can read the binary blobs. You can attempt to reverse engineer whatever's going on in that 'cell-with-a-table-in-it' yourself, but as neither the xlrd team nor the Apache POI team has bothered, and at least the POI team is on record as saying the format seems to be designed to be obfuscated - that sounds like a job that will take you many, many weeks.

That gets me back to the solution I advised earlier: Unless spending many weeks building an incredibly fragile stack that requires a full blown windows and an excel license is the lesser evil compared to a simple change in human behaviour (unlikely), the fix lies in addressing the process (as in, address that excel is used to transfer this info, or at least make the excel sheet muuuch simpler than this thing), and not by finding out how to read this mess in java or python.

huangapple
  • 本文由 发表于 2020年10月22日 05:15:33
  • 转载请务必保留本文链接:https://go.coder-hub.com/64471849.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定