2020-10-22 05:15:33

Read manually inserted text from an Excel spreadsheet

Question

I have a .xlsx file containing my university's timetable, and I'm working on an application that uses it. I don't want to copy the timetable contents by hand from the Excel spreadsheet into a more "programmer-friendly" format. Instead, I'd like to write a program or script that parses the .xlsx file and automatically converts it into the format I need (e.g. into objects in code).

I have no trouble reading the "normal" cells of the spreadsheet. However, instead of simply putting one text entry in each cell, the person who created this timetable file manually "divided" some cells into "subcells" and manually inserted text into each of them. This looks like (screenshot omitted):

> How this should be interpreted: students are divided into 4 groups. At 15.20-16.50, only groups 1 and 2 have a specific class. At 17.00-18.30, only groups 1, 3, and 4 have that class.

As one can see, these "cells" are not real cells: they seem to have been created ("divided") manually, just like the text that is selected in the picture.

The question is: how do I find and read such "cells" (manually inserted text components) like the ones in the picture? Preferably I would also get their position, so that I can read not only which classes exist but also when they start (the time is stated at the far left of the spreadsheet).

I tried Python's xlrd module but couldn't achieve what I need. I had no success with Java's Apache POI either; I just can't find how to read such text entries. Solutions in either language, whatever libraries and approaches they use, are fine with me.

Answer 1

Score: 2

Both xls and xlsx are proprietary formats. Microsoft went out of its way to argue in court that xlsx is open, but unfortunately none of the judges involved knew anything significant about computer science, and the lawyers knew it, so don't be distracted by their misleading case. XLSX lets the 'vendor' add blocks of 'custom binary blobs', and the vast majority of Excel features beyond the most common, lowest-level ones live in these binary blobs.

Microsoft has never released any documentation on these binary blobs, nor any library that can parse them.

Therefore, Apache POI, xlrd, and every other library for reading Excel files that does not explicitly require Excel to be installed and running on the same computer (a tricky thing to pull off on, e.g., a Linux-based server!) is based on reverse engineering, and it's a horrible format. Literally: look up what the 'HSSF' in Apache POI stands for. Officially nothing, but etymologically that H is for Horrible (Horrible Spread Sheet Format, HSSF).

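That's the long way around of saying: sorry, you probably can't. And that's not the fault of POI or xlrd; it's on Microsoft. It is not appropriate to use such a closed, proprietary, and undocumented format to transfer anything meaningful. The error lies in whatever process led to the situation where you're now stuck trying to write software to parse a weird Excel file.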

If you must, a script running within Excel can most likely untangle this mess and write out a CSV or JSON file, or something else in a documented format. Alternatively, you can write something in C#, but that would just farm the work out to Excel, so you still could not port the code to other platforms.
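
As an illustration of the kind of documented format such a script could emit, here is a minimal Python sketch that models the two time slots described in the question and serializes them to JSON. The `ClassSlot` type, its field names, and the `"some class"` title are hypothetical, chosen only for this example; the question does not name the class or prescribe a structure.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class ClassSlot:
    """One timetable entry (hypothetical structure, not taken from the spreadsheet)."""
    start: str         # slot start time as written in the sheet, e.g. "15.20"
    end: str           # slot end time, e.g. "16.50"
    groups: List[int]  # student groups (1 to 4) that attend in this slot
    title: str         # class name; a placeholder value is used below


def slots_to_json(slots: List[ClassSlot]) -> str:
    """Serialize the slots to JSON text that any language can read back."""
    return json.dumps([asdict(slot) for slot in slots], indent=2)


# The two slots from the question's example.
slots = [
    ClassSlot("15.20", "16.50", [1, 2], "some class"),
    ClassSlot("17.00", "18.30", [1, 3, 4], "some class"),
]
timetable_json = slots_to_json(slots)
print(timetable_json)
```

Once the data is in a form like this, the application no longer depends on how the spreadsheet was laid out.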

Apache POI does give you the option of a lower-level approach in which you read the binary blobs yourself. You can attempt to reverse engineer whatever is going on in that 'cell-with-a-table-in-it', but since neither the xlrd team nor the Apache POI team has bothered, and at least the POI team is on record as saying the format seems designed to be obfuscated, that sounds like a job that will take you many, many weeks.
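
To give an idea of what a low-level look can involve without Excel or POI: an .xlsx file is a ZIP package of XML parts, so its contents can be inspected with standard tools. The following is a minimal Python sketch, assuming (the question does not confirm this) that the hand-made "subcells" are text boxes or shapes stored in the package's drawing parts under `xl/drawings/`. `drawing_texts` is a hypothetical helper name. It reports each shape's text together with the 0-based row and column of the cell under the shape's top-left corner; if the text lives somewhere else in the package, it simply finds nothing.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by SpreadsheetML drawing parts.
NS = {
    "xdr": "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}
A_P = "{%s}p" % NS["a"]  # text paragraph element
A_T = "{%s}t" % NS["a"]  # text run content element


def drawing_texts(xlsx):
    """Yield (part_name, row, col, text) for each cell-anchored shape with text.

    `xlsx` is a path or a binary file object. Row and column are the 0-based
    indices of the cell under the shape's top-left corner.
    """
    with zipfile.ZipFile(xlsx) as package:
        for name in package.namelist():
            if not (name.startswith("xl/drawings/drawing") and name.endswith(".xml")):
                continue
            root = ET.fromstring(package.read(name))
            for anchor in root:
                start = anchor.find("xdr:from", NS)
                if start is None:  # shape is not anchored to a cell
                    continue
                row = int(start.findtext("xdr:row", namespaces=NS))
                col = int(start.findtext("xdr:col", namespaces=NS))
                # Join the runs of each paragraph, one paragraph per line.
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(A_T))
                    for p in anchor.iter(A_P)
                ]
                text = "\n".join(paragraphs).strip()
                if text:
                    yield name, row, col, text


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py path/to/file.xlsx
    for part, row, col, text in drawing_texts(sys.argv[1]):
        print(part, row, col, repr(text))
```

The row index could then be matched against the time column at the left of the sheet. Mapping a drawing part back to its worksheet would additionally need the relationship parts under `xl/worksheets/_rels/`, which this sketch leaves out.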

That brings me back to the solution I advised earlier: unless spending many weeks building an incredibly fragile stack that requires a full-blown Windows installation and an Excel license is the lesser evil compared to a simple change in human behaviour (unlikely), the fix lies in addressing the process (that is, address the fact that Excel is used to transfer this information, or at least make the Excel sheet much simpler than this thing), not in finding out how to read this mess in Java or Python.
