Create table instance not connected to a document with python-docx?


Question

This is a cross-post of Issue #1190 in the upstream python-docx repository.

This is how I currently create a table in a docx document:

import docx

doc = docx.Document()
tab = doc.add_table(rows=300, cols=5)

So the table object is "connected" to its parent document.

Is there a way to create a table object that is not connected to a parent object and add it to the document later? Something like this?

doc = docx.Document()
tab_one = docx.Table(rows=300, cols=5)
tab_two = docx.Table(rows=100, cols=3)

doc.add_table(tab_two)
doc.add_table(tab_one)

Or (as a workaround), can I move a table object from one document instance to another, like this?

doc_temp = docx.Document()
tab = doc_temp.add_table(rows=300, cols=5)

doc_main = docx.Document()
doc_main.add_table(tab)

The background to my question is that I create multiple tables with 100-300 rows each and apply formatting to every cell. That means a lot of row and cell iteration, which costs a lot of time.

Doing this with multiprocessing, where each worker has its own table object, would speed up the process. I would like to create multiple tables in parallel and add them to the document in a later step.

Clearly, multiprocessing isn't the complete or the best solution to a performance problem. Such a problem isn't solved just by throwing more CPU resources at it; the algorithm itself should be optimized. For me, multiprocessing is just one step on the way to a better solution.

EDIT: As a real-world example, here you can see how I create docx tables based on pandas.DataFrame objects.

Answer 1

Score: 1

No, not at the python-docx API level. The document-element objects you manipulate via the API maintain a reference into the overall connected graph of lxml objects representing a package part (like document, header, etc.) and allow you to edit that part, in situ. They are not free-standing components that you can compose and re-compose.

That said, lxml does allow what you're talking about, inserting a copy of a pre-formed XML subtree into an arbitrary position in an existing XML tree. So if you're willing to dig down, you could use python-docx to form the initial table subtree, then use lxml to copy it and insert it into other places in the XML.

python-docx can also help you locate where those go, at least getting you close. For example, you could find a marker paragraph, get its <p> element using paragraph._p, then use lxml to insert a <tbl> subtree as a sibling before or after the <p> element.

This is not for the faint of heart, as it requires digging into the code and understanding lxml, but it's certainly a viable approach that has worked for others, and it is a lot easier than accomplishing the same thing starting from a raw .docx file. A big part of the challenge is unmarshalling the .docx file into parts, each with its own XML tree; python-docx handles all of that for you, as well as re-marshalling the package (saving) once you've made your changes.

huangapple
  • Posted on May 17, 2023, 16:30:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76270036.html