Question

I am using python-docx to clean up multiple Word documents.

The following code is supposed to find a paragraph that contains only one word, where that word is in the list provided (case-insensitive), and then remove all the text after it from the document. However, it is not working, and I can't figure out why.

import os
import re
from docx import Document

def remove_end(document):
    for paragraph in document.paragraphs:
        text = paragraph.text.strip().lower()
        words_to_check = ['references', 'acknowledgements', 'note', 'notes']
        if text in words_to_check and len(paragraph.text.split()) <= 2:
            if paragraph not in document.paragraphs:
                continue
            idx = document.paragraphs.index(paragraph)
            del document.paragraphs[idx+1:]
            break
    document.save(file_path)

Answer 1

Score: 1

The method for removing paragraphs (del document.paragraphs[idx+1:]) is not correct. document.paragraphs returns a new Python list each time it is accessed, so deleting items from that list does not change the underlying document. As explained in https://github.com/python-openxml/python-docx/issues/33, removal of paragraphs is not officially implemented or supported in python-docx. The GitHub issue mentions a workaround, with the warning that it may not work if the paragraph you want to delete contains linked items such as a picture, hyperlink or chart. However, I tested the workaround on a document with a hyperlink in a deleted paragraph (that seems the most likely item in a reference list, for example) and it worked fine.
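To see why deleting from document.paragraphs has no effect, a stand-in may help. The sketch below does not use python-docx at all: FakeDocument is a hypothetical class, written on the assumption that it mimics how Document.paragraphs builds a fresh list on every access.

```python
class FakeDocument:
    """Hypothetical stand-in for docx.Document (not the real class)."""

    def __init__(self, texts):
        # Stands in for the XML body that python-docx wraps.
        self._body = list(texts)

    @property
    def paragraphs(self):
        # Assumption: like python-docx, return a NEW list on each access.
        return list(self._body)


doc = FakeDocument(["Intro", "References", "Smith 2020"])

# This deletes from a temporary list that is thrown away immediately.
del doc.paragraphs[2:]

remaining = doc.paragraphs
print(remaining)  # the underlying body is unchanged
```

The slice deletion succeeds without an error, which is why the original code fails silently rather than raising an exception.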

MCVE:

from docx import Document

words_to_check = ['references', 'acknowledgements', 'note', 'notes']

# delete_paragraph from https://github.com/python-openxml/python-docx/issues/33
def delete_paragraph(paragraph):
    p = paragraph._element
    p.getparent().remove(p)
    paragraph._p = paragraph._element = None

def remove_end(document):
    del_par = False
    for paragraph in document.paragraphs:
        if not del_par:
            text = paragraph.text.strip().lower()
            if text in words_to_check:
                # Keyword found: delete it and switch to deletion mode.
                delete_paragraph(paragraph)
                del_par = True
        else:
            # Every paragraph after the keyword is deleted as well.
            delete_paragraph(paragraph)

    document.save('newdoc.docx')

document = Document('mydoc.docx')
remove_end(document)

This keeps the loop running after a keyword is found, but instead of checking the text it deletes the keyword paragraph and then every remaining paragraph.
