Python code to remove the end of a document is not working

Question

I am using python-docx to clean up multiple Word documents.

The following code is supposed to find a paragraph that contains only a single word from the list provided (case-insensitive) and then remove all the text that follows it from the document. However, it is not working, and I can't figure out why.

import os
import re
from docx import Document

def remove_end(document):
    for paragraph in document.paragraphs:
        text = paragraph.text.strip().lower()
        words_to_check = ['references', 'acknowledgements', 'note', 'notes']
        if text in words_to_check and len(paragraph.text.split()) <= 2:
            if paragraph not in document.paragraphs:
                continue
            idx = document.paragraphs.index(paragraph)
            del document.paragraphs[idx+1:]
            break
    document.save(file_path)

Answer 1

Score: 1

The way the paragraphs are removed (del document.paragraphs[idx+1:]) is not correct: document.paragraphs builds a new Python list each time it is accessed, so deleting items from that list does not change the document. As explained in https://github.com/python-openxml/python-docx/issues/33, removing paragraphs is not officially implemented/supported in python-docx. The GitHub issue offers a workaround, with the warning that it may not work if the paragraph you want to delete contains linked items such as a picture, hyperlink or chart. However, I tested the workaround on a document with a hyperlink in a deleted paragraph (the most likely such item in a reference list, for example) and it worked fine.
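To illustrate why deleting from document.paragraphs has no effect, here is a minimal sketch that uses a hypothetical stand-in class (FakeDocument is an assumption made for illustration, not part of python-docx). It only mimics the one relevant behaviour: the paragraphs property returns a freshly built list on every access.

```python
class FakeDocument:
    """Hypothetical stand-in for a python-docx Document (not the real class)."""

    def __init__(self, texts):
        # Plays the role of the underlying XML body of the document
        self._body = list(texts)

    @property
    def paragraphs(self):
        # A new list is built on every access, so callers only get a copy
        return list(self._body)


doc = FakeDocument(["Introduction", "References", "Smith 2020"])

# This deletes items from a temporary list that is discarded right away
del doc.paragraphs[1:]

# The "document" still holds all three paragraphs
remaining = doc.paragraphs
```

The same reasoning suggests that any change has to be made on the underlying XML elements rather than on the list itself, which is what the workaround from the GitHub issue does.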

MCVE:

from docx import Document

words_to_check = ['references', 'acknowledgements', 'note', 'notes']

# delete_paragraph from https://github.com/python-openxml/python-docx/issues/33
def delete_paragraph(paragraph):
    p = paragraph._element
    p.getparent().remove(p)
    paragraph._p = paragraph._element = None

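def remove_end(document):
    del_par = False  # becomes True once a keyword paragraph has been found
    # document.paragraphs returns a new list, so removing elements while looping is safe
    for paragraph in document.paragraphs:
        if not del_par:
            text = paragraph.text.strip().lower()
            if text in words_to_check:
                # Note: this removes the keyword paragraph itself as well
                delete_paragraph(paragraph)
                del_par = True
        else:
            delete_paragraph(paragraph)

    document.save('newdoc.docx')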

document = Document('mydoc.docx')
remove_end(document)

This keeps the loop running after a keyword is found, but from that point on it no longer checks the text: it deletes the keyword paragraph and every paragraph that follows it.

huangapple
  • Published on April 13, 2023, 23:36:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76007313.html