Python code to remove the end of document is not working
Question
I am using python-docx to clean up multiple Word documents.
The following code is supposed to find a paragraph that consists of a single word from the provided list (matched case-insensitively) and then remove all the text that follows it from the document. However, it is not working, and I can't figure out why.
    import os
    import re
    from docx import Document

    def remove_end(document):
        for paragraph in document.paragraphs:
            text = paragraph.text.strip().lower()
            words_to_check = ['references', 'acknowledgements', 'note', 'notes']
            if text in words_to_check and len(paragraph.text.split()) <= 2:
                if paragraph not in document.paragraphs:
                    continue
                idx = document.paragraphs.index(paragraph)
                del document.paragraphs[idx+1:]
                break
        document.save(file_path)
Answer 1
Score: 1
The method for removing paragraphs (del document.paragraphs[idx+1:]) is not correct. As explained in https://github.com/python-openxml/python-docx/issues/33, removal of paragraphs is not officially implemented or supported in python-docx. The GitHub issue mentions a workaround, with the warning that it may not work if the paragraph you want to delete contains linked items such as a picture, hyperlink or chart. However, I tested the workaround on a document with a hyperlink in a deleted paragraph (the most likely such item in a reference list, for example) and it worked fine.
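To see why deleting a slice of document.paragraphs has no effect, it can help to model the behaviour with a toy class. The sketch below is not python-docx code: ToyDocument is a hypothetical stand-in built on the standard library's xml.etree.ElementTree, and it assumes (as appears to be the case in python-docx) that the paragraphs property builds a fresh list from the underlying XML on every access.

```python
import xml.etree.ElementTree as ET


class ToyDocument:
    """Hypothetical stand-in for a document backed by an XML tree."""

    def __init__(self, texts):
        self.body = ET.Element("body")
        for t in texts:
            ET.SubElement(self.body, "p").text = t

    @property
    def paragraphs(self):
        # A new list is built on every access; it is only a snapshot of the tree.
        return list(self.body.findall("p"))


doc = ToyDocument(["Intro", "References", "Smith 2020", "Jones 2021"])

# Deleting a slice of the temporary list changes nothing in the tree.
del doc.paragraphs[2:]
count_after_del = len(doc.paragraphs)

# Removing the elements from their parent does change the tree.
for p in doc.paragraphs[2:]:
    doc.body.remove(p)
count_after_remove = len(doc.paragraphs)

print(count_after_del, count_after_remove)
```

Under that assumption, the del statement only shortens a throwaway list, which is why the workaround removes the XML element from its parent instead.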
MCVE:
    from docx import Document

    words_to_check = ['references', 'acknowledgements', 'note', 'notes']

    # delete_paragraph from https://github.com/python-openxml/python-docx/issues/33
    def delete_paragraph(paragraph):
        p = paragraph._element
        p.getparent().remove(p)
        paragraph._p = paragraph._element = None

    def remove_end(document):
        del_par = False
        for paragraph in document.paragraphs:
            if not del_par:
                text = paragraph.text.strip().lower()
                if text in words_to_check:
                    delete_paragraph(paragraph)
                    del_par = True
            else:
                delete_paragraph(paragraph)
        document.save('newdoc.docx')

    document = Document('mydoc.docx')
    remove_end(document)
Once a keyword is found, the loop keeps running, but instead of checking the text it deletes every remaining paragraph. Note that the keyword paragraph itself is deleted as well.
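If the matching rule needs to follow the question more literally (a paragraph consisting of exactly one listed word, case-insensitive), the search can be separated from the deletion. The helper below is a minimal sketch working on plain strings; find_cutoff and WORDS_TO_CHECK are hypothetical names, not part of python-docx.

```python
WORDS_TO_CHECK = {"references", "acknowledgements", "note", "notes"}


def find_cutoff(texts, keywords=WORDS_TO_CHECK):
    """Return the index of the first single-word keyword paragraph, or None."""
    for i, raw in enumerate(texts):
        words = raw.split()
        # Exactly one word, compared case-insensitively against the list.
        if len(words) == 1 and words[0].lower() in keywords:
            return i
    return None


texts = ["Introduction", "See the notes below.", "  References ", "Smith 2020"]
cutoff = find_cutoff(texts)
kept = texts if cutoff is None else texts[:cutoff]
print(cutoff, kept)
```

With python-docx, the input would presumably be [p.text for p in document.paragraphs], and delete_paragraph would then be called on each paragraph from the cutoff index onward (or from cutoff + 1 to keep the heading itself).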