How to extract pictures pasted as enhanced metafiles from Word documents in Python?
Question
I want to automatically extract images from a Word document. The images are Excel charts pasted into the document as pictures (enhanced metafile).
After some quick research, I tried the following method:
import docx2txt as d2t

def extract_images_from_docx(path_to_file, images_folder, get_text=False):
    text = d2t.process(path_to_file, images_folder)
    if get_text:
        return text

path_to_file = './Report.docx'
images_folder = './Img/'
extract_images_from_docx(path_to_file, images_folder, False)
However, this method does NOT work. I am almost sure this is due to the format of the pictures: when I pasted a normal PNG image into a Word document, I was able to extract it with the code above.
I have also tried converting the document to PDF and extracting the images from there, with NO result:
from docx2pdf import convert

convert('./Report.docx')
convert('./Report.docx', './Report.pdf')

import fitz  # PyMuPDF

def get_pixmaps_in_pdf(pdf_filename):
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.page_count):
        for image in doc.get_page_images(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps

def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.save(f'{i}.png')  # Might want to come up with a better name

pixmaps = get_pixmaps_in_pdf('./Report.pdf')
write_pixmaps_to_pngs(pixmaps)
So, does anyone know of a way to automatically extract Excel charts pasted as enhanced metafiles into a Word document?
Thank you in advance for your help!
Answer 1
Score: 1
The crazy thing is that .docx files are secretly .zip files. I've been able to extract images from a .docx using the zipfile module. The images should live in the word/media directory of the extracted archive. I don't know whether the enhanced metafiles live there too, but that's my best guess. Here's something to get you started:
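import os
import zipfile

input_docx = 'NAME_OF_DOCX'  # placeholder: document name without the .docx extension

# A .docx is a zip archive, so unpack all of it into a folder
with zipfile.ZipFile(f'{input_docx}.docx') as archive:
    archive.extractall('extracted_docx')

# Embedded pictures are expected under word/media
media_dir = os.path.join('extracted_docx', 'word', 'media')
for file in os.listdir(media_dir):
    if file.endswith('.emf'):
        # do something with the file
        pass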
(untested, but should work)
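If you only want the pictures, you may not need to unpack the whole archive. Here is a minimal sketch that copies matching entries straight out of the zip. It rests on the same guess as above, namely that the pasted charts are stored under word/media with an .emf extension. The function name extract_media and its suffixes parameter are made up for this example and are not part of any library.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_media(docx_path, out_dir, suffixes=(".emf",)):
    """Copy entries under word/media/ with a matching extension to out_dir.

    Hypothetical helper; assumes the pictures live in word/media.
    Returns the list of files that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Zip entry names always use forward slashes
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            entry = PurePosixPath(name)
            if entry.suffix.lower() not in suffixes:
                continue
            # Write the raw bytes under the entry's own file name
            target = out_dir / entry.name
            target.write_bytes(archive.read(name))
            written.append(target)
    return written
```

With the paths from the question, the call would be extract_media('./Report.docx', './Img/'). Passing suffixes=('.emf', '.png') would pick up ordinary PNG pictures as well. Note that this only gives you the raw .emf files; turning them into PNGs is a separate step that this sketch does not cover.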