March 31, 2023, 22:35:03

How to extract pictures pasted as enhanced metafiles from Word documents in Python?

Question

I want to automatically extract images from a Word document. The images are Excel charts pasted into the document as pictures (enhanced metafile).

After some quick research, I tried the following method:

import docx2txt as d2t


def extract_images_from_docx(path_to_file, images_folder, get_text=False):
    # docx2txt writes the images it finds into images_folder and returns the text
    text = d2t.process(path_to_file, images_folder)

    if get_text:
        return text


path_to_file = './Report.docx'
images_folder = './Img/'

extract_images_from_docx(path_to_file, images_folder, False)

However, this method does not work. I am almost sure this is due to the format of the pictures: when I pasted a normal PNG image into a Word document, the code above extracted it without problems.
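One way to check that hunch is to look inside the file itself: a .docx is a ZIP archive, and embedded pictures are normally stored under word/media/. The sketch below counts the stored media files by extension, so it shows whether the charts ended up as .emf rather than .png. The helper name `count_media_types` is hypothetical, and the sketch assumes the pictures are stored under word/media/ as usual.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_media_types(docx_path):
    """Count the files stored under word/media/ by lowercase extension.

    Hypothetical helper for this sketch; it only reads the archive listing.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return Counter(
            PurePosixPath(name).suffix.lower()
            for name in archive.namelist()
            # Keep real files under word/media/, skip directory entries
            if name.startswith('word/media/') and not name.endswith('/')
        )


# Example call with the file from the question:
# print(count_media_types('./Report.docx'))
```

If the counter reports .emf entries and no .png entries, the charts are in the archive and the extraction tool is simply not picking them up.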

I have also tried converting the document to PDF and extracting the images from there, also with no result:

import fitz  # PyMuPDF
from docx2pdf import convert

# Convert the Word document to PDF
convert('./Report.docx', './Report.pdf')


def get_pixmaps_in_pdf(pdf_filename):
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.page_count):
        for image in doc.get_page_images(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps


def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.save(f'{i}.png')  # Might want to come up with a better name


pixmaps = get_pixmaps_in_pdf('./Report.pdf')
write_pixmaps_to_pngs(pixmaps)

So, does anyone know of a way to automatically extract Excel charts pasted as enhanced metafiles into a Word document?

Thank you in advance for your help!

Answer 1

Score: 1

The crazy thing is that .docx files are actually secretly .zip files, and I've been able to successfully extract images from a .docx using the zipfile module. The images should live in the word/media directory of the extracted .zip. I dunno if the enhanced metafiles live there too, but that's my best guess. Here's something to get you started:

import os
import zipfile

# Placeholder: name of the document, without the .docx extension
input_docx = 'NAME_OF_DOCX'

# Unpack the whole archive into the extracted_docx folder
with zipfile.ZipFile(f'{input_docx}.docx') as archive:
    archive.extractall('extracted_docx')

# os.path.join keeps the path portable across operating systems
for file in os.listdir(os.path.join('extracted_docx', 'word', 'media')):
    if file.endswith('.emf'):
        # do something with the file
        pass

(untested, but should work)
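If unpacking the whole archive is more than you need, the same idea can be narrowed to copy out only the media files of interest. This is a minimal sketch, assuming the metafiles do sit under word/media with an .emf extension (the same guess as above); `extract_media` and its parameters are hypothetical names, not part of any library.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_media(docx_path, out_dir, extensions=('.emf',)):
    """Copy files under word/media/ with a matching extension into out_dir.

    Hypothetical helper for this sketch. Returns the list of paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    wanted = tuple(ext.lower() for ext in extensions)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Only real files under word/media/ are candidates
            if not name.startswith('word/media/') or name.endswith('/'):
                continue
            member = PurePosixPath(name)
            if member.suffix.lower() not in wanted:
                continue
            # Flatten the archive path: keep only the file name
            target = out_dir / member.name
            target.write_bytes(archive.read(name))
            written.append(target)
    return written


# Example call with the paths from the question:
# extract_media('./Report.docx', './Img/')
```

The files written this way are still metafiles. Converting them to PNG is a separate step that depends on the platform and the tools available.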
