How to convert docx to XHTML

Question

I am trying to find a solution to convert a docx file to XHTML.

I found XDocReport, which looks good, but I have run into some issues (and I am new to XDocReport).

According to their documentation on GitHub, I should be able to convert with this code:

    String source = args[0];
    String dest = args[1];
    // 1) Create DOCX-to-XHTML options so the registry can select the right converter
    Options options = Options.getFrom(DocumentKind.DOCX).to(ConverterTypeTo.XHTML);
    // 2) Get the converter from the registry
    IConverter converter = ConverterRegistry.getRegistry().getConverter(options);
    // 3) Convert DOCX to (x)html; try-with-resources closes both streams
    try (InputStream in = new FileInputStream(new File(source));
         OutputStream out = new FileOutputStream(new File(dest))) {
        converter.convert(in, out, options);
    } catch (XDocConverterException | IOException e) {
        e.printStackTrace();
    }

I am using these dependencies (tried different versions, like 2.0.2, 2.0.0, 1.0.6):

    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.xdocreport.document.docx</artifactId>
        <version>2.0.2</version>
    </dependency>
    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.xdocreport.template.freemarker</artifactId>
        <version>2.0.2</version>
    </dependency>
    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.xdocreport.converter.docx.xwpf</artifactId>
        <version>2.0.2</version>
    </dependency>

My issues:

  • The images are missing
  • The background color is missing (every page has a non-white background color, and I need to carry that over too)

How can I handle these issues?
(Or how can I convert docx to xhtml using Docx4j with formats/numbering/images?)

Answer 1

Score: 2

To convert `*.docx` to `XHTML` using `XDocReport` with `apache poi`'s `XWPFDocument` as the source, you need `XHTMLOptions`. These options can hold an `ImageManager`, which sets the directory where images extracted from the `XWPFDocument` are written. `XHTMLConverter` then performs the conversion.

Complete example:

    import java.io.*;

    // needed jars: xdocreport-2.0.2.jar
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
    import fr.opensagres.poi.xwpf.converter.core.ImageManager;

    // needed jars: all apache poi dependencies
    import org.apache.poi.xwpf.usermodel.*;

    public class DOCXToXHTMLXDocReport {

        public static void main(String[] args) throws Exception {
            String docPath = "./WordDocument.docx";
            String root = "./";
            String htmlPath = root + "WordDocument.html";

            // try-with-resources closes the document and the output stream
            try (XWPFDocument document = new XWPFDocument(new FileInputStream(docPath));
                 FileOutputStream out = new FileOutputStream(htmlPath)) {
                // extracted images are written to <root>/images and referenced from the HTML
                XHTMLOptions options = XHTMLOptions.create()
                        .setImageManager(new ImageManager(new File(root), "images"));
                XHTMLConverter.getInstance().convert(document, out, options);
            }
        }
    }

This handles images properly.

However, `XDocReport` does not yet handle the page background color of an `XWPFDocument`. It extracts and applies paragraph background colors, but not page background colors.
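Since the converter skips the page background, one possible workaround is to post-process its output. The sketch below is not part of `XDocReport` or the answer above. It uses only the JDK, and the class name `PageBackground` and both method names are hypothetical. It assumes the page color is stored as the `w:color` attribute of a `w:background` element in `word/document.xml`, and that the generated XHTML is UTF-8 and contains a `</head>` tag where a style rule can be added.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not an XDocReport API.
class PageBackground {

    // WordprocessingML namespace used inside word/document.xml
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the six-digit hex page color (e.g. "FFCC00"), or null if none is set. */
    static String readPageColor(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!"word/document.xml".equals(e.getName())) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
                // Copy the entry first so the parser cannot close the zip stream.
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(zip.readAllBytes()));
                NodeList backgrounds = xml.getElementsByTagNameNS(W_NS, "background");
                if (backgrounds.getLength() == 0) {
                    return null;
                }
                String color = ((Element) backgrounds.item(0)).getAttributeNS(W_NS, "color");
                // Ignore missing values and non-hex values such as "auto".
                return color.matches("[0-9A-Fa-f]{6}") ? color : null;
            }
        }
        return null;
    }

    /** Inserts a body background rule just before the first </head>, if there is one. */
    static String addBodyBackground(String xhtml, String hexColor) {
        int head = xhtml.indexOf("</head>");
        if (hexColor == null || head < 0) {
            return xhtml; // nothing to add, or no <head> to add it to
        }
        String css = "<style type=\"text/css\">body{background-color:#" + hexColor + ";}</style>";
        return xhtml.substring(0, head) + css + xhtml.substring(head);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: PageBackground <input.docx> <converted.html>");
            return;
        }
        String color;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            color = readPageColor(in);
        }
        Path html = Paths.get(args[1]);
        String xhtml = new String(Files.readAllBytes(html), StandardCharsets.UTF_8);
        Files.write(html, addBodyBackground(xhtml, color).getBytes(StandardCharsets.UTF_8));
    }
}
```

Run it after the conversion, passing the original `.docx` and the generated HTML file. If the document has no page background or the output has no `</head>`, the HTML file is rewritten unchanged.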

huangapple
  • Posted on September 11, 2020 at 16:06:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63843154.html