Replacing text in XWPFParagraph without changing format of the docx file
Question
I am developing a font converter app that converts Unicode text to Krutidev/Shree Lipi (Marathi/Hindi) font text. The original docx file contains formatted words (color, font, text size, hyperlinks, etc.).
I want the final docx to keep the same formatting as the original after the words are converted from Unicode to the other font.
PFA.
Here is my code:
try {
    fileInputStream = new FileInputStream("StartDoc.docx");
    document = new XWPFDocument(fileInputStream);
    Converter data = new Converter();
    for (XWPFParagraph p : document.getParagraphs()) {
        for (XWPFRun r : p.getRuns()) {
            String string2 = r.getText(0);
            if (string2 == null) {
                continue; // this run has no text at position 0
            }
            // Strings are immutable, so the converted text has to come from the
            // return value (assumes uniToShree returns the converted String).
            string2 = data.uniToShree(string2);
            r.setText(string2, 0);
        }
    }
    // Write the document to the file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");
}
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}
Answer 1
Score: 1
Thank you for the question.
I worked with POI some years ago, though on Excel workbooks, but I'll still try to help you reach the root cause of your error.
The Java runtime already gives you good debugging information through the exception itself!
A good first step to narrow down the error is not to replace the exception's own message with a fixed string of your own.
Try printing the result of e.getLocalizedMessage() or e.getMessage() and see what you get.
Getting the stack trace with the printStackTrace method is also often useful to pinpoint where your error lies!
Share what those method calls print so that we can help you debug the issue further.
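As a minimal sketch of that advice, using only the standard library: the class name `ExceptionReport`, the helper `describe`, and the file name `missing-file.docx` are made up for illustration and are not part of your code.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

final class ExceptionReport {

    /** Returns the exception type, its message, and the stack trace as one string. */
    static String describe(Exception e) {
        StringWriter buffer = new StringWriter();
        // printStackTrace writes e.toString() first, then one line per stack frame.
        e.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }

    public static void main(String[] args) {
        // "missing-file.docx" is a hypothetical path, used only to provoke an IOException.
        try (FileInputStream in = new FileInputStream("missing-file.docx")) {
            in.read();
        } catch (IOException e) {
            // Report what actually went wrong instead of a fixed, possibly misleading text.
            System.err.println("Message: " + e.getMessage());
            System.err.println(describe(e));
        }
    }
}
```

In your catch block, a plain `e.printStackTrace()` would already be enough; the helper only shows how to capture the same text as a string, for example for a log file.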
[EDIT 1:]
So it seems you are able to process the file correctly as far as the font conversion of the data goes, but you are not able to reconstruct the formatting of the original data in the converted file.
(So the message "We had an error while reading the Word Doc" is misleading whenever it gets printed.)
Now, there are 2 elements to a Word document:
- Content
- Structure or Schema
You are able to convert the data because you are working only on the content of your doc files.
To retain the formatting of that content, your solution also needs to be aware of how the doc files store formatting, and take care of it.
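To make the content/structure split concrete, here is a small standard-library sketch. It assumes a hand-written WordprocessingML fragment (`SAMPLE_RUN`) rather than a real file, and `String::toUpperCase` stands in for your actual converter; the class and method names are made up for illustration. In the XML of a run, the formatting sits in `w:rPr` and the text sits in `w:t`, so changing only `w:t` leaves the formatting alone.

```java
import java.io.StringReader;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RunTextDemo {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written sample of one run: bold, red text "hello".
    static final String SAMPLE_RUN =
        "<w:r xmlns:w=\"" + W_NS + "\">"
      + "<w:rPr><w:b/><w:color w:val=\"FF0000\"/></w:rPr>"
      + "<w:t>hello</w:t>"
      + "</w:r>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    /** Rewrites the text of every w:t element; w:rPr (the formatting) is never touched. */
    static void replaceText(Document doc, UnaryOperator<String> convert) {
        NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            Element t = (Element) texts.item(i);
            t.setTextContent(convert.apply(t.getTextContent()));
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE_RUN);
        replaceText(doc, String::toUpperCase); // placeholder for the real font conversion
        Element color = (Element) doc.getElementsByTagNameNS(W_NS, "color").item(0);
        System.out.println("text  = " + doc.getElementsByTagNameNS(W_NS, "t").item(0).getTextContent());
        System.out.println("color = " + color.getAttributeNS(W_NS, "val"));
    }
}
```

As far as I can tell, this is also what happens when you replace the text of an existing run in POI instead of creating a new run: the run keeps its properties. The font name is one of those properties, though, so it would still point at the original Unicode font unless you change it as well.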
The .docx format used by MS Word follows a particular set of schemas (WordprocessingML, part of Office Open XML) that define the rules of formatting. These schemas are identified by Microsoft's XML namespaces [1].
You can obtain the XML (or HTML) form of the doc file quite easily (see the steps in [1] or the code in [2]) and even apply different schemas, or possibly your own schema definitions based on the ones those namespaces provide. There are two ways to go about it. Programmatically, you need to get familiar with XML, XSL and XSLT (w3schools [3] is a good starting point), but this route is no less complex than writing your own version of MS Word. Alternatively, you can use MS Word's built-in tools, as shown in [1].
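If you just want to look at the raw XML, a .docx file is a ZIP package whose main part is stored under `word/document.xml`, so the standard library is enough. This is only a sketch: the class name `DocxXmlDump` is made up, it needs Java 9 or later for `readAllBytes`, and it assumes the part is UTF-8 encoded, which is typical for files written by Word but not guaranteed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class DocxXmlDump {

    /** Returns the raw WordprocessingML of the main document part of a .docx package. */
    static String readDocumentXml(Path docx) {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("no word/document.xml in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Assumption: the part is UTF-8 encoded.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the path of a .docx file, e.g. StartDoc.docx from the question.
        System.out.println(readDocumentXml(Paths.get(args[0])));
    }
}
```

Comparing the output for StartDoc.docx and Output.docx should show quickly whether the run properties survived the conversion.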
[1] https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2] https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3] https://www.w3schools.com/xml/
My answer gives you only a cursory overview of how to achieve what you want. Depending on your inclination and the time you have available, use your discretion before you choose one path over the other.
Hope it helps!