How can I get the page number of a string that is part of a paragraph in a Word document using Java?

Question

I am using the aspose-words library's node collection to read a Word document node by node. If a node is a paragraph and its length is more than 8000 characters, I divide it into sub-strings. Most of the time these paragraphs exceed one page. How can I get the page number of each sub-string that I divided from the paragraph?

Document document = new Document(filePath);
LayoutCollector layoutCollector = new LayoutCollector(document);
NodeCollection paragraphNodes = document.getChildNodes(NodeType.PARAGRAPH, true);
for (Node node : paragraphNodes)
{
    if (node.getNodeType() == NodeType.PARAGRAPH) {
        int pageNumber = layoutCollector.getStartPageIndex(node);
        List<String> subStrings = new ArrayList<>();
        Paragraph paragraph = (Paragraph) node;
        String text = paragraph.getText();
        if (text.length() > 8000) {
            // divideParagraph(String text) takes a string and returns an ArrayList<String>
            // in which each string is shorter than 8000 characters
            subStrings.addAll(divideParagraph(text));
        }
        for (String subString : subStrings)
        {
            System.out.println("need the page number of each substring");
        }
    }
}

Currently I am able to get the start page and end page of a specific paragraph using layoutCollector, but I am looking for the page number of each sub-string I divide from the paragraph, because I have to report it in a log. Is there any other library with which I can read all elements (paragraphs, tables, WordArt, etc.) while keeping track of the page number and line number where each one starts?

Answer 1

Score: 2

As you know, MS Word documents have no concept of a page or a line because of their flow nature. Consumer applications build the document layout on the fly, and Aspose.Words does the same using its own layout engine. The LayoutCollector and LayoutEnumerator classes provide limited access to document layout information.

To determine which page a part of a paragraph is located on, loop through the paragraph's child nodes and use LayoutCollector.getStartPageIndex or LayoutCollector.getEndPageIndex. Note, however, that even the "smallest" text node, a Run, can span several pages. So if you need to determine exactly where a paragraph flows onto the next page, you have to split the paragraph's content into smaller pieces, for example into words.
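Once the paragraph has been split into word-sized pieces and a page index is known for each piece, finding the pages covered by a sub-string is plain bookkeeping over character offsets. The following is a minimal sketch of that bookkeeping only, not part of Aspose.Words: the class SubstringPages, the Piece type and the pagesFor method are hypothetical names, and the page numbers are assumed to have been obtained beforehand (for example from LayoutCollector.getStartPageIndex on each Run).

```java
import java.util.ArrayList;
import java.util.List;

public class SubstringPages {
    /** One word-sized piece of a paragraph and the page it was laid out on (hypothetical type). */
    static final class Piece {
        final String text;
        final int page;
        Piece(String text, int page) { this.text = text; this.page = page; }
    }

    /**
     * Returns {firstPage, lastPage} for the characters [start, end) of the
     * paragraph text formed by concatenating the pieces in order.
     * Both values are -1 if no piece overlaps the range.
     */
    static int[] pagesFor(List<Piece> pieces, int start, int end) {
        int first = -1;
        int last = -1;
        int offset = 0; // offset of the current piece within the paragraph text
        for (Piece p : pieces) {
            int pieceEnd = offset + p.text.length();
            // The piece overlaps [start, end) if it begins before end and ends after start.
            if (offset < end && pieceEnd > start) {
                if (first == -1) first = p.page;
                last = p.page;
            }
            offset = pieceEnd;
        }
        return new int[] { first, last };
    }

    public static void main(String[] args) {
        // Page numbers here are made up for illustration.
        List<Piece> pieces = new ArrayList<>();
        pieces.add(new Piece("Hello ", 1));
        pieces.add(new Piece("wide ", 1));
        pieces.add(new Piece("world", 2));
        int[] pages = pagesFor(pieces, 6, 16); // the sub-string "wide world"
        System.out.println(pages[0] + "-" + pages[1]); // prints 1-2
    }
}
```

The smaller the pieces, the more precise the result: with one piece per word, a reported page is off only if a single word itself breaks across pages.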

For example, the following code demonstrates a basic technique for reading document content line by line:

Document doc = new Document("C:\\Temp\\in.docx");

// Split all Run nodes in the document so that each contains no more than one word.
Iterable<Run> runs = doc.getChildNodes(NodeType.RUN, true);
for (Run r : runs)
{
    Run current = r;
    while (current.getText().indexOf(' ') >= 0)
        current = SplitRun(current, current.getText().indexOf(' ') + 1);
}

// Wrap all runs in the document with bookmarks to make it possible to work with LayoutCollector and LayoutEnumerator.
runs = doc.getChildNodes(NodeType.RUN, true);

ArrayList<String> tmpBookmarks = new ArrayList<String>();
int bkIndex = 0;
for (Run r : runs)
{
    // LayoutCollector and LayoutEnumerator do not work with nodes in headers/footers or in textboxes.
    if (r.getAncestor(NodeType.HEADER_FOOTER) != null || r.getAncestor(NodeType.SHAPE) != null)
        continue;

    String bkName = "r" + bkIndex;
    r.getParentNode().insertBefore(new BookmarkStart(doc, bkName), r);
    r.getParentNode().insertAfter(new BookmarkEnd(doc, bkName), r);

    tmpBookmarks.add(bkName);
    bkIndex++;
}

// Now we can use the collector and enumerator to get the runs on each line of the MS Word document.
LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);

Object currentLine = null;
int pageIndex = -1;
for (String bkName : tmpBookmarks)
{
    Bookmark bk = doc.getRange().getBookmarks().get(bkName);

    // Move from the bookmark's layout entity up to the line that contains it.
    enumerator.setCurrent(collector.getEntity(bk.getBookmarkStart()));
    while (enumerator.getType() != LayoutEntityType.LINE)
        enumerator.moveParent();

    if (currentLine != enumerator.getCurrent())
    {
        currentLine = enumerator.getCurrent();

        System.out.println();
        if (pageIndex != enumerator.getPageIndex())
        {
            pageIndex = enumerator.getPageIndex();
            System.out.println("-------=========Start Of Page " + pageIndex + "=========-------");
        }
        System.out.println("-------=========Start Of Line=========-------");
    }

    Node node = bk.getBookmarkStart().getNextSibling();
    if (node != null && node.getNodeType() == NodeType.RUN)
        System.out.print(((Run)node).getText());
}

// Helper method: splits a run at the given position and returns the new run that holds the remaining text.
private static Run SplitRun(Run run, int position)
{
    Run afterRun = (Run)run.deepClone(true);
    run.getParentNode().insertAfter(afterRun, run);
    afterRun.setText(run.getText().substring(position));
    run.setText(run.getText().substring(0, position));
    return afterRun;
}
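
To relate this back to the sub-strings in the question, it helps if the splitting step also records where each sub-string starts in the paragraph text. The sketch below is one possible shape for such a helper, written in plain Java. The question does not show how divideParagraph is implemented, so ParagraphChunker, Chunk and divide are hypothetical names and this is an assumption, not the asker's code. It breaks just after a space where it can, which is the same boundary SplitRun uses above, so chunk boundaries tend to line up with the word-sized runs.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphChunker {
    /** A chunk of paragraph text plus its start offset in the original text (hypothetical type). */
    static final class Chunk {
        final int start;
        final String text;
        Chunk(int start, String text) { this.start = start; this.text = text; }
    }

    /**
     * Splits text into chunks of at most maxLen characters (maxLen must be at least 1),
     * breaking just after a space when one exists inside the current window.
     */
    static List<Chunk> divide(String text, int maxLen) {
        List<Chunk> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxLen, text.length());
            if (end < text.length()) {
                // Prefer to break just after the last space inside the window.
                int space = text.lastIndexOf(' ', end - 1);
                if (space >= start) end = space + 1;
            }
            chunks.add(new Chunk(start, text.substring(start, end)));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // A tiny limit is used here only to keep the demonstration readable.
        for (Chunk c : divide("one two three four", 8)) {
            System.out.println(c.start + ": \"" + c.text + "\"");
        }
    }
}
```

Each chunk's start offset and length give a character range, and the page of the word-sized run that contains that offset can then be reported in the log for the sub-string.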

huangapple
  • Published on June 9, 2023, 13:23:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76437423.html