Getting the lines of each paragraph of a docx with Apache POI

Question

I am using the Apache POI library for my app, specifically POIshadow-all (version 3.17), to read a Word document.
I am successfully extracting every paragraph, as follows:

[Image: example of paragraph-by-paragraph extraction]

What I actually need is to extract every line, as follows:

[Image: example of the desired line-by-line extraction]

The code to extract every paragraph is this:

    try {
        // Open the .docx file and load it as an XWPF document
        val fis = FileInputStream(path.path + "/" + document)
        val xdoc = XWPFDocument(OPCPackage.open(fis))

        val paragraphList: MutableList<XWPFParagraph> = xdoc.paragraphs

        // `paragraph` is not defined in this excerpt
        val newParagraph = paragraph.createRun()

        ...

        for (par in paragraphList) {
            // par.text holds the text of the whole paragraph
            val currentParagraph = par.text
            Log.i("TAG", "current: $currentParagraph")

            ...

The variable currentParagraph holds a whole paragraph, as expected. However, I need a variable, say currentLine, that holds a single line.

I've researched this issue on Stack Overflow and other sites. I've found some proposals, but none of them works for me.
I also tried getting the data through ctr and XWPFRun, without any success.

I would be grateful for any recommendation on how to proceed.

Thanks in advance for your help.

Answer 1

Score: 3

The document does not store how many lines a given paragraph has, because that depends on how the document is rendered or viewed. Think of a Word document: with a larger font size a paragraph takes up more lines, and with a smaller font size it takes up fewer. The number of lines in a paragraph is therefore not a fixed property of the document; it varies with the rendering.

However, if your application has a hard requirement for an estimate, you can program some logic such as “start a new line after X (a constant) characters, rounded to the end of the word”. The result would still change with screen size, font size, zoom level and so on. My suggestion is therefore to design your application so that it does not explicitly measure the number of lines in a paragraph, but instead counts words or characters and uses that count as the yardstick for inserting a line break when absolutely necessary.
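As a rough illustration of the character-count rule, here is a minimal Kotlin sketch. The function name wrapParagraph and the parameter maxChars are made up for this example and are not part of Apache POI. It follows one reading of the rule: a line is closed at the last word boundary that still fits within maxChars, and a single word longer than maxChars is left on a line of its own. The resulting lines are only an estimate and will not match what Word displays.

```kotlin
// Hypothetical helper (not an Apache POI API): greedy word wrap
// with a fixed character budget per line.
fun wrapParagraph(text: String, maxChars: Int): List<String> {
    require(maxChars > 0) { "maxChars must be positive" }
    val lines = mutableListOf<String>()
    val current = StringBuilder()
    for (word in text.trim().split(Regex("\\s+"))) {
        if (word.isEmpty()) continue
        // Close the current line when the next word would exceed the budget.
        if (current.isNotEmpty() && current.length + 1 + word.length > maxChars) {
            lines.add(current.toString())
            current.setLength(0)
        }
        if (current.isNotEmpty()) current.append(' ')
        current.append(word)
    }
    if (current.isNotEmpty()) lines.add(current.toString())
    return lines
}

fun main() {
    val currentParagraph = "The quick brown fox jumps over the lazy dog"
    // Each element plays the role of the currentLine variable from the question.
    for (currentLine in wrapParagraph(currentParagraph, 15)) {
        println(currentLine)
    }
}
```

In the loop from the question, par.text could be passed in place of the sample string. The value 15 is arbitrary; a realistic budget would have to be tuned to the font and the width of the view.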

Another potential approach is to split the paragraph into sentences at sentence-ending punctuation, e.g. “start a new sentence after each ‘?’, ‘!’ or ‘.’ character within a paragraph”. This too can get rather tricky, depending on the structure of certain sentences (abbreviations and decimal numbers also contain a ‘.’).
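The punctuation rule can be sketched in Kotlin along the same lines. The name splitSentences is invented for this example. It splits at whitespace that follows ‘.’, ‘!’ or ‘?’ and makes no attempt to recognise abbreviations, which is exactly where this approach gets tricky.

```kotlin
// Hypothetical helper (not an Apache POI API): split a paragraph at
// whitespace that directly follows '.', '!' or '?'.
fun splitSentences(paragraph: String): List<String> =
    paragraph
        .split(Regex("(?<=[.!?])\\s+"))
        .map { it.trim() }
        .filter { it.isNotEmpty() }

fun main() {
    // A simple case: three sentences come back as three elements.
    println(splitSentences("It works. Does it? Yes!"))
    // A tricky case: the abbreviation "Dr." is treated as a sentence end.
    println(splitSentences("Dr. Smith arrived."))
}
```

The second call shows the limitation: "Dr." is returned as a sentence of its own, so real text would need extra rules or a proper sentence tokenizer.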

In short, Apache POI has no “out of the box” way to detect the lines in a given paragraph. If it is absolutely necessary, you would have to program your own logic (perhaps using one of the approaches outlined above).

huangapple
  • Published on September 14, 2020, 05:39:42
  • Please keep this link when reposting: https://go.coder-hub.com/63875832.html