How to split docx / pdf file pages in Laravel 9

Question

I have an online library application that allows users to search the content of books by keywords.

To add a book to my application, I want to be able to upload the docx file of the book, then extract its content page by page, and add it to the database separately.
I couldn't find any specific library in Laravel/PHP to do that for me, so I tried to unzip the Docx file and look for a specific separator to detect pages based on that, but apparently, there is no such thing!

I even tried to read pages from the PDF file, but it returns weird characters that are not readable! I should add that the books are not written in English. They are written in Persian and Arabic, which are right-to-left languages (please see the screenshot below). Here is the link to a real data sample in PDF format: https://www.dropbox.com/s/g939q5oot14ib1w/test-for-stack-over-flow.pdf?dl=0.

Please let me know if you have any idea how I can add each page of the book to the database separately (as a new row in the pages table!), no matter what format I have to use: docx, pdf, txt, etc.!

[Screenshot]

Answer 1

Score: 1

OK, I am told that if the Word file is output to PDF, then importing that PDF into MS Word is likely to give the best results, but PHP itself is not an MS Office user workstation.

Comments and installation feedback from @Kmaj are at the end.

The closest I know of is far from perfect, but it can possibly be improved. A command-line export of the PDF to plain text looks like this for the first page, where a number of issues with plain text become apparent. The major hurdle is that word order can often be wrong at strategic input/output points.

 = symbolic characters like emojis

Using Poppler's pdftotext gives this output for page 1, and the 1 can be driven by a loop from 1 to 100+.

pdftotext -nopgbrk -layout -f 1 -l 1 SO-75712922.pdf

‫انتشارات تست‪ /‬شمارۀ ‪53‬‬




‫کتاب تست‬
‫تست این کتاب‪‬‬



‫تست اصحاب‬
‫وصی و فرستادۀ ی‪‬‬

‫(خداوند در زمین تمکینش دهد)‬




‫مترجم‬
‫گروه مترجمان انتشارات تست ‪‬‬

Without -layout the output should be more compact, but page 2 NEEDS -layout so that lines are not split. Documents like this are therefore very hard to predict, so you may need to produce output both with -layout and with an alternative, then use some automated selector to pick the best of 2 or 3 runs.
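One possible shape for such a selector is sketched below. The rule it uses (prefer the candidate with the fewest non-blank lines, on the guess that unwanted line splits inflate the line count) is only an assumption for illustration, not something tested on this book, and the names count_nonblank and pick_fewest_lines are hypothetical.

```shell
# Sketch of a "best of N runs" selector for pdftotext output.
# ASSUMPTION: fewer non-blank lines means fewer unwanted line splits.
# count_nonblank and pick_fewest_lines are hypothetical helper names.

# Print the number of lines in FILE that contain a non-space character.
count_nonblank() {
  awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Print the candidate file with the fewest non-blank lines.
# Empty candidates are skipped; on a tie the first one listed wins.
# The body runs in a subshell so its variables do not leak.
pick_fewest_lines() (
  best=""
  best_count=""
  for f in "$@"; do
    c=$(count_nonblank "$f")
    [ "$c" -gt 0 ] || continue
    if [ -z "$best" ] || [ "$c" -lt "$best_count" ]; then
      best=$f
      best_count=$c
    fi
  done
  printf '%s\n' "$best"
)

# Example use for page 2 (file names are placeholders):
#   pdftotext -nopgbrk -layout -f 2 -l 2 book.pdf p2-layout.txt
#   pdftotext -nopgbrk         -f 2 -l 2 book.pdf p2-plain.txt
#   pdftotext -nopgbrk -raw    -f 2 -l 2 book.pdf p2-raw.txt
#   pick_fewest_lines p2-layout.txt p2-plain.txt p2-raw.txt
```

A line count says nothing about word order, so a check like this can only narrow the choice. Pages with tables will probably still need a look by eye.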

Page 1 without -layout looks about 90% usable.

Page 2 with -layout has the lines aligned at the same height (but in reversed order for tabular content).

For Windows users, the 64-bit pre-compiled binaries (currently latest = 2023-01) are at https://github.com/oschwartz10612/poppler-windows

From @Kmaj, for Mac users:

> I first installed poppler on my mac using "brew install poppler". Then, I could successfully run the command you mentioned above in a loop and generate each txt page separately with a dynamic name.

> For those who have the same issue, I ran the following command in my mac terminal:

 for i in {1..3};
 do pdftotext -nopgbrk -layout -f ${i} -l ${i} ~/mypath-to-pdf/pdf-name.pdf ~/mypath-to-output-folder/output-name-${i}.txt;
 done
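The fixed range 1..3 in that loop can be replaced by the real page count, which Poppler's pdfinfo prints on its "Pages:" line. A minimal sketch follows. The function name split_pdf_pages and the page-N.txt naming are made up for illustration, and it assumes pdfinfo and pdftotext are on the PATH.

```shell
# Sketch: write every page of a PDF to its own text file with Poppler.
# split_pdf_pages and the page-N.txt naming are hypothetical choices.
# The body runs in a subshell so its variables do not leak.
split_pdf_pages() (
  pdf=$1
  outdir=$2
  # pdfinfo prints a line such as "Pages:          12"; keep the number.
  pages=$(pdfinfo "$pdf" | awk '/^Pages:/ { print $2 }')
  if [ -z "$pages" ]; then
    echo "could not read page count from $pdf" >&2
    return 1
  fi
  mkdir -p "$outdir"
  i=1
  while [ "$i" -le "$pages" ]; do
    # -f and -l set the first and last page, so each call emits one page.
    pdftotext -nopgbrk -layout -f "$i" -l "$i" "$pdf" "$outdir/page-$i.txt"
    i=$((i + 1))
  done
  # Report how many page files were written.
  echo "$pages"
)

# Example use (paths are placeholders):
#   split_pdf_pages ~/mypath-to-pdf/pdf-name.pdf ~/mypath-to-output-folder
```

Each page-N.txt can then be read by the application and stored as one row in the pages table, with N as the page number.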

huangapple
  • Posted on 2023-03-12 19:53:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75712922.html