2023-01-09 02:26:24

Extracting text from PDF in Arabic language and getting backwards text

Question

I've made a Python script that takes a PDF with phrases and extracts them into an Anki deck. The script worked great with non-Semitic languages, but when someone asked me to make a similar deck in Arabic I ran into a problem. Arabic is written from right to left, but the sentence I get is written from left to right. It must be something about the extraction phase that needs something extra to work with Semitic languages; I just don't know what it is.

Example:

The actual text:

The text that I got:
sentence = "AR.(ةناشطع ♀) ناشطع نينكلو (ةعئاج تسل ♀) ،اعئاج تسل"

I used PyPDF2 to extract the text and tried arabic-reshaper 2.1.4 and python-bidi to solve this, but to no avail. I also tried reversing the string in various ways, but that also reverses punctuation such as "(".
Any ideas?

Answer 1

Score: 1

I've had some success extracting Arabic text from (born digital) PDFs using pdfplumber. By "some success" I mean that it was a huge pain in the... neck, and didn't end up being accurate enough for my purposes. The pain part was because the extracted text was backwards and it had inserted a space next to every diacritic. Those were fixable — some code is below.

But the accuracy problem was because I was using a PDF of an Arabic novel set in a pretty font where some of the letters are kind of stacked on top of each other. pdfplumber was mostly able to extract which letters were there, but not their order. (Not surprising: this is tough for human students of Arabic as well.) If your source uses a plain font you might have better results.

The text in the sample below should read:
في رأس النهج تبينت مدقة الباب النحاسية التي استقرت فوق الخد
الخشبي الضخم علي هيئة قبضة يد نصف مضمومة. الهدوء

import pdfplumber

# Open the PDF and extract the raw text of the first page
file = 'sample_page.pdf'
pdf = pdfplumber.open(file)
page = pdf.pages[0]
text = page.extract_text()
print(text[:110])
output:
دّ لخا قوف ترّ قتسا يتلا ةيّ ساحنلا بابلا ةقّ دم تنيّ بت جهنلا سأر في لاإ مٌ يّ مخ ءودلها .ةمومضم فصن دي ةض

^ This is backwards, and there are spaces next to all the diacritics

# Reverse text with bidi
from bidi import algorithm
text_rev = algorithm.get_display(text)
print(text_rev[:110])
output:
يف رأس النهج تب ّينت مد ّقة الباب النحاس ّية التي استق ّرت فوق اخل ّد 
اخلشب ّي الضخم عىل هيئة قبضة يد نصف مضم

^ Not backwards anymore, but still the diacritic problem
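
For intuition about why the plain reverse mentioned in the question flips "(" while the bidi step does not, here is a minimal sketch using only the standard library. It assumes a line that is purely right-to-left; `MIRROR` and `visual_to_logical` are hypothetical names for illustration, not part of python-bidi or any other library.

```python
# Minimal sketch: convert a visual-order, purely right-to-left line to logical order.
# MIRROR and visual_to_logical are hypothetical names, not part of any library.
MIRROR = {"(": ")", ")": "(", "[": "]", "]": "[", "{": "}", "}": "{", "<": ">", ">": "<"}


def visual_to_logical(line: str) -> str:
    """Reverse the characters and swap paired brackets so they still open and close correctly."""
    return "".join(MIRROR.get(ch, ch) for ch in reversed(line))


if __name__ == "__main__":
    # Fragment of the sentence from the question, as extracted (visual order).
    extracted = "(ةعئاج تسل ♀) ،اعئاج تسل"
    print(visual_to_logical(extracted))
```

This sketch would also reverse left-to-right runs such as the "AR." prefix or any digits, so for mixed-direction text a real bidi implementation is probably the safer choice.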

# Strip the most common diacritic (shadda); in real use you would need to handle all of them
shadda = chr(0x0651)  # U+0651 ARABIC SHADDA
text_rev_dediac = text_rev.replace(" " + shadda, '')
print(text_rev_dediac[:110])
output:
يف رأس النهج تبينت مدقة الباب النحاسية التي استقرت فوق اخلد 
اخلشبي الضخم عىل هيئة قبضة يد نصف مضمومة. اهلدوء

^ This is right, except where the stacked letters are in the wrong order (for example, the first word is supposed to be في (fy, 'in') but comes out as يف (yf)). You can see that the period (after the word مضمومة) is still in the correct place, though. So this is pretty successful, and might be 100% accurate with an easier font.
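
To cover more than the shadda, one possible approach is a single regular expression over the common Arabic diacritics. This is a sketch under assumptions: it treats U+064B through U+0652 plus U+0670 as the set to remove, it drops one space directly before each mark (as the shadda replacement above does), and `DIACRITIC_WITH_SPACE` and `strip_diacritics` are hypothetical names.

```python
import re

# Arabic harakat and tanwin (U+064B to U+0652) plus superscript alef (U+0670),
# each optionally preceded by the stray space the extraction inserted.
# DIACRITIC_WITH_SPACE and strip_diacritics are hypothetical names.
DIACRITIC_WITH_SPACE = re.compile(" ?[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Drop Arabic diacritics together with the space that precedes them, if any."""
    return DIACRITIC_WITH_SPACE.sub("", text)


if __name__ == "__main__":
    # 'tb' + space + shadda + 'ynt', the pattern seen in the bidi output above
    sample = "\u062A\u0628 \u0651\u064A\u0646\u062A"
    print(strip_diacritics(sample))
```

Like the shadda-only version, this discards the diacritics instead of moving them back onto their letters, which may or may not be acceptable depending on what the cards need to show.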

Good luck!
