Extracting text from an Arabic-language PDF and getting backwards text

Question

I've made a Python script that takes a PDF of phrases and extracts them into an Anki deck. The script worked great with non-Semitic languages, but when someone asked me to make a similar deck in Arabic I ran into a problem. Arabic is written from right to left, but the sentence I get is written from left to right. The extraction phase must need something extra to work with Semitic languages; I just don't know what it is.

Example:

The actual text:
[Arabic source sentence not preserved in this copy]

The text that I got:
sentence = "AR.(ةناشطع ♀) ناشطع نينكلو (ةعئاج تسل ♀) ،اعئاج تسل"

I used PyPDF2 to extract the text and tried arabic-reshaper 2.1.4 and python-bidi to solve this, but to no avail. I also tried reversing the string in various ways, but that also reverses punctuation marks like "(".
Any ideas?

Answer 1

Score: 1

I've had some success extracting Arabic text from (born digital) PDFs using pdfplumber. By "some success" I mean that it was a huge pain in the... neck, and didn't end up being accurate enough for my purposes. The pain part was because the extracted text was backwards and it had inserted a space next to every diacritic. Those were fixable — some code is below.

But the accuracy problem was because I was using a PDF of an Arabic novel set in a pretty font where some of the letters are kind of stacked on top of each other. pdfplumber was mostly able to extract which letters were there, but not in which order. (Not surprising: this is tough for human students of Arabic as well.) If your source uses a plain font you might have better results.
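To tell this kind of ordering error apart from a missing or wrong letter, one option is to compare the extracted words against a hand-typed reference for a sample page. The sketch below is only an illustration: `scrambled_words` is a hypothetical helper, not part of pdfplumber, and it assumes both strings contain the same number of words in the same positions.

```python
def scrambled_words(reference, extracted):
    """Return (reference_word, extracted_word) pairs that contain the same
    letters in a different order. Words are paired by position, so both
    strings are assumed to have the same word count."""
    pairs = zip(reference.split(), extracted.split())
    return [
        (ref, got)
        for ref, got in pairs
        if ref != got and sorted(ref) == sorted(got)
    ]


# Hand-typed, known-good text versus what came out of the extractor
reference = "في رأس النهج"
extracted = "يف رأس النهج"
print(scrambled_words(reference, extracted))
```

A word that shows up in this list has all the right letters and only the order is off, which is the symptom described above for stacked glyphs.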

The text in the sample below should read:
في رأس النهج تبينت مدقة الباب النحاسية التي استقرت فوق الخد
الخشبي الضخم علي هيئة قبضة يد نصف مضمومة. الهدوء

import pdfplumber

file = 'sample_page.pdf'
# Open the PDF and pull the raw text of the first page
with pdfplumber.open(file) as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
print(text[:110])

output:
دّ لخا قوف ترّ قتسا يتلا ةيّ ساحنلا بابلا ةقّ دم تنيّ بت جهنلا سأر في لاإ مٌ يّ مخ ءودلها .ةمومضم فصن دي ةض

^ This is backwards, and there are spaces next to all the diacritics

# Reverse text with bidi
from bidi import algorithm

text_rev = algorithm.get_display(text)
print(text_rev[:110])

output:
يف رأس النهج تب ّينت مد ّقة الباب النحاس ّية التي استق ّرت فوق اخل ّد 
اخلشب ّي الضخم عىل هيئة قبضة يد نصف مضم

^ Not backwards anymore, but still the diacritic problem
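For comparison, here is roughly why the plain string reversal tried in the question mangles parentheses. Reversing a line also moves "(" to where ")" belongs, so paired punctuation has to be swapped, and any Latin letters or digits have to be flipped back. The following is a standard-library sketch under those assumptions only. It is not the Unicode Bidirectional Algorithm that python-bidi implements, and `naive_visual_to_logical` is a made-up name.

```python
import re

# Paired punctuation that must be swapped when a line is reversed,
# otherwise "(" ends up where ")" belongs.
MIRROR = str.maketrans("()[]{}<>", ")(][}{><")

# Runs of ASCII letters and digits; these read left to right even inside
# a right-to-left line, so they are flipped back after the reversal.
LTR_RUN = re.compile(r"[A-Za-z0-9]+")


def naive_visual_to_logical(line):
    """Reverse one line of visually ordered text and repair brackets
    and ASCII runs. Apply it line by line, not to a whole page."""
    flipped = line[::-1].translate(MIRROR)
    return LTR_RUN.sub(lambda m: m.group(0)[::-1], flipped)


sample = "(ةعئاج تسل ♀)"
print(naive_visual_to_logical(sample))
```

For multi-line text, something like `"\n".join(naive_visual_to_logical(l) for l in text.splitlines())` keeps the lines in their original order. For anything beyond simple cases, `get_display` is the safer choice.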

# Strip the most common diacritic; in real use you would need to get all of them
shadda = chr(0x0651)  # Python 3: chr() replaces Python 2's unichr()
text_rev_dediac = text_rev.replace(" " + shadda, '')
print(text_rev_dediac[:110])

output:
يف رأس النهج تبينت مدقة الباب النحاسية التي استقرت فوق اخلد 
اخلشبي الضخم عىل هيئة قبضة يد نصف مضمومة. اهلدوء 

^ This is right, except where the stacked letters are in the wrong order (for example, the first word is supposed to be في (fy 'in') but instead it's يف (yf)). You can see that the period (after the word مضمومة) is still in the correct place, though. So this is pretty successful, and might be 100% accurate with an easier font.
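Extending the shadda-only replace to the other Arabic diacritics could look like the sketch below. It assumes the marks of interest are the common harakat in the range U+064B to U+0652 plus the superscript alef U+0670, and that the extractor always puts the stray space directly before the mark, as in the output above. Like the single-character replace, it drops the marks rather than reattaching them. `strip_spaced_diacritics` is a hypothetical helper name.

```python
import re

# Arabic combining marks: U+064B (fathatan) through U+0652 (sukun),
# plus U+0670 (superscript alef). This range is an assumption; extend it
# if your source uses other marks.
ARABIC_MARKS = "\u064B-\u0652\u0670"

# One stray space followed by one or more marks
SPACED_MARKS = re.compile(f" [{ARABIC_MARKS}]+")


def strip_spaced_diacritics(text):
    """Delete each stray space together with the diacritics after it."""
    return SPACED_MARKS.sub("", text)


print(strip_spaced_diacritics("تب ّينت مد ّقة"))
```

Ordinary spaces between words are untouched, because the pattern only matches a space that is immediately followed by a diacritic.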

Good luck!

huangapple
  • Published on January 9, 2023, 02:26:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75050321.html