英文:
Preserving indentation with Tesseract OCR 4.x
问题
我在使用Tesseract OCR时遇到了困难。
我有一张血液检查图像,其中包含一个带有缩进的表格。尽管Tesseract能够非常好地识别字符,但其结构在最终输出中没有被保留。例如,请查看“Emocromo con formula”(英文翻译:带有公式的血液计数)下面的缩进行。我想保留这个缩进。
我阅读了其他相关的讨论,并找到了选项preserve_interword_spaces=1
。结果稍微好一些,但正如你所看到的,仍然不完美。
有什么建议吗?
更新:
我尝试了Tesseract v5.0,结果还是一样。
代码:
Tesseract版本是4.0.0.20190314
from PIL import Image
import pytesseract
# 保留单词间的空格设置为1,oem = 1是LSTM,
# PSM = 1是具有OSD的自动页面分割 - 方向和脚本检测
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
# 默认配置 = r'-c -l eng+ita'
extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)
print(extracted_text)
# 保存到txt文件
with open("referto.txt", "w") as text_file:
text_file.write(extracted_text)
带比较的结果:
GITHUB:
我创建了一个 GitHub 存储库,如果你想自己尝试。
感谢你的帮助和时间。
英文:
I'm struggling with Tesseract OCR.
I have a blood examination image, it has a table with indentation. Although tesseract recognizes the characters very well, its structure isn't preserved in the final output. For example, look the lines below "Emocromo con formula" (Eng. Translation: blood count with formula) that are indented. I want to preserve that indentation.
I read the other related discussions and I found the option preserve_interword_spaces=1
. The result became slightly better but as you can see, it isn't perfect.
Any suggestions?
Update:
I tried Tesseract v5.0 and the result is the same.
Code:
Tesseract version is 4.0.0.20190314
from PIL import Image
import pytesseract
# Preserve interword spaces is set to 1, oem = 1 is LSTM,
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
# default_config = r'-c -l eng+ita'
extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)
print(extracted_text)
# saving to a txt file
with open("referto.txt", "w") as text_file:
text_file.write(extracted_text)
Result with comparison:
GITHUB:
I have created a GitHub repository if you want to try it yourself.
Thanks for your help and your time
答案1
得分: 18
image_to_data()
函数提供了更多信息。对于每个单词,它会返回其边界矩形。你可以使用它。
Tesseract
会自动将图像分割成块。然后,你可以根据它们的垂直位置对块进行排序,并针对每个块找到字符的平均宽度(这取决于块识别的字体)。然后,对于块中的每个单词,检查它是否靠近前一个单词,如果不是,则相应地添加空格。我使用 pandas
来简化计算,但不必要使用它。不要忘记结果应该使用等宽字体显示。
import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)
# 清除空白
df1 = df[(df.conf!='-1')&(df.text!=' ')&(df.text!='')]
# 垂直排序块
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
curr = df1[df1['block_num']==block]
sel = curr[curr.text.str.len()>3]
char_w = (sel.width/sel.text.str.len()).mean()
prev_par, prev_line, prev_left = 0, 0, 0
text = ''
for ix, ln in curr.iterrows():
# 需要时添加新行
if prev_par != ln['par_num']:
text += '\n'
prev_par = ln['par_num']
prev_line = ln['line_num']
prev_left = 0
elif prev_line != ln['line_num']:
text += '\n'
prev_line = ln['line_num']
prev_left = 0
added = 0 # 应该添加的空格数
if ln['left']/char_w > prev_left + 1:
added = int((ln['left'])/char_w) - prev_left
text += ' ' * added
text += ln['text'] + ' '
prev_left += len(ln['text']) + added + 1
text += '\n'
print(text)
此代码将产生以下输出:
ssseeess+ SERVIZIO SANITARIO REGIONALE Pagina 2 di3
seoeeeees EMILIA-RROMAGNA
©2888 800
©9868 6 006 : pe ‘ ‘ "
©171;ee @@e@ecee Azienda Unita Sanitaria Locale di Modena
Seat se ces Amends Ospedaliero-Universitaria Policlinico di Modena
Dipartimento interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica
Direttore dr. T.Trenti
Ospedale Civile S.Agostino-Estense
S.C. Medicina di Laboratorio
S.S. Patologia Clinica - Corelab
Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015
Responsabile dr.ssa M.Varani
Richiesta (CDA): 49/073914 Data di accettazione: 18/12/2018
Data di check-in: 18/12/2018 10:27:06
Referto del 18/12/2018 16:39:53
Provenienza: D4-cp sassuolo
Sig.
Data di Nascita:
Domicilio:
ANALISI RISULTATO __UNITA'DI MISURA VALORI DI RIFERIMENTO
Glucosio 95 mg/dl (70 - 110 )
Creatinina 1.03 mg/dl ( 0.50 - 1.40 )
eGFR Filtrato glomerulare stimato >60 ml/min Cut-off per rischio di I.R.
7 <60. Il calcolo è€ riferito
Equazione CKD-EPI ad una superfice corporea
Standard (1,73 mq)x In Caso
di etnia afroamericana
moltiplicare per il fattore
1,159.
Colesterolo 212 * mg/dl < 200 v.desiderabile
Trigliceridi 106 mg/dl < 180 v.desiderabile
Bilirubina totale 0.60 mg/dl ( 0.16 - 1.10 )
Bilirubina diretta 0.10 mg/dl ( 0.01 - 0.3 )
GOT - AST 17 U/L (1-37)
GPT - ALT ay U/L (1- 40 )
Gamma-GT 15 U/L (1-55)
Sodio 142 mEq/L ( 136 - 146 )
Potassio 4.3 mEq/L (3.5 - 5.3)
Vitamina B12 342 pg/ml ( 200 - 960 )
TSH 5.47 * ulU/ml (0.35 - 4.94 )
FT4 9.7 pg/ml (7 = 15)
Urine chimico fisico morfologico
u-Colore giallo paglierino
u-Peso specifico 1.012 ( 1.010 - 1.027 )
u-pH 5.5 (5.5 - 6.5)
u-Glucosio assente mg/dl assente
u-Proteine assente mg/dl (0 -10 )
u-Emoglobina assente mg/dl assente
u-Corpi chetonici assente mg/dl assente
u-Bilirubina assente mg/dl assente
u-Urobilinogeno 0.20 mg/dl (0- 1.0 )
sedimento non significativo
Il Laureato:
Dott. CRISTINA ROTA
Per ogni informazione
<details>
<summary>英文:</summary>
`image_to_data()` function provides much more information. For each word it will return it's bounding rectangle. You can use that.
`Tesseract` segments the image automatically to blocks. Then you can sort block by their vertical position and for each block you can find mean character width (that depends on the block's recognized font). Then for each word in the block check if it is close to the previous one, if not add spaces accordingly. I'm using `pandas` to ease on calculations, but it's usage is not necessary. Don't forget that the result should be displayed using monospaced font.
import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)
# clean up blanks
df1 = df[(df.conf!='-1')&(df.text!=' ')&(df.text!='')]
# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
curr = df1[df1['block_num']==block]
sel = curr[curr.text.str.len()>3]
char_w = (sel.width/sel.text.str.len()).mean()
prev_par, prev_line, prev_left = 0, 0, 0
text = ''
for ix, ln in curr.iterrows():
# add new line when necessary
if prev_par != ln['par_num']:
text += '\n'
prev_par = ln['par_num']
prev_line = ln['line_num']
prev_left = 0
elif prev_line != ln['line_num']:
text += '\n'
prev_line = ln['line_num']
prev_left = 0
added = 0 # num of spaces that should be added
if ln['left']/char_w > prev_left + 1:
added = int((ln['left'])/char_w) - prev_left
text += ' ' * added
text += ln['text'] + ' '
prev_left += len(ln['text']) + added + 1
text += '\n'
print(text)
This code will produce following output:
ssseeess+ SERVIZIO SANITARIO REGIONALE Pagina 2 di3
seoeeeees EMILIA-RROMAGNA
©2888 800
©9868 6 006 : pe ‘ ‘ "
«ee @@e@ecee Azienda Unita Sanitaria Locale di Modena
Seat se ces Amends Ospedaliero-Universitaria Policlinico di Modena
Dipartimento interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica
Direttore dr. T.Trenti
Ospedale Civile S.Agostino-Estense
S.C. Medicina di Laboratorio
S.S. Patologia Clinica - Corelab
Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015
Responsabile dr.ssa M.Varani
Richiesta (CDA): 49/073914 Data di accettazione: 18/12/2018
Data di check-in: 18/12/2018 10:27:06
Referto del 18/12/2018 16:39:53
Provenienza: D4-cp sassuolo
Sig.
Data di Nascita:
Domicilio:
ANALISI RISULTATO __UNITA'DI MISURA VALORI DI RIFERIMENTO
Glucosio 95 mg/dl (70 - 110 )
Creatinina 1.03 mg/dl ( 0.50 - 1.40 )
eGFR Filtrato glomerulare stimato >60 ml/min Cut-off per rischio di I.R.
7 <60. Il calcolo é€ riferito
Equazione CKD-EPI ad una superfice corporea
Standard (1,73 mq)x In Caso
di etnia afroamericana
moltiplicare per il fattore
1,159.
Colesterolo 212 * mg/dl < 200 v.desiderabile
Trigliceridi 106 mg/dl < 180 v.desiderabile
Bilirubina totale 0.60 mg/dl ( 0.16 - 1.10 )
Bilirubina diretta 0.10 mg/dl ( 0.01 - 0.3 )
GOT - AST 17 U/L (1-37)
GPT - ALT ay U/L (1- 40 )
Gamma-GT 15 U/L (1-55)
Sodio 142 mEq/L ( 136 - 146 )
Potassio 4.3 mEq/L (3.5 - 5.3)
Vitamina B12 342 pg/ml ( 200 - 960 )
TSH 5.47 * ulU/ml (0.35 - 4.94 )
FT4 9.7 pg/ml (7 = 15)
Urine chimico fisico morfologico
u-Colore giallo paglierino
u-Peso specifico 1.012 ( 1.010 - 1.027 )
u-pH 5.5 (5.5 - 6.5)
u-Glucosio assente mg/dl assente
u-Proteine assente mg/dl (0 -10 )
u-Emoglobina assente mg/dl assente
u-Corpi chetonici assente mg/dl assente
u-Bilirubina assente mg/dl assente
u-Urobilinogeno 0.20 mg/dl (0- 1.0 )
sedimento non significativo
Il Laureato:
Dott. CRISTINA ROTA
Per ogni informazione o chiarimento sugli aspetti medici, puo rivolgersi al suo medico curante
Referto firmato elettronicamente secondo le norme vigenti: Legge 15 marzo 1997, n. 59; D.P.R. 10 novembre 1997, n.513;
D.P.C.M. 8 febbraio 1999; D.P.R 28 dicembre 2000, n.445; D.L. 23 gennaio 2002, n.10.
Certificato rilasciato da: Infocamere S.C.p.A. (http://www.card.infocamere. it)
i! Laureato: Dr. CRISTINA ROTA
1! documento informatico originale 6 conservato presso Parer - Polo Archivistico della Regione Emilia-Romagna
</details>
# 答案2
**得分**: -1
为了OCR图像并合理保留间距和缩进,而不是使用pytesseract的'image_to_data',我使用了它的'hOCR'格式来收集位置信息并重建格式。
我进行了自己的实现,还包括了一个参数,可以直接使用@igrinis的实现。我还实施了不同的知名算法,在运行OCR之前应用“智能”灰度。
这是链接:https://pypi.org/project/OCR-with-format/
<details>
<summary>英文:</summary>
To OCR image while reasonnably keeping the spacings and indentation instead of using pytesseract 'image_to_data' I used it's 'hOCR' format to gather the location information and reconstruct the format.
I did an implementation of my own but also included an argument to use directly @igrinis 's implementation. I also implemented different well known algorithm to apply "smart" grayscale before running the OCR.
Here it is: https://pypi.org/project/OCR-with-format/
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论