2020年1月4日 00:27:49go评论87阅读模式

英文:

Preserving indentation with Tesseract OCR 4.x

问题

我在使用Tesseract OCR时遇到了困难。
我有一张血液检查图像，其中包含一个带有缩进的表格。尽管Tesseract能够非常好地识别字符，但其结构在最终输出中没有被保留。例如，请查看“Emocromo con formula”（英文翻译：带有公式的血液计数）下面的缩进行。我想保留这个缩进。

我阅读了其他相关的讨论，并找到了选项preserve_interword_spaces=1。结果稍微好一些，但正如你所看到的，仍然不完美。

有什么建议吗？

更新：

我尝试了Tesseract v5.0，结果还是一样。

代码：

Tesseract版本是4.0.0.20190314

from PIL import Image
import pytesseract

# 保留单词间的空格设置为1，oem = 1是LSTM，
# PSM = 1是具有OSD的自动页面分割 - 方向和脚本检测

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'

# 默认配置 = r'-c -l eng+ita'

extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)

print(extracted_text)

# 保存到txt文件

with open("referto.txt", "w") as text_file:
    text_file.write(extracted_text)

带比较的结果：

GITHUB：

我创建了一个 GitHub 存储库，如果你想自己尝试。

感谢你的帮助和时间。

英文:

I'm struggling with Tesseract OCR.
I have a blood examination image, it has a table with indentation. Although tesseract recognizes the characters very well, its structure isn't preserved in the final output. For example, look the lines below "Emocromo con formula" (Eng. Translation: blood count with formula) that are indented. I want to preserve that indentation.

I read the other related discussions and I found the option preserve_interword_spaces=1. The result became slightly better but as you can see, it isn't perfect.

Any suggestions?

Update:

I tried Tesseract v5.0 and the result is the same.

Code:

Tesseract version is 4.0.0.20190314

from PIL import Image
import pytesseract

# Preserve interword spaces is set to 1, oem = 1 is LSTM, 
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection

custom_config = r&#39;-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita&#39;

# default_config = r&#39;-c -l eng+ita&#39;

extracted_text = pytesseract.image_to_string(Image.open(&#39;referto-1.jpg&#39;), config=custom_config)

print(extracted_text)

# saving to a txt file

with open(&quot;referto.txt&quot;, &quot;w&quot;) as text_file:
    text_file.write(extracted_text)

Result with comparison:

GITHUB:

I have created a GitHub repository if you want to try it yourself.

Thanks for your help and your time

答案1

得分: 18

image_to_data() 函数提供了更多信息。对于每个单词，它会返回其边界矩形。你可以使用它。

Tesseract 会自动将图像分割成块。然后，你可以根据它们的垂直位置对块进行排序，并针对每个块找到字符的平均宽度（这取决于块识别的字体）。然后，对于块中的每个单词，检查它是否靠近前一个单词，如果不是，则相应地添加空格。我使用 pandas 来简化计算，但不必要使用它。不要忘记结果应该使用等宽字体显示。

import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

# 清除空白
df1 = df[(df.conf!='-1')&(df.text!=' ')&(df.text!='')]
# 垂直排序块
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num']==block]
    sel = curr[curr.text.str.len()>3]
    char_w = (sel.width/sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # 需要时添加新行
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # 应该添加的空格数
        if ln['left']/char_w > prev_left + 1:
            added = int((ln['left'])/char_w) - prev_left
            text += ' ' * added 
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)

此代码将产生以下输出：

ssseeess+ SERVIZIO SANITARIO REGIONALE                          Pagina 2 di3 
seoeeeees EMILIA-RROMAGNA 
©2888   800 
©9868  6 006   :       pe   ‘  ‘        " 
©171;ee @@e@ecee Azienda Unita Sanitaria Locale di Modena 
Seat se  ces Amends Ospedaliero-Universitaria Policlinico di Modena 
Dipartimento  interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica 
Direttore dr. T.Trenti 
Ospedale Civile S.Agostino-Estense 
S.C. Medicina  di Laboratorio 
S.S. Patologia  Clinica - Corelab 
Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015 
Responsabile dr.ssa M.Varani 
Richiesta (CDA):   49/073914                                    Data di accettazione: 18/12/2018 
Data di check-in:    18/12/2018 10:27:06 
Referto del          18/12/2018 16:39:53 
Provenienza:         D4-cp sassuolo 
Sig. 
Data di Nascita: 
Domicilio: 
ANALISI                                              RISULTATO  __UNITA'DI MISURA VALORI DI RIFERIMENTO 
Glucosio                                                     95     mg/dl            (70  - 110 ) 
Creatinina                                                 1.03     mg/dl            ( 0.50 - 1.40 ) 
eGFR  Filtrato glomerulare stimato                         >60      ml/min           Cut-off per rischio di  I.R. 
7                                                                              <60. Il calcolo è€ riferito 
Equazione  CKD-EPI                                                                   ad una superfice corporea 
Standard  (1,73 mq)x In Caso 
di etnia afroamericana 
moltiplicare per  il fattore 
1,159. 
Colesterolo                                                212   *  mg/dl            < 200 v.desiderabile 
Trigliceridi                                                106     mg/dl            < 180 v.desiderabile 
Bilirubina totale                                          0.60     mg/dl            ( 0.16 - 1.10 ) 
Bilirubina diretta                                         0.10     mg/dl            ( 0.01 - 0.3 ) 
GOT  - AST                                                   17     U/L              (1-37) 
GPT  - ALT                                                   ay     U/L              (1-   40 ) 
Gamma-GT                                                     15     U/L              (1-55) 
Sodio                                                       142     mEq/L            ( 136 - 146 ) 
Potassio                                                    4.3     mEq/L            (3.5  - 5.3) 
Vitamina B12                                               342      pg/ml            ( 200 - 960 ) 
TSH                                                        5.47  *  ulU/ml           (0.35  - 4.94 ) 
FT4                                                         9.7     pg/ml            (7  = 15) 
Urine chimico fisico morfologico 
u-Colore                                     giallo paglierino 
u-Peso specifico                                       1.012                      ( 1.010 - 1.027  ) 
u-pH                                                     5.5                      (5.5  - 6.5) 
u-Glucosio                                           assente     mg/dl            assente 
u-Proteine                                           assente     mg/dl            (0  -10 ) 
u-Emoglobina                                         assente     mg/dl            assente 
u-Corpi chetonici                                    assente     mg/dl            assente 
u-Bilirubina                                         assente     mg/dl            assente 
u-Urobilinogeno                                         0.20     mg/dl            (0-   1.0 ) 
sedimento                                    non significativo 
Il Laureato: 
Dott. CRISTINA ROTA 
Per ogni informazione
<details>
<summary>英文:</summary>
`image_to_data()` function provides much more information. For each word it will return it&#39;s bounding rectangle. You can use that. 
`Tesseract` segments the image automatically to blocks. Then you can sort block by their vertical position and for each block you can find mean character width (that depends on the block&#39;s recognized font). Then for each word in the block check if it is close to the previous one, if not add spaces accordingly. I&#39;m using `pandas` to ease on calculations, but it&#39;s usage is not necessary. Don&#39;t forget that the result should be displayed using monospaced font.
import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd
custom_config = r&#39;-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita&#39;
d = pytesseract.image_to_data(Image.open(r&#39;referto-2.jpg&#39;), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)
# clean up blanks
df1 = df[(df.conf!=&#39;-1&#39;)&amp;(df.text!=&#39; &#39;)&amp;(df.text!=&#39;&#39;)]
# sort blocks vertically
sorted_blocks = df1.groupby(&#39;block_num&#39;).first().sort_values(&#39;top&#39;).index.tolist()
for block in sorted_blocks:
curr = df1[df1[&#39;block_num&#39;]==block]
sel = curr[curr.text.str.len()&gt;3]
char_w = (sel.width/sel.text.str.len()).mean()
prev_par, prev_line, prev_left = 0, 0, 0
text = &#39;&#39;
for ix, ln in curr.iterrows():
# add new line when necessary
if prev_par != ln[&#39;par_num&#39;]:
text += &#39;\n&#39;
prev_par = ln[&#39;par_num&#39;]
prev_line = ln[&#39;line_num&#39;]
prev_left = 0
elif prev_line != ln[&#39;line_num&#39;]:
text += &#39;\n&#39;
prev_line = ln[&#39;line_num&#39;]
prev_left = 0
added = 0  # num of spaces that should be added
if ln[&#39;left&#39;]/char_w &gt; prev_left + 1:
added = int((ln[&#39;left&#39;])/char_w) - prev_left
text += &#39; &#39; * added 
text += ln[&#39;text&#39;] + &#39; &#39;
prev_left += len(ln[&#39;text&#39;]) + added + 1
text += &#39;\n&#39;
print(text)
This code will produce following output:
ssseeess+ SERVIZIO SANITARIO REGIONALE                          Pagina 2 di3 
seoeeeees EMILIA-RROMAGNA 
&#169;2888   800 
&#169;9868  6 006   :       pe   ‘  ‘        &quot; 
&#171;ee @@e@ecee Azienda Unita Sanitaria Locale di Modena 
Seat se  ces Amends Ospedaliero-Universitaria Policlinico di Modena 
Dipartimento  interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica 
Direttore dr. T.Trenti 
Ospedale Civile S.Agostino-Estense 
S.C. Medicina  di Laboratorio 
S.S. Patologia  Clinica - Corelab 
Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015 
Responsabile dr.ssa M.Varani 
Richiesta (CDA):   49/073914                                    Data di accettazione: 18/12/2018 
Data di check-in:    18/12/2018 10:27:06 
Referto del          18/12/2018 16:39:53 
Provenienza:         D4-cp sassuolo 
Sig. 
Data di Nascita: 
Domicilio: 
ANALISI                                              RISULTATO  __UNITA&#39;DI MISURA VALORI DI RIFERIMENTO 
Glucosio                                                     95     mg/dl            (70  - 110 ) 
Creatinina                                                 1.03     mg/dl            ( 0.50 - 1.40 ) 
eGFR  Filtrato glomerulare stimato                         &gt;60      ml/min           Cut-off per rischio di  I.R. 
7                                                                              &lt;60. Il calcolo &#233;€ riferito 
Equazione  CKD-EPI                                                                   ad una superfice corporea 
Standard  (1,73 mq)x In Caso 
di etnia afroamericana 
moltiplicare per  il fattore 
1,159. 
Colesterolo                                                212   *  mg/dl            &lt; 200 v.desiderabile 
Trigliceridi                                                106     mg/dl            &lt; 180 v.desiderabile 
Bilirubina totale                                          0.60     mg/dl            ( 0.16 - 1.10 ) 
Bilirubina diretta                                         0.10     mg/dl            ( 0.01 - 0.3 ) 
GOT  - AST                                                   17     U/L              (1-37) 
GPT  - ALT                                                   ay     U/L              (1-   40 ) 
Gamma-GT                                                     15     U/L              (1-55) 
Sodio                                                       142     mEq/L            ( 136 - 146 ) 
Potassio                                                    4.3     mEq/L            (3.5  - 5.3) 
Vitamina B12                                               342      pg/ml            ( 200 - 960 ) 
TSH                                                        5.47  *  ulU/ml           (0.35  - 4.94 ) 
FT4                                                         9.7     pg/ml            (7  = 15) 
Urine chimico fisico morfologico 
u-Colore                                     giallo paglierino 
u-Peso specifico                                       1.012                      ( 1.010 - 1.027  ) 
u-pH                                                     5.5                      (5.5  - 6.5) 
u-Glucosio                                           assente     mg/dl            assente 
u-Proteine                                           assente     mg/dl            (0  -10 ) 
u-Emoglobina                                         assente     mg/dl            assente 
u-Corpi chetonici                                    assente     mg/dl            assente 
u-Bilirubina                                         assente     mg/dl            assente 
u-Urobilinogeno                                         0.20     mg/dl            (0-   1.0 ) 
sedimento                                    non significativo 
Il Laureato: 
Dott. CRISTINA ROTA 
Per ogni informazione o chiarimento sugli aspetti medici, puo rivolgersi al suo medico curante 
Referto firmato elettronicamente secondo le norme vigenti: Legge 15 marzo 1997, n. 59; D.P.R. 10 novembre 1997, n.513; 
D.P.C.M. 8 febbraio 1999; D.P.R 28 dicembre 2000, n.445; D.L. 23 gennaio 2002, n.10. 
Certificato rilasciato da: Infocamere S.C.p.A. (http://www.card.infocamere. it) 
i! Laureato: Dr. CRISTINA ROTA 
1! documento informatico originale 6 conservato presso Parer - Polo Archivistico della Regione Emilia-Romagna 
</details>
# 答案2
**得分**: -1
为了OCR图像并合理保留间距和缩进，而不是使用pytesseract的'image_to_data'，我使用了它的'hOCR'格式来收集位置信息并重建格式。
我进行了自己的实现，还包括了一个参数，可以直接使用@igrinis的实现。我还实施了不同的知名算法，在运行OCR之前应用“智能”灰度。
这是链接：https://pypi.org/project/OCR-with-format/
<details>
<summary>英文:</summary>
To OCR image while reasonnably keeping the spacings and indentation instead of using pytesseract &#39;image_to_data&#39; I used it&#39;s &#39;hOCR&#39; format to gather the location information and reconstruct the format.
I did an implementation of my own but also included an argument to use directly @igrinis &#39;s implementation. I also implemented different well known algorithm to apply &quot;smart&quot; grayscale before running the OCR.
Here it is: https://pypi.org/project/OCR-with-format/
</details>

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

保留缩进与Tesseract OCR 4.x

问题

答案1

如何修复在Streamlit应用的Matplotlib折线图中图例条目不匹配的问题？

将文本文件中较长的句子在逗号（多个逗号）处分成两部分。

我可以使用 Firebase Firestore Python SDK 对子项进行排序吗？

从Python中使用另一个列表的占位符来移除列表中的元素。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论