Question

I am working on a React.js project that is almost finished. My problem is that when I use Tesseract (OCR) to convert images with two or three columns to text, the result is not what I want: the text from the columns is mixed together instead of being converted column by column. Is there any way to solve this?


import React, { useState, useEffect } from "react";
import Tesseract from "tesseract.js";
import ClipboardJS from "clipboard";
import Select from "react-select";

const languageOptions = [
  { value: "afr", label: "Afrikaans" },
  { value: "amh", label: "Amharic" },
  { value: "ara", label: "Arabic" },
  { value: "asm", label: "Assamese" },
  { value: "aze", label: "Azerbaijani" },
  { value: "aze_cyrl", label: "Azerbaijani - Cyrillic" },
  { value: "bel", label: "Belarusian" },
  { value: "ben", label: "Bengali" },
  { value: "bod", label: "Tibetan" },
  { value: "bos", label: "Bosnian" },
  { value: "bul", label: "Bulgarian" },
  { value: "cat", label: "Catalan; Valencian" },
  { value: "ceb", label: "Cebuano" },
  { value: "ces", label: "Czech" },
  { value: "chi_sim", label: "Chinese - Simplified" },
  { value: "chi_tra", label: "Chinese - Traditional" },
  { value: "chr", label: "Cherokee" },
  { value: "cym", label: "Welsh" },
  { value: "dan", label: "Danish" },
  { value: "deu", label: "German" },
  { value: "dzo", label: "Dzongkha" },
  { value: "ell", label: "Greek, Modern (1453-)" },
  { value: "eng", label: "English" },
  { value: "enm", label: "English, Middle (1100-1500)" },
  { value: "epo", label: "Esperanto" },
  { value: "est", label: "Estonian" },
  { value: "eus", label: "Basque" },
  { value: "fas", label: "Persian" },
  { value: "fin", label: "Finnish" },
  { value: "fra", label: "French" },
  { value: "frk", label: "German Fraktur" },
  { value: "frm", label: "French, Middle (ca. 1400-1600)" },
  { value: "gle", label: "Irish" },
  { value: "glg", label: "Galician" },
  { value: "grc", label: "Greek, Ancient (-1453)" },
  { value: "guj", label: "Gujarati" },
  { value: "hat", label: "Haitian; Haitian Creole" },
  { value: "heb", label: "Hebrew" },
  { value: "hin", label: "Hindi" },
  { value: "hrv", label: "Croatian" },
  { value: "hun", label: "Hungarian" },
  { value: "iku", label: "Inuktitut" },
  { value: "ind", label: "Indonesian" },
  { value: "isl", label: "Icelandic" },
  { value: "ita", label: "Italian" },
  { value: "ita_old", label: "Italian - Old" },
  { value: "jav", label: "Javanese" },
  { value: "jpn", label: "Japanese" },
  { value: "kan", label: "Kannada" },
  { value: "kat", label: "Georgian" },
  { value: "kat_old", label: "Georgian - Old" },
  { value: "kaz", label: "Kazakh" },
  { value: "khm", label: "Central Khmer" },
  { value: "kir", label: "Kirghiz; Kyrgyz" },
  { value: "kor", label: "Korean" },
  { value: "kur", label: "Kurdish" },
  { value: "lao", label: "Lao" },
  { value: "lat", label: "Latin" },
  { value: "lav", label: "Latvian" },
  { value: "lit", label: "Lithuanian" },
  { value: "mal", label: "Malayalam" },
  { value: "mar", label: "Marathi" },
  { value: "mkd", label: "Macedonian" },
  { value: "mlt", label: "Maltese" },
  { value: "msa", label: "Malay" },
  { value: "mya", label: "Burmese" },
  { value: "nep", label: "Nepali" },
  { value: "nld", label: "Dutch; Flemish" },
  { value: "nor", label: "Norwegian" },
  { value: "ori", label: "Oriya" },
  { value: "pan", label: "Panjabi; Punjabi" },
  { value: "pol", label: "Polish" },
  { value: "por", label: "Portuguese" },
  { value: "pus", label: "Pushto; Pashto" },
  { value: "ron", label: "Romanian; Moldavian; Moldovan" },
  { value: "rus", label: "Russian" },
  { value: "san", label: "Sanskrit" },
  { value: "sin", label: "Sinhala; Sinhalese" },
  { value: "slk", label: "Slovak" },
  { value: "slv", label: "Slovenian" },
  { value: "spa", label: "Spanish; Castilian" },
  { value: "spa_old", label: "Spanish; Castilian - Old" },
  { value: "sqi", label: "Albanian" },
  { value: "srp", label: "Serbian" },
  { value: "srp_latn", label: "Serbian - Latin" },
  { value: "swa", label: "Swahili" },
  { value: "swe", label: "Swedish" },
  { value: "syr", label: "Syriac" },
  { value: "tam", label: "Tamil" },
  { value: "tel", label: "Telugu" },
  { value: "tgk", label: "Tajik" },
  { value: "tgl", label: "Tagalog" },
  { value: "tha", label: "Thai" },
  { value: "tir", label: "Tigrinya" },
  { value: "tur", label: "Turkish" },
  { value: "uig", label: "Uighur; Uyghur" },
  { value: "ukr", label: "Ukrainian" },
  { value: "urd", label: "Urdu" },
  { value: "uzb", label: "Uzbek" },
  { value: "uzb_cyrl", label: "Uzbek - Cyrillic" },
  { value: "vie", label: "Vietnamese" },
  { value: "yid", label: "Yiddish" }
];

const ImagesToText = () => {
  const [isLoading, setIsLoading] = useState(false);
  const [images, setImages] = useState([]);
  const [texts, setTexts] = useState([]);
  const [progress, setProgress] = useState(0);
  const [currentImageIndex, setCurrentImageIndex] = useState(0);
  const [errorMessage, setErrorMessage] = useState("");
  const [errorLanguagesMessage, setErrorLanguagesMessage] = useState("");
  const [selectedLanguages, setSelectedLanguages] = useState([]);

  const handleImageUpload = (e) => {
    const selectedImages = Array.from(e.target.files);
    setImages(selectedImages);
    setErrorMessage("");
  };

  const handleCopyText = () => {
    const textWithSoftLineBreaks = texts.join("\n");
    navigator.clipboard.writeText(textWithSoftLineBreaks);
  };

  const handleDownloadText = () => {
    const element = document.createElement("a");
    const textBlob = new Blob([texts.join("\n")], { type: "text/plain" });
    element.href = URL.createObjectURL(textBlob);
    element.download = "converted_text.txt";
    document.body.appendChild(element);
    element.click();
    document.body.removeChild(element);
  };

  useEffect(() => {
    const clipboard = new ClipboardJS(".copy-button");
    clipboard.on("success", (e) => {
      e.clearSelection();
    });
    return () => {
      clipboard.destroy();
    };
  }, [texts]);

  const handleReset = () => {
    setIsLoading(false);
    setImages([]);
    setTexts([]);
    setProgress(0);
    setCurrentImageIndex(0);
    setErrorMessage("");
    setErrorLanguagesMessage("");
    window.location.reload();
  };

  const handleSubmit = async () => {
    if (images.length === 0) {
      setErrorMessage("Select an image to convert.");
      return;
    }
    if (selectedLanguages.length === 0) {
      setErrorLanguagesMessage("Select any language.");
      return;
    }
    setIsLoading(true);
    setProgress(0);
    setTexts([]);
    setCurrentImageIndex(0);
    setErrorMessage("");
    setErrorLanguagesMessage("");
    const totalImages = images.length;
    let processedImages = 0;
    if (Array.isArray(images)) {
      for (const [index, image] of images?.entries()) {
        setCurrentImageIndex(index + 1);
        try {
          const result = await Tesseract.recognize(
            image,
            selectedLanguages.map((lang) => lang.value).join("+")
          );
          const paragraphs = result.data.text.split("\n\n");
          const formattedParagraphs = paragraphs.map((paragraph) => {
            const sentences = paragraph.split(/[.|?]\s/);
            return sentences.join(" ");
          });
          setTexts((prevTexts) => [...prevTexts, ...formattedParagraphs]);
        } catch (err) {
          console.error(err);
          // Clear texts and stop conversion process immediately on error
          setTexts([]);
          setProgress(0);
          setIsLoading(false);
          return;
        } finally {
          processedImages++;
          const currentProgress = (processedImages / totalImages) * 100;
          setProgress(currentProgress);
        }
      }
    } else {
      console.error("Images is not an array.");
    }
    setIsLoading(false);
  };

  return (
    <div className="container" style={{ height: "97vh" }}>
      <div className="row h-100 mt-3">
        <div className="col-md-3 left-bar sticky-top border 1 ms-2">
          <h1 className="center py-3 mc-5 underline">Images to text (ocr)</h1>
          <input
            type="file"
            onChange={handleImageUpload}
            className="form-control mt-5 mb-2"
            multiple
            accept="image/*"
          />
          {errorMessage && <p className="text-danger">{errorMessage}</p>}
          <Select
            isMulti
            options={languageOptions}
            value={selectedLanguages}
            onChange={setSelectedLanguages}
            placeholder="Select languages..."
          />
          {errorLanguagesMessage && (
            <p className="text-danger">{errorLanguagesMessage}</p>
          )}
          <input
            type="button"
            onClick={handleSubmit}
            className="btn btn-outline-success mt-3"
            value="Start Convert"
          />
          {texts.length > 0 && (
            <button
              className="btn btn-primary mt-3 ms-1"
              onClick={handleDownloadText}
            >
              Download Text
            </button>
          )}
          <div className="mt-1">
            <button className=" btn ml-2 btn-danger" onClick={handleReset}>
              Reset
            </button>
            <button
              className="mt-3 btn btn-secondary d-inline ms-1 "
              onClick={handleCopyText}
            >
              Copy Text
            </button>
          </div>
        </div>
        <div className="col-md-8 right-bar border 1 ms-2">
          <h4 className="mt-5 text-center">Select an Image to convert (ocr)</h4>
          {isLoading && (
            <div className="text-center">
              <div className="text-center">
                <progress
                  className="custom-progress-bar"
                  value={progress}
                  max="100"
                ></progress>
                <p className="text-center py-0 my-0">
                  Converting...: {progress.toFixed(0)}% ({currentImageIndex} of{" "}
                  {images.length})
                </p>
              </div>
            </div>
          )}
          {!isLoading && texts.length > 0 && (
            <div>
              <div className="form-control box-p w-100 mt-5 m-none">
                {texts.map((paragraph, index) => (
                  <p key={index}>{paragraph}</p>
                ))}
              </div>
            </div>
          )}
        </div>
      </div>
    </div>
  );
};

export default ImagesToText;

I also tried using opencv.js, but I couldn't solve the problem with it.

Answer 1

Score: 1

Try a different Page Segmentation Mode (PSM). By default, Tesseract.js uses PSM_SINGLE_BLOCK, which assumes that the text forms a single uniform block; that is not the case here. I'd recommend trying PSM_AUTO_OSD to see what results you get, then experimenting with the other PSMs.

I ran PSM_AUTO_OSD on your document and found that it printed column 1 (left), followed by column 2 (right).
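
For reference, here is a minimal sketch of how the page segmentation mode might be set from the component in the question. It assumes Tesseract.js v5 or later, where `createWorker(langs)` returns a worker that is ready to use (older versions need `loadLanguage()` and `initialize()` first). `recognizeWithPsm` is a hypothetical helper name, not part of the library; it takes the imported `Tesseract` module as its first argument so it can reuse the question's existing default import.

```javascript
// Sketch only, assuming Tesseract.js v5+ (createWorker(langs) returns a ready worker).
// `recognizeWithPsm` is a hypothetical helper, not a Tesseract.js function.
async function recognizeWithPsm(Tesseract, image, langs, psmName = "AUTO_OSD") {
  const { createWorker, PSM } = Tesseract;
  if (!(psmName in PSM)) {
    throw new Error(`Unknown page segmentation mode: ${psmName}`);
  }
  const worker = await createWorker(langs);
  try {
    // Ask Tesseract to analyse the page layout instead of assuming one block.
    await worker.setParameters({ tessedit_pageseg_mode: PSM[psmName] });
    const { data } = await worker.recognize(image);
    return data.text;
  } finally {
    // Release the worker whether recognition succeeded or failed.
    await worker.terminate();
  }
}
```

In `handleSubmit`, the `Tesseract.recognize(...)` call could then be swapped for something like `const text = await recognizeWithPsm(Tesseract, image, selectedLanguages.map((lang) => lang.value));`, with `text` taking the place of `result.data.text`. If `AUTO_OSD` does not behave well in a given setup, passing `"AUTO"` as the fourth argument selects automatic page segmentation without orientation and script detection.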
