How to convert two/three-column images to text with Tesseract.js OCR?

Question

I am working on a React.js project that is almost finished, but I have one problem: when I use Tesseract (OCR) to convert an image with two or three columns to text, the output is not what I want. The text from the columns gets mixed together instead of being converted column by column. Is there any way to solve this?


import React, { useState, useEffect } from "react";
import Tesseract from "tesseract.js";
import ClipboardJS from "clipboard";
import Select from "react-select";

// Languages offered in the dropdown; `value` is the Tesseract language code.
const languageOptions = [
  { value: "afr", label: "Afrikaans" },
  { value: "amh", label: "Amharic" },
  { value: "ara", label: "Arabic" },
  { value: "asm", label: "Assamese" },
  { value: "aze", label: "Azerbaijani" },
  { value: "aze_cyrl", label: "Azerbaijani - Cyrillic" },
  { value: "bel", label: "Belarusian" },
  { value: "ben", label: "Bengali" },
  { value: "bod", label: "Tibetan" },
  { value: "bos", label: "Bosnian" },
  { value: "bul", label: "Bulgarian" },
  { value: "cat", label: "Catalan; Valencian" },
  { value: "ceb", label: "Cebuano" },
  { value: "ces", label: "Czech" },
  { value: "chi_sim", label: "Chinese - Simplified" },
  { value: "chi_tra", label: "Chinese - Traditional" },
  { value: "chr", label: "Cherokee" },
  { value: "cym", label: "Welsh" },
  { value: "dan", label: "Danish" },
  { value: "deu", label: "German" },
  { value: "dzo", label: "Dzongkha" },
  { value: "ell", label: "Greek, Modern (1453-)" },
  { value: "eng", label: "English" },
  { value: "enm", label: "English, Middle (1100-1500)" },
  { value: "epo", label: "Esperanto" },
  { value: "est", label: "Estonian" },
  { value: "eus", label: "Basque" },
  { value: "fas", label: "Persian" },
  { value: "fin", label: "Finnish" },
  { value: "fra", label: "French" },
  { value: "frk", label: "German Fraktur" },
  { value: "frm", label: "French, Middle (ca. 1400-1600)" },
  { value: "gle", label: "Irish" },
  { value: "glg", label: "Galician" },
  { value: "grc", label: "Greek, Ancient (-1453)" },
  { value: "guj", label: "Gujarati" },
  { value: "hat", label: "Haitian; Haitian Creole" },
  { value: "heb", label: "Hebrew" },
  { value: "hin", label: "Hindi" },
  { value: "hrv", label: "Croatian" },
  { value: "hun", label: "Hungarian" },
  { value: "iku", label: "Inuktitut" },
  { value: "ind", label: "Indonesian" },
  { value: "isl", label: "Icelandic" },
  { value: "ita", label: "Italian" },
  { value: "ita_old", label: "Italian - Old" },
  { value: "jav", label: "Javanese" },
  { value: "jpn", label: "Japanese" },
  { value: "kan", label: "Kannada" },
  { value: "kat", label: "Georgian" },
  { value: "kat_old", label: "Georgian - Old" },
  { value: "kaz", label: "Kazakh" },
  { value: "khm", label: "Central Khmer" },
  { value: "kir", label: "Kirghiz; Kyrgyz" },
  { value: "kor", label: "Korean" },
  { value: "kur", label: "Kurdish" },
  { value: "lao", label: "Lao" },
  { value: "lat", label: "Latin" },
  { value: "lav", label: "Latvian" },
  { value: "lit", label: "Lithuanian" },
  { value: "mal", label: "Malayalam" },
  { value: "mar", label: "Marathi" },
  { value: "mkd", label: "Macedonian" },
  { value: "mlt", label: "Maltese" },
  { value: "msa", label: "Malay" },
  { value: "mya", label: "Burmese" },
  { value: "nep", label: "Nepali" },
  { value: "nld", label: "Dutch; Flemish" },
  { value: "nor", label: "Norwegian" },
  { value: "ori", label: "Oriya" },
  { value: "pan", label: "Panjabi; Punjabi" },
  { value: "pol", label: "Polish" },
  { value: "por", label: "Portuguese" },
  { value: "pus", label: "Pushto; Pashto" },
  { value: "ron", label: "Romanian; Moldavian; Moldovan" },
  { value: "rus", label: "Russian" },
  { value: "san", label: "Sanskrit" },
  { value: "sin", label: "Sinhala; Sinhalese" },
  { value: "slk", label: "Slovak" },
  { value: "slv", label: "Slovenian" },
  { value: "spa", label: "Spanish; Castilian" },
  { value: "spa_old", label: "Spanish; Castilian - Old" },
  { value: "sqi", label: "Albanian" },
  { value: "srp", label: "Serbian" },
  { value: "srp_latn", label: "Serbian - Latin" },
  { value: "swa", label: "Swahili" },
  { value: "swe", label: "Swedish" },
  { value: "syr", label: "Syriac" },
  { value: "tam", label: "Tamil" },
  { value: "tel", label: "Telugu" },
  { value: "tgk", label: "Tajik" },
  { value: "tgl", label: "Tagalog" },
  { value: "tha", label: "Thai" },
  { value: "tir", label: "Tigrinya" },
  { value: "tur", label: "Turkish" },
  { value: "uig", label: "Uighur; Uyghur" },
  { value: "ukr", label: "Ukrainian" },
  { value: "urd", label: "Urdu" },
  { value: "uzb", label: "Uzbek" },
  { value: "uzb_cyrl", label: "Uzbek - Cyrillic" },
  { value: "vie", label: "Vietnamese" },
  { value: "yid", label: "Yiddish" }
];

const ImagesToText = () => {
  const [isLoading, setIsLoading] = useState(false);
  const [images, setImages] = useState([]);
  const [texts, setTexts] = useState([]);
  const [progress, setProgress] = useState(0);
  const [currentImageIndex, setCurrentImageIndex] = useState(0);
  const [errorMessage, setErrorMessage] = useState("");
  const [errorLanguagesMessage, setErrorLanguagesMessage] = useState("");
  const [selectedLanguages, setSelectedLanguages] = useState([]);

  // Store the files picked in the file input.
  const handleImageUpload = (e) => {
    const selectedImages = Array.from(e.target.files);
    setImages(selectedImages);
    setErrorMessage("");
  };

  // Copy all recognized paragraphs to the clipboard, one per line.
  const handleCopyText = () => {
    const textWithSoftLineBreaks = texts.join("\n");
    navigator.clipboard.writeText(textWithSoftLineBreaks);
  };

  // Offer the recognized text as a downloadable .txt file.
  const handleDownloadText = () => {
    const element = document.createElement("a");
    const textBlob = new Blob([texts.join("\n")], { type: "text/plain" });
    element.href = URL.createObjectURL(textBlob);
    element.download = "converted_text.txt";
    document.body.appendChild(element);
    element.click();
    document.body.removeChild(element);
  };

  useEffect(() => {
    const clipboard = new ClipboardJS(".copy-button");
    clipboard.on("success", (e) => {
      e.clearSelection();
    });
    return () => {
      clipboard.destroy();
    };
  }, [texts]);

  const handleReset = () => {
    setIsLoading(false);
    setImages([]);
    setTexts([]);
    setProgress(0);
    setCurrentImageIndex(0);
    setErrorMessage("");
    setErrorLanguagesMessage("");
    window.location.reload();
  };

  // Run OCR on every selected image, one after another.
  const handleSubmit = async () => {
    if (images.length === 0) {
      setErrorMessage("Select an image to convert.");
      return;
    }
    if (selectedLanguages.length === 0) {
      setErrorLanguagesMessage("Select any language.");
      return;
    }
    setIsLoading(true);
    setProgress(0);
    setTexts([]);
    setCurrentImageIndex(0);
    setErrorMessage("");
    setErrorLanguagesMessage("");
    const totalImages = images.length;
    let processedImages = 0;
    if (Array.isArray(images)) {
      for (const [index, image] of images.entries()) {
        setCurrentImageIndex(index + 1);
        try {
          // Languages are passed as a "+"-joined list, e.g. "eng+deu".
          const result = await Tesseract.recognize(
            image,
            selectedLanguages.map((lang) => lang.value).join("+")
          );
          const paragraphs = result.data.text.split("\n\n");
          const formattedParagraphs = paragraphs.map((paragraph) => {
            const sentences = paragraph.split(/[.|?]\s/);
            return sentences.join(" ");
          });
          setTexts((prevTexts) => [...prevTexts, ...formattedParagraphs]);
        } catch (err) {
          console.error(err);
          // Clear texts and stop conversion process immediately on error
          setTexts([]);
          setProgress(0);
          setIsLoading(false);
          return;
        } finally {
          processedImages++;
          const currentProgress = (processedImages / totalImages) * 100;
          setProgress(currentProgress);
        }
      }
    } else {
      console.error("Images is not an array.");
    }
    setIsLoading(false);
  };

  return (
    <div className="container" style={{ height: "97vh" }}>
      <div className="row h-100 mt-3">
        <div className="col-md-3 left-bar sticky-top border 1 ms-2">
          <h1 className="center py-3 mc-5 underline">Images to text (ocr)</h1>
          <input
            type="file"
            onChange={handleImageUpload}
            className="form-control mt-5 mb-2"
            multiple
            accept="image/*"
          />
          {errorMessage && <p className="text-danger">{errorMessage}</p>}
          <Select
            isMulti
            options={languageOptions}
            value={selectedLanguages}
            onChange={setSelectedLanguages}
            placeholder="Select languages..."
          />
          {errorLanguagesMessage && (
            <p className="text-danger">{errorLanguagesMessage}</p>
          )}
          <input
            type="button"
            onClick={handleSubmit}
            className="btn btn-outline-success mt-3"
            value="Start Convert"
          />
          {texts.length > 0 && (
            <button
              className="btn btn-primary mt-3 ms-1"
              onClick={handleDownloadText}
            >
              Download Text
            </button>
          )}
          <div className="mt-1">
            <button className=" btn ml-2 btn-danger" onClick={handleReset}>
              Reset
            </button>
            <button
              className="mt-3 btn btn-secondary d-inline ms-1 "
              onClick={handleCopyText}
            >
              Copy Text
            </button>
          </div>
        </div>
        <div className="col-md-8 right-bar border 1 ms-2">
          <h4 className="mt-5 text-center">Select an Image to convert (ocr)</h4>
          {isLoading && (
            <div className="text-center">
              <div className="text-center">
                <progress
                  className="custom-progress-bar"
                  value={progress}
                  max="100"
                ></progress>
                <p className="text-center py-0 my-0">
                  Converting...: {progress.toFixed(0)}% ({currentImageIndex} of{" "}
                  {images.length})
                </p>
              </div>
            </div>
          )}
          {!isLoading && texts.length > 0 && (
            <div>
              <div className="form-control box-p w-100 mt-5 m-none">
                {texts.map((paragraph, index) => (
                  <p key={index}>{paragraph}</p>
                ))}
              </div>
            </div>
          )}
        </div>
      </div>
    </div>
  );
};

export default ImagesToText;

I also tried opencv.js, but I could not solve the problem with it.

Answer 1

Score: 1

Try a different page segmentation mode (PSM). By default, Tesseract.js uses PSM_SINGLE_BLOCK, which assumes the text forms a single uniform block; that is not the case for a multi-column page. I'd recommend trying PSM_AUTO_OSD first to see what results you get, then experimenting with the other PSMs.

I ran your document with PSM_AUTO_OSD and found that it printed column 1 (left), followed by column 2 (right).
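For reference, here is a minimal sketch of how the page segmentation mode might be set in code. It assumes a tesseract.js worker that has already been created and initialised for the chosen languages. Worker setup differs between tesseract.js major versions, so check the documentation for the version you have installed. The helper names `recognizeWithPsm` and `comparePsms` are made up for illustration and are not part of the library. The numeric values mirror Tesseract's own PSM numbering, which tesseract.js also exposes through its exported `PSM` object.

```javascript
// Page segmentation mode values as Tesseract numbers them; tesseract.js
// exposes the same values (as strings) on its exported PSM object.
const PSM = {
  AUTO_OSD: "1",      // automatic layout analysis with orientation/script detection
  AUTO: "3",          // automatic layout analysis without OSD
  SINGLE_COLUMN: "4", // a single column of text of variable sizes
  SINGLE_BLOCK: "6",  // a single uniform block of text
};

// Hypothetical helper: `worker` is assumed to be an initialised tesseract.js
// worker. setParameters stays in effect on that worker until changed again.
async function recognizeWithPsm(worker, image, psm = PSM.AUTO_OSD) {
  await worker.setParameters({ tessedit_pageseg_mode: psm });
  const { data } = await worker.recognize(image);
  return data.text;
}

// Hypothetical helper for experimenting: run the same image through several
// modes and collect the text produced by each, keyed by mode value.
async function comparePsms(worker, image, modes) {
  const results = {};
  for (const mode of modes) {
    results[mode] = await recognizeWithPsm(worker, image, mode);
  }
  return results;
}
```

In the component from the question, this would mean creating a worker once and calling something like `recognizeWithPsm(worker, image)` inside the loop instead of `Tesseract.recognize(image, langs)`. Comparing the output of `AUTO_OSD`, `AUTO` and `SINGLE_COLUMN` on a sample page is a quick way to see which mode keeps the columns apart.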

huangapple
  • Published on June 19, 2023, 02:29:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76502022.html