Convert image data frame into pandas data frame

Question

I have images (JPG files) containing dataframes (imagine screenshots of an Excel spreadsheet), and I want to convert them into a pandas data frame.

Does anyone have a suggestion for how I could accomplish this?

Thanks!

I am thinking of using something like OCR, but I do not know how to implement this in Python.

Answer 1

Score: 2

You may consider pytesseract, an OCR library for Python, to help you extract the data from your JPG files.

Note that there are many OCR libraries available for Python; however, I recommend trying pytesseract first, since it exposes Tesseract's Page Segmentation Mode (PSM) setting, which can improve OCR accuracy by specifying how the provided image should be treated. Here is a code sample:

# Code sample: OCR with a page segmentation mode (PSM) config.
# `image` is an already-loaded PIL image; --psm 10 treats it as a single character.
import pytesseract
target = pytesseract.image_to_string(image, lang='eng', config='--psm 10')

To implement it in Python, you may follow the Medium blog post "Image Table to DataFrame using Python OCR", which explains step by step how to transform your image data (in this case, your JPG files) into a dataframe. I think that is what you are looking for. However, you must choose the page segmentation mode that suits your data format yourself. You can find details of each PSM option in "Tesseract Page Segmentation Modes (PSMs) Explained: How to Improve Your OCR Accuracy".
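As a rough illustration of the image-to-dataframe step (not the method from the blog post above), the sketch below asks pytesseract for word-level boxes with image_to_data and groups them into rows by vertical position. The helper names, the 10-pixel row tolerance, and the default of --psm 6 are assumptions chosen for illustration, and the grouping assumes each table cell holds a single word and that the first row is the header.

```python
def group_words_into_rows(words, row_tolerance=10):
    """Group OCR word boxes into table rows by vertical position.

    `words` is a list of dicts with the keys 'left', 'top' and 'text'
    (the field names reported by pytesseract.image_to_data). A word joins
    the current row when its 'top' is within `row_tolerance` pixels of the
    first word in that row. Assumption: one word per table cell.
    """
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        if not word["text"].strip():
            continue  # skip the empty entries Tesseract emits for blocks and lines
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order the cells of each row from left to right and keep only the text.
    return [[w["text"] for w in sorted(row, key=lambda w: w["left"])] for row in rows]


def image_to_dataframe(path, psm=6):
    """Hypothetical helper: OCR the image at `path` and build a DataFrame.

    Assumes the first detected row is the header and that every row has the
    same number of cells. The PSM value is a starting guess, not a recommendation.
    """
    # Imported here so the grouping helper above can be tried without
    # Tesseract, Pillow, or pandas installed.
    import pandas as pd
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(path),
        lang="eng",
        config=f"--psm {psm}",
        output_type=pytesseract.Output.DICT,
    )
    words = [
        {"left": left, "top": top, "text": text}
        for left, top, text in zip(data["left"], data["top"], data["text"])
    ]
    rows = group_words_into_rows(words)
    return pd.DataFrame(rows[1:], columns=rows[0])
```

Calling image_to_dataframe("table.jpg") (a placeholder file name) would return the table as a DataFrame of strings; numeric columns would still need converting, for example with pandas.to_numeric. Tables with multi-word cells need column grouping based on the 'left' coordinates rather than this simple left-to-right ordering.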

To get the most out of OCR, apply image processing to improve image quality before running it. Simple libraries such as Pillow or OpenCV can be used for this purpose. Extracting data from a sharp image rather than a blurred one can significantly improve OCR accuracy, so treat image enhancement as a priority: the better the quality of the input, the higher the OCR accuracy will be.
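One possible preprocessing pass with Pillow is sketched below: convert to grayscale, stretch the contrast, enlarge the image, and binarize it. The function name, the 2x scale factor, and the threshold of 160 are assumptions picked as starting points; which steps actually help depends on the screenshots, so compare OCR output with and without them.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image, scale=2, threshold=160):
    """Return an enlarged, black-and-white copy of `image` for OCR.

    `scale` and `threshold` are illustrative defaults, not tuned values.
    """
    gray = ImageOps.grayscale(image)      # drop colour information
    gray = ImageOps.autocontrast(gray)    # stretch contrast to the full 0-255 range
    enlarged = gray.resize(
        (gray.width * scale, gray.height * scale), Image.LANCZOS
    )                                     # small text often reads better when enlarged
    # Binarize: pixels brighter than the threshold become white, the rest black.
    return enlarged.point(lambda p: 255 if p > threshold else 0)
```

The result can be passed to pytesseract in place of the original image, for example preprocess_for_ocr(Image.open("table.jpg")), where the file name is a placeholder.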

Good luck!

huangapple
  • Posted on April 17, 2023, 00:11:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76028915.html