April 17, 2023, 00:11:55

Convert an image of a data frame into a pandas DataFrame

Question

I have images (JPG files) of data frames (think of screenshots of an Excel spreadsheet), and I want to convert them into a pandas DataFrame.

Does anyone have a suggestion for how I could accomplish this?

Thanks.

I am thinking of using something like OCR, but I do not know how to implement it in Python.

Answer 1

Score: 2

You could use pytesseract, a Python OCR library that wraps the Tesseract engine, to extract the data from your JPG files.

Many OCR libraries are available in Python, but I recommend trying pytesseract first because it lets you set a Page Segmentation Mode (PSM), which tells the engine how to treat the layout of the image and can improve OCR accuracy. Here is a code sample:

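import pytesseract  # Python wrapper; the Tesseract binary must be installed separately
# OCR with a page segmentation mode (PSM) config; `image` is a PIL image or NumPy array loaded beforehand.
# --psm 6 assumes a single uniform block of text; the original --psm 10 expects a single character.
target = pytesseract.image_to_string(image, lang='eng', config='--psm 6')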
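Because the right mode depends on the layout of the image, it can help to run the same file through a few modes and compare the results. The following is a minimal sketch, assuming Tesseract 4 or later (modes 0 to 13). The helper names `psm_config` and `compare_psm_modes` are hypothetical, and modes 4, 6, and 11 are only plausible starting points for table screenshots.

```python
def psm_config(psm: int) -> str:
    """Build the Tesseract config string for a page segmentation mode (0-13)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def compare_psm_modes(image_path: str, modes=(4, 6, 11)) -> dict:
    """Run OCR once per mode and return {mode: recognized text}."""
    # Imported lazily so psm_config can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    return {
        psm: pytesseract.image_to_string(image, lang="eng", config=psm_config(psm))
        for psm in modes
    }
```

Printing each entry of the returned dictionary side by side usually makes it clear which mode preserves the rows of the table best.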

To implement this in Python, you can follow the Medium blog post "Image Table to DataFrame using Python OCR", which explains step by step how to turn image data (in this case, your JPG files) into a DataFrame. I think that is what you are looking for. You will have to choose the page segmentation mode that suits your data layout yourself; each PSM option is described in detail in "Tesseract Page Segmentation Modes (PSMs) Explained: How to Improve Your OCR Accuracy".
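As a rough illustration of the image-to-DataFrame step (not the method from the blog post), one option is to ask pytesseract for word positions and rebuild the grid from them. The sketch below assumes every cell contains a single word, no cell is empty, and rows are separated by more than `row_tol` pixels. `words_to_table` and `row_tol` are hypothetical names introduced here.

```python
import pandas as pd


def words_to_table(words: pd.DataFrame, row_tol: int = 10) -> pd.DataFrame:
    """Arrange OCR word boxes into a table, using the first row as the header.

    `words` needs the columns `left`, `top`, and `text`, as returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DATAFRAME).
    """
    # Drop layout-only entries, which have missing or blank text.
    words = words.dropna(subset=["text"])
    words = words[words["text"].astype(str).str.strip() != ""]
    words = words.sort_values("top")

    # Start a new row whenever the vertical gap to the previous word exceeds row_tol.
    row_id = (words["top"].diff().fillna(0) > row_tol).cumsum()

    # Within each row, order the words from left to right.
    rows = [
        group.sort_values("left")["text"].astype(str).tolist()
        for _, group in words.groupby(row_id)
    ]
    header, *body = rows
    return pd.DataFrame(body, columns=header)
```

The input would come from a call such as `pytesseract.image_to_data(image, config='--psm 6', output_type=pytesseract.Output.DATAFRAME)`. Cells with several words or blank cells need extra handling, for example clustering the `left` coordinates into columns.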

To get the most out of OCR, improve the image quality with image processing before running it. Libraries such as Pillow or OpenCV are enough for this. A sharp image yields noticeably more accurate results than a blurred one, so treat image enhancement as a priority: the better the input quality, the higher the OCR accuracy.
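A minimal preprocessing sketch with Pillow might look like the following. The function name `preprocess_for_ocr` is hypothetical, and the default `scale` and `threshold` values are arbitrary starting points that would need tuning for a given set of screenshots.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image: Image.Image, scale: int = 2, threshold: int = 160) -> Image.Image:
    """Grayscale, upscale, boost contrast, and binarize an image before OCR."""
    gray = ImageOps.grayscale(image)
    # Upscaling helps when the text in a screenshot is only a few pixels tall.
    big = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Stretch the histogram so the darkest pixel becomes black and the lightest white.
    contrasted = ImageOps.autocontrast(big)
    # Binarize: pixels brighter than `threshold` become white, the rest black.
    return contrasted.point(lambda p: 255 if p > threshold else 0)
```

The returned image can be passed directly to `pytesseract.image_to_string` or `pytesseract.image_to_data` in place of the original.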

Good luck!
