February 23, 2023, 21:25:58

How to automatically extract data from large texts with queries

Question

I have large PDF files (hundreds of pages, in French) that describe a set of rules for my sector of activity.

I am looking for a service that would let me query the PDFs (or the text I extract from them) one at a time to get the information automatically.

(Example: What is the maximum authorized length of X?)

I looked at OpenAI's ChatGPT and ran into maximum-token limits because, as mentioned, the texts are huge.

I looked at Amazon Textract, which does have a query system, but it seems built for image processing, so converting my text into images does not seem optimal, especially since the images would need to be very large. (I have not yet found software that can merge those PDFs into one very large image without running into memory issues, and I am fairly certain Textract could not handle such an image.)

I looked at other solutions online, but none seemed to handle both large texts and complex queries.

Answer 1

Score: 1

Amazon Textract supports PDFs as input, so you would not need to convert your PDFs to text and back to images.

PDF and TIFF files are limited to 500 MB and 3,000 pages.
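As a rough pre-flight check, the two limits quoted above can be tested before uploading anything. This is only a minimal sketch: `within_textract_limits` is a hypothetical helper, not part of any AWS library, and it assumes 1 MB means 1024 * 1024 bytes and that you obtain the page count yourself.

```python
# Limits as quoted in this answer; the byte conversion is an assumption.
MAX_BYTES = 500 * 1024 * 1024  # 500 MB, assuming 1 MB = 1024 * 1024 bytes
MAX_PAGES = 3000


def within_textract_limits(size_bytes: int, page_count: int) -> bool:
    """Hypothetical helper: True if a PDF/TIFF fits both quoted limits."""
    return size_bytes <= MAX_BYTES and page_count <= MAX_PAGES


# A 40 MB file with 350 pages fits; the same file with 3,500 pages does not.
print(within_textract_limits(size_bytes=40 * 1024 * 1024, page_count=350))
print(within_textract_limits(size_bytes=40 * 1024 * 1024, page_count=3500))
```

The file size can come from `os.path.getsize`; the page count would have to come from whatever PDF library you already use.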

Here is a tutorial on using queries with Textract. For multi-page documents you need to use the asynchronous API, via .start_document_analysis: https://aws-samples.github.io/amazon-textract-textractor/notebooks/using_queries.html

The relevant code is here:

from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractcaller import QueriesConfig, Query

# Uses the AWS credentials stored under the "default" profile
extractor = Textractor(profile_name="default")

# Asynchronous analysis, required for multi-page documents
document1 = extractor.start_document_analysis(
    file_source='./multipage.pdf',
    features=[TextractFeatures.QUERIES],
    s3_upload_path='<YOUR_S3_BUCKET>',
    s3_output_path='<YOUR_S3_BUCKET>',
    save_image=True,
    queries=QueriesConfig([Query("What is the first row value")])
)

# Result of the first (and only) query
document1.queries[0].result

0.129853474
