How to automatically extract data from large texts with queries

Question


I have large PDF files (hundreds of pages, in French) that describe a set of rules for my sector of activity.

I am looking for a service that would let me query the PDFs (or the text I extract from them) one at a time and get the information automatically.

(Example: What is the maximum authorized length of X?)

I looked at OpenAI's ChatGPT and ran into maximum-token limits because, as mentioned, the texts are huge.

I looked at Amazon Textract, which does have a query system, but it seems built for image processing, so converting my text into images does not seem optimal, especially since the images would need to be very large. (I have not yet found software that can merge those PDFs into one very large image without running into memory issues, and I am fairly certain Textract could not handle such an image.)

I looked at other solutions online, but nothing seemed to meet my need for handling large texts combined with complex queries.

Answer 1

Score: 1


Amazon Textract supports PDFs as input, so you would not need to convert your PDFs to text and then to images.

PDF and TIFF files have a 500 MB limit. PDF and TIFF files have a limit of 3,000 pages.
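As a rough pre-flight check against the size limit quoted above, something like the following could be run before uploading. This is only a sketch: the helper name within_size_limit is made up for illustration, and reading "500 MB" as 500 * 1024 * 1024 bytes is an assumption. The page count is not checked here, since that would need a PDF parsing library.

```python
import os

# Assumption: "500 MB" is interpreted as 500 * 1024 * 1024 bytes.
MAX_PDF_BYTES = 500 * 1024 * 1024


def within_size_limit(path: str, max_bytes: int = MAX_PDF_BYTES) -> bool:
    """Return True if the file at `path` is no larger than `max_bytes`."""
    return os.path.getsize(path) <= max_bytes
```

Documents of a few hundred pages, as described in the question, are well under the 3,000-page limit, so file size is the more likely constraint to check.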

Here is a tutorial on using queries with Textract: https://aws-samples.github.io/amazon-textract-textractor/notebooks/using_queries.html. For multi-page documents you need the asynchronous API, via .start_document_analysis.

The relevant code is here:

from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractcaller import QueriesConfig, Query

# Uses the AWS credentials stored under the "default" profile
extractor = Textractor(profile_name="default")

# Asynchronous analysis, required for multi-page documents
document1 = extractor.start_document_analysis(
    file_source='./multipage.pdf',
    features=[TextractFeatures.QUERIES],
    s3_upload_path='<YOUR_S3_BUCKET>',  # where the local file is uploaded
    s3_output_path='<YOUR_S3_BUCKET>',  # where Textract writes its output
    save_image=True,
    queries=QueriesConfig([Query("What is the first row value")])
)

# Answer to the first query
document1.queries[0].result
# 0.129853474
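If the job is run through boto3 directly instead of the textractor wrapper, the answers come back as raw blocks that have to be matched up by hand. The following is a minimal sketch, assuming the documented Textract response shape: QUERY blocks that point to QUERY_RESULT blocks through an ANSWER relationship, and NextToken pagination on get_document_analysis. The helper names fetch_all_blocks and collect_query_answers are illustrative and not part of any library, and `client` is assumed to behave like boto3.client("textract").

```python
from typing import Any, Dict, List, Optional, Tuple


def fetch_all_blocks(client: Any, job_id: str) -> List[Dict[str, Any]]:
    """Collect every block of a finished job, following NextToken pagination."""
    blocks: List[Dict[str, Any]] = []
    next_token: Optional[str] = None
    while True:
        kwargs: Dict[str, Any] = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        status = response.get("JobStatus")
        if status != "SUCCEEDED":
            # The caller is expected to poll until the job has finished.
            raise RuntimeError(f"job not finished: {status}")
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            return blocks


def collect_query_answers(
    blocks: List[Dict[str, Any]],
) -> Dict[str, List[Tuple[str, float]]]:
    """Map each query text to the (answer text, confidence) pairs found for it."""
    by_id = {block["Id"]: block for block in blocks}
    answers: Dict[str, List[Tuple[str, float]]] = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        found = answers.setdefault(block["Query"]["Text"], [])
        for relationship in block.get("Relationships", []):
            if relationship.get("Type") != "ANSWER":
                continue
            for answer_id in relationship.get("Ids", []):
                result = by_id.get(answer_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    found.append(
                        (result.get("Text", ""), result.get("Confidence", 0.0))
                    )
    return answers
```

A query with no answer on any page maps to an empty list, which makes it easy to spot questions the document did not answer.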

huangapple
  • Posted on February 23, 2023 at 21:25:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75545480.html