How to automatically extract data from large texts with queries
Question
I have large PDF files (hundreds of pages, in French) that describe a set of rules for my sector of activity.
I am looking for a service that would let me query the PDFs (or the text I extract from them) one at a time to get the information automatically.
(Example: What is the maximum authorized length of X?)
I looked at OpenAI's ChatGPT and ran into maximum-token limits because, as mentioned, the texts are huge.
I looked at Amazon Textract, which does have a query system, but it seems built for image processing, so converting my text into images doesn't seem optimal, especially since the images would need to be very large (I haven't yet found software that can merge those PDFs into one very large image without running into memory issues, and I'm fairly certain Textract could not handle such images).
I looked at other solutions online, but nothing seemed to meet my need for complex queries over large texts.
Answer 1
Score: 1
Amazon Textract supports PDFs as input, so you wouldn't need to convert your PDFs to text and back to images.
PDF and TIFF files are limited to 500 MB and 3,000 pages.
Here is a tutorial on using queries with Textract. For multi-page documents you need to use the asynchronous API, via .start_document_analysis:
https://aws-samples.github.io/amazon-textract-textractor/notebooks/using_queries.html
The relevant code is here:
from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractcaller import QueriesConfig, Query

# Uses the AWS credentials stored under the "default" profile.
extractor = Textractor(profile_name="default")

# Asynchronous analysis, needed for multi-page documents.
document1 = extractor.start_document_analysis(
    file_source='./multipage.pdf',
    features=[TextractFeatures.QUERIES],
    s3_upload_path='<YOUR_S3_BUCKET>',
    s3_output_path='<YOUR_S3_BUCKET>',
    save_image=True,
    queries=QueriesConfig([Query("What is the first row value")])
)

# Answer to the first query.
document1.queries[0].result
0.129853474
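Since the limits quoted above (500 MB and 3,000 pages for PDF and TIFF files) apply per document, it may be worth checking each file before uploading it. The following is only a minimal sketch: the function names are hypothetical, the 500 MB figure is assumed to mean 500 × 1024 × 1024 bytes, and the page count is assumed to come from whatever PDF tooling you already use, since this sketch does not parse PDFs itself.

```python
import os

# Limits as quoted in the answer for PDF and TIFF input.
MAX_BYTES = 500 * 1024 * 1024  # assumption: "500 MB" read as binary megabytes
MAX_PAGES = 3000


def check_textract_limits(size_bytes, page_count):
    """Return a list of problems; an empty list means within the quoted limits."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    if page_count > MAX_PAGES:
        problems.append(f"document has {page_count} pages, limit is {MAX_PAGES}")
    return problems


def check_pdf_file(path, page_count):
    """Hypothetical convenience wrapper: reads the file size from disk.

    page_count must be supplied by the caller.
    """
    return check_textract_limits(os.path.getsize(path), page_count)
```

A file that fails either check would need to be split into smaller PDFs before being passed to .start_document_analysis.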