How can I extract separated content from questions in a PDF of the ENEM (Brazilian exam)?
Question
I want to extract the questions from an exam to build a dataset. Here is an example page from the ENEM, the specific exam I am working with:
Page 4 - ENEM 2022 (Day 1 / Blue)
This is page 4 of the 2022 edition, available here at "microdados_enem_2022/PROVAS E GABARTIOS/ENEM_2022_P1_CAD_01_DIA_1_AZUL.pdf".
This is a typical page of the exam. To keep things simple, I picked a page whose questions contain no images and fit entirely on that one page. The desired content is also color-coded to show what is what. The objective is to generate a dataset with a list of questions, each one with the following features:
- The text (in yellow)
- The command or statement (in green)
- The alternatives (in blue)
How can I extract these features from the exam to generate the dataset?
I'm trying to use the PyPDF2 library for Python, but I'm having difficulty working out how to process the extracted text to generate the dataset. Here is the code so far:
from PyPDF2 import PdfReader

# Open reader
reader = PdfReader("ENEM_2022_P1_CAD_01_DIA_1_AZUL.pdf")
parts = []

# Defining visitor function: keep only text between the header and footer bands
def visitor_question(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if 50 < y < 720:
        parts.append(text)

# Selecting page
page_index = 3  # page x has index x-1
page = reader.pages[page_index]

# Extracting text
page.extract_text(visitor_text=visitor_question)

# Printing text
text_body = "".join(parts)
print(text_body)
Answer 1
Score: 0
The file structure is good. Download it with:
curl -o 2022-p-cad1-blue.pdf https://download.inep.gov.br/enem/provas_e_gabaritos/2022_PV_impresso_D1_CD1.pdf#page=4
So why not simply export the page to a text file (seen on the right) and parse that in any language?
xpdf-tools-win-4.04\bin32>pdftotext -enc UTF-8 -f 4 -l 4 2022-p-cad1-blue.pdf -
By using -nopgbrk and adding -margint and -marginb, you can remove most of the extra chatter, and then avoid the centre watermark either with a regex or by pulling the left and right halves in two passes per page.
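The "left and right halves" idea can also be tried from Python. The following is only a sketch of a related variant, not the pdftotext approach itself: it reuses the visitor hook from the question's PyPDF2 code and sorts each text fragment into a left or right column in a single pass. It assumes, as the question's code does, that tm[4] and tm[5] are usable as page x/y coordinates. The helper names make_column_visitor and extract_columns are hypothetical, and the 50/720 bands are simply the values from the question. It only separates the columns; it does not remove the watermark.

```python
def make_column_visitor(mid_x, y_min=50, y_max=720):
    """Build a visitor_text callback that sorts text fragments into two columns."""
    left, right = [], []

    def visitor(text, cm, tm, font_dict, font_size):
        # Assumption: tm[4] and tm[5] approximate the fragment's x and y position.
        x, y = tm[4], tm[5]
        if not (y_min < y < y_max):
            return  # skip the header and footer bands
        (left if x < mid_x else right).append(text)

    return visitor, left, right


def extract_columns(pdf_path, page_index):
    """Return the page text with the left column followed by the right column."""
    # Imported here so the helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    page = reader.pages[page_index]
    mid_x = float(page.mediabox.width) / 2  # split at the horizontal midpoint
    visitor, left, right = make_column_visitor(mid_x)
    page.extract_text(visitor_text=visitor)
    return "".join(left) + "\n" + "".join(right)
```

Calling extract_columns("2022-p-cad1-blue.pdf", 3) would then return page 4 with the left column first, which avoids interleaving lines from the two columns.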
To join multiple pages, simply select a range such as -f 2 -l 31, for example with exclusions to avoid the vertical text:
pdftotext -nopgbrk -raw -enc UTF-8 -x 20 -y 50 -W 700 -H 700 -f 2 -l 31 2022-p-cad1-blue.pdf -|findstr /V /R "ENEM 2022" >page2-31.txt
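Once the text is exported, the parsing step could look roughly like the sketch below. It rests on several assumptions that are not confirmed by the answer and should be checked against the real output: each question starts with a line such as "QUESTÃO 01", each alternative starts with a letter A to E followed by a space, paragraphs are separated by blank lines, and the last paragraph before the alternatives is the command. The function names parse_questions and split_block are hypothetical.

```python
import re

# Assumptions (not confirmed by the source): each question starts with a line
# such as "QUESTÃO 01", and each alternative starts with a letter A-E plus a space.
QUESTION_RE = re.compile(r"^QUESTÃO[ \t]+(\d+)[ \t]*$", re.MULTILINE)
ALTERNATIVE_RE = re.compile(r"^([A-E])\s+(.*)$")


def split_block(lines):
    """Separate one question's lines into body lines and an A-E alternatives dict."""
    starts, wanted = {}, list("EDCBA")
    # Scan bottom-up so a body sentence starting with "A " is not mistaken
    # for alternative A.
    for idx in range(len(lines) - 1, -1, -1):
        if not wanted:
            break
        found = ALTERNATIVE_RE.match(lines[idx])
        if found and found.group(1) == wanted[0]:
            starts[wanted.pop(0)] = idx
    if wanted:  # fewer than five markers: keep everything as body text
        return lines, {}
    bounds = [starts[c] for c in "ABCDE"] + [len(lines)]
    alternatives = {}
    for letter, begin, end in zip("ABCDE", bounds, bounds[1:]):
        first = ALTERNATIVE_RE.match(lines[begin]).group(2)
        rest = [ln for ln in lines[begin + 1:end] if ln]
        alternatives[letter] = " ".join([first] + rest)
    return lines[:starts["A"]], alternatives


def parse_questions(raw_text):
    """Turn exported exam text into a list of question dicts."""
    questions = []
    headers = list(QUESTION_RE.finditer(raw_text))
    for i, header in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(raw_text)
        lines = [ln.strip() for ln in raw_text[header.end():end].splitlines()]
        body, alternatives = split_block(lines)
        # Heuristic: paragraphs are separated by blank lines and the last one
        # is the command.
        paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", "\n".join(body))]
        paragraphs = [p for p in paragraphs if p]
        questions.append({
            "number": int(header.group(1)),
            "text": "\n\n".join(paragraphs[:-1]),
            "command": paragraphs[-1] if paragraphs else "",
            "alternatives": alternatives,
        })
    return questions
```

Reading page2-31.txt with open(..., encoding="utf-8") and passing its contents to parse_questions would give a list of dicts that can be written out as JSON or CSV. If the exported text has no blank lines between the text and the command (for example in -raw mode), the command heuristic will need a different rule.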