June 22, 2023, 04:39:46

How can I extract separated content from questions in a PDF of the ENEM (brazilian exam)?

Question

I want to extract the questions from an exam to build a dataset. Here is an example page from the ENEM, the specific exam I am working with:

Page 4 - ENEM 2022 (Day 1 / Blue)

This is page 4 of the 2022 edition, available in the "microdados_enem_2022/PROVAS E GABARTIOS/ENEM_2022_P1_CAD_01_DIA_1_AZUL.pdf" directory.

This is a typical page of the exam. To keep things simple, I picked a page whose questions contain no images and fit entirely on one page. The desired content is color-coded to show which part is which. The objective is to generate a dataset containing a list of questions, each with the following features:

The text (in yellow)
The command or statement (in green)
The alternatives (in blue)

How can I extract these features from this exam to generate the dataset?

I'm trying to use the PyPDF library for Python, but I'm having some difficulty figuring out how to process the extracted text to generate the dataset. Here is the code so far:

from PyPDF2 import PdfReader

# Open reader
reader = PdfReader("ENEM_2022_P1_CAD_01_DIA_1_AZUL.pdf")

parts = []

# Defining visitor function: keep only text between the page header and footer
def visitor_question(text, cm, tm, fontDict, fontSize):
    y = tm[5]  # vertical position of the text fragment
    if y > 50 and y < 720:
        parts.append(text)

# Selecting page
page_index = 3  # page x has index x-1
page = reader.pages[page_index]

# Extracting text
page.extract_text(visitor_text=visitor_question)

# Printing text
text_body = "".join(parts)
print(text_body)

Answer 1

Score: 0

The file structure is good:

curl -o 2022-p-cad1-blue.pdf https://download.inep.gov.br/enem/provas_e_gabaritos/2022_PV_impresso_D1_CD1.pdf#page=4

So why not simply export the page to a text file (seen on the right) and parse that in any language?

xpdf-tools-win-4.04\bin32>pdftotext -enc UTF-8 -f 4 -l 4 2022-p-cad1-blue.pdf -

By using -nopgbrk and adding -margint and -marginb you can remove most of the extra chatter. You can then avoid the centre watermark either with a regex or by pulling the left and right halves in two passes per page.
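The two-pass idea could be sketched in Python roughly as follows. This is only an illustration under assumptions: the helper names `half_page_commands` and `extract_page_by_halves` are hypothetical, the page width of 595 units and the `top`/`height` values are guesses that need tuning against the real PDF, and the crop flags -x, -y, -W and -H are the ones used in the command in this answer, so whether your pdftotext build accepts them is something to check first.

```python
import subprocess


def half_page_commands(pdf, page, page_width=595, top=50, height=700):
    """Build two pdftotext argument lists, one per column of `page`.

    page_width, top and height are assumed values, not measured ones.
    """
    half = page_width // 2
    common = ["pdftotext", "-nopgbrk", "-raw", "-enc", "UTF-8",
              "-f", str(page), "-l", str(page)]
    # Left column: crop box starts at x=0 and spans half the page width.
    left = common + ["-x", "0", "-y", str(top),
                     "-W", str(half), "-H", str(height), pdf, "-"]
    # Right column: crop box starts at the middle of the page.
    right = common + ["-x", str(half), "-y", str(top),
                      "-W", str(page_width - half), "-H", str(height), pdf, "-"]
    return [left, right]


def extract_page_by_halves(pdf, page, **kwargs):
    """Run both passes and return the left text followed by the right text."""
    texts = []
    for cmd in half_page_commands(pdf, page, **kwargs):
        # "-" as the output file makes pdftotext write to stdout.
        result = subprocess.run(cmd, capture_output=True, check=True)
        texts.append(result.stdout.decode("utf-8"))
    return "\n".join(texts)
```

Calling `extract_page_by_halves("2022-p-cad1-blue.pdf", 4)` would then return the left column followed by the right column, provided pdftotext is on the PATH. Shrinking each half by a few units around the middle may help keep the centre watermark out of both passes.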

To join multiple pages, simply select a range, for example -f 2 -l 31, with exclusions to avoid the vertical text:

pdftotext -nopgbrk -raw -enc UTF-8 -x 20 -y 50 -W 700 -H 700 -f 2 -l 31 2022-p-cad1-blue.pdf -|findstr /V /R "ENEM 2022" >page2-31.txt
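Once the text is exported, parsing it into the three features might look like the sketch below. It rests on several assumptions about the exported text that should be verified against the real output: each question starts with a line such as "QUESTÃO 01", each alternative starts on its own line with a letter from A to E followed by a space, and the command is the last blank-line-separated paragraph before alternative A. The function names are hypothetical. Because "A" and "E" are also ordinary Portuguese words, the alternatives are located by scanning backwards from the end of each question, which is a heuristic and can still misfire on wrapped lines.

```python
import re

# Assumed question header format, e.g. "QUESTÃO 01" alone on a line.
QUESTION_RE = re.compile(r"^QUEST[ÃA]O[ \t]+(\d+)[ \t]*$", re.MULTILINE)


def find_alternative_starts(lines):
    """Scan backwards for the last E, then D, C, B, A line; None if incomplete."""
    starts = {}
    limit = len(lines)
    for letter in "EDCBA":
        for idx in range(limit - 1, -1, -1):
            if lines[idx].startswith(letter + " "):
                starts[letter] = idx
                limit = idx
                break
        else:
            return None
    return starts


def parse_block(number, block):
    """Split one question block into text, command and alternatives."""
    lines = block.splitlines()
    starts = find_alternative_starts(lines)
    alternatives = {}
    body_lines = lines
    if starts is not None:
        body_lines = lines[:starts["A"]]
        letters = "ABCDE"
        for i, letter in enumerate(letters):
            end = starts[letters[i + 1]] if i + 1 < len(letters) else len(lines)
            chunk = " ".join(l.strip() for l in lines[starts[letter]:end] if l.strip())
            alternatives[letter] = chunk[2:].strip()  # drop the "A " prefix
    body = "\n".join(body_lines).strip()
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    # Assumption: the command is the last paragraph before the alternatives.
    command = paragraphs[-1] if paragraphs else ""
    text = "\n\n".join(paragraphs[:-1])
    return {"number": number, "text": text,
            "command": command, "alternatives": alternatives}


def parse_questions(raw):
    """Return a list of question dicts from the exported page text."""
    # Drop header/footer lines, like the findstr /V filter in the command above.
    kept = [ln for ln in raw.splitlines() if "ENEM 2022" not in ln]
    cleaned = "\n".join(kept)
    matches = list(QUESTION_RE.finditer(cleaned))
    questions = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(cleaned)
        questions.append(parse_block(int(m.group(1)), cleaned[m.end():end].strip()))
    return questions
```

The resulting list of dictionaries can be dumped with `json.dump` or flattened into CSV rows. If the export contains no blank lines between paragraphs (which may happen with -raw), the whole body ends up in `command` and `text` stays empty, so a different rule for separating the two would be needed.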
