May 17, 2023, 19:13:22

Regex to extract headings and sub-headings from a pdf file using python

Question

I have a PDF file and have extracted all the text from it using pdfplumber. I now need to find all headings and sub-headings in this text, so that I can use them to extract the text that falls under each heading and sub-heading.

My headings look like 1. heading one, 2. heading two, 3. heading three, 4. heading head four is, and so on. They can have a maximum of 5 words.

My subheadings look the same as the headings, such as 1.1 heading one of one, 1.2 heading two of one, 2.1 heading one of two, 3.2 heading two of three, and so on. I have not been able to do this. I tried the following, but it only worked partially: it found some of the headings but no subheadings.

import re
# Define the pattern
pattern = r'^\s*\d+(\.\d+)?\. ((?:\b\w+\b\s*){1,5})'
# Find all matches in the text
matches = re.findall(pattern, text, re.MULTILINE)
print(matches)

I want all the headings and subheadings to be returned in a list, as shown in the expected output below.

Here is sample input data:

text = """
lotsf text text text 
1. Heading one
lots of text lots of text lots of text lot of text
123 456 text text2
0 10 text
1.1 subheading one of one
lot of text lots of text text is all
lot of text.
text and text.
1.2 subheading two of one
i m a ML enginner
i work in M
i do work in oracle also
2. Heading two
text again again text more text
holding on
backup and recovery
2.1 subheading one of two please
text text text text text
2.2 subheading two of two is
text or numbers
10  text 6345
2.3 subheading there of two
000 text 34
0 devices 
so many phone devices
"""

The expected output is:

[1. Heading one, 1.1 subheading one of one, 1.2 subheading two of one, 2. Heading two, 2.1 subheading one of two please, 2.2 subheading two of two is, 2.3 subheading there of two]

Answer 1

Score: 0

The subheadings are not found because of (\.\d+)?\. in your pattern: this part always requires a dot after the heading number(s), and your example subheadings have no dot after the second number (it is 1.1, not 1.1.). To fix this, change the regex to ^(\d+\.\d* (?:\w+ *){1,5})

The changes are:

Expand () to surround everything you want returned, since re.findall returns the capture groups rather than the whole match.
Remove the unnecessary parts of the regex: \s* and \b.
Change the digit part to \d+\.\d* to accept both major and minor headings.
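
As a minimal sketch, the suggested regex can be applied to text shaped like the sample in the question. The sketch also shows one possible way to use the matches to pull out the text under each heading, which was the stated goal. The helper names find_headings and split_sections are illustrative and not from the original post, and the sample text is shortened.

```python
import re

# Shortened sample in the shape described in the question.
text = """
lotsf text text text
1. Heading one
lots of text lots of text
123 456 text text2
0 10 text
1.1 subheading one of one
lot of text.
1.2 subheading two of one
i work in M
2. Heading two
backup and recovery
2.1 subheading one of two please
text text text text text
2.2 subheading two of two is
10  text 6345
2.3 subheading there of two
000 text 34
so many phone devices
"""

# Regex from the answer: number, dot, optional sub-number, a space,
# then one to five words. MULTILINE makes ^ match at every line start.
HEADING = re.compile(r'^(\d+\.\d* (?:\w+ *){1,5})', re.MULTILINE)


def find_headings(text):
    """Return all headings and subheadings in document order."""
    # strip() drops a trailing space the pattern may have consumed.
    return [m.group(1).strip() for m in HEADING.finditer(text)]


def split_sections(text):
    """Map each heading to the text between it and the next heading."""
    matches = list(HEADING.finditer(text))
    sections = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        sections[current.group(1).strip()] = text[current.end():end].strip()
    return sections


headings = find_headings(text)
sections = split_sections(text)
print(headings)
print(sections["1. Heading one"])
```

Note that the pattern stops after five words, so a heading line with more words would be truncated and the remainder would end up in the section body. Lines such as 123 456 text text2 are skipped because no dot follows the leading number.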

Answer 2

Score: 0

Most of this text task should be possible in the shell with nothing more than pdftotext.

pdftotext input.pdf output.txt

You give this text as an example:

lotsf text text text
etc...

So if a second command is run in Windows (or something similar with sed on *nix):

type output.txt |findstr /R "^[0-9]\." >headings.txt

we get a file with the headings:

1. Heading one
1.1 subheading one of one
1.2 subheading two of one
2. Heading two
2.1 subheading one of two please
2.2 subheading two of two is
2.3 subheading there of two

Now it gets slightly harder, as you want [ item 1., item2, item3 ].

The closest to that is

@echo [>list.txt&&@for /f "tokens=*" %f in (headings.txt) do @echo/|@set /p=%f, >>list.txt

followed by echo ]>>list.txt

Result:

[
1. Heading one,  1.1 subheading one of one,  1.2 subheading two of one,  2. Heading two,  2.1 subheading one of two please,  2.2 subheading two of two is,  2.3 subheading there of two, ]

Windows console commands like these often cause two very minor problems:

1. I should have tried to remove the initial line feed, and
2. the loop leaves a final comma where one would not normally be added, although for CSV that's usually not a problem.
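
If the list is built in Python instead, both quirks can be avoided. The sketch below is one possible approach and not part of the original answer: it applies the same filter as the findstr pattern (a single digit followed by a dot at the start of the line) and joins the results so there is no leading line feed and no trailing comma. headings_to_list is a hypothetical helper name, and output.txt is assumed to be the file written by the pdftotext command above.

```python
import re


def headings_to_list(lines):
    """Format heading lines as a single '[a, b, c]' string."""
    # Same filter as findstr /R "^[0-9]\.": re.match anchors at the
    # start of the line, then one digit and a literal dot must follow.
    headings = [line.strip() for line in lines if re.match(r'[0-9]\.', line)]
    # join() only puts separators between items, so no trailing comma.
    return '[' + ', '.join(headings) + ']'


sample = [
    "lotsf text text text",
    "1. Heading one",
    "0 10 text",
    "1.1 subheading one of one",
    "2. Heading two",
]
print(headings_to_list(sample))

# Usage with the pdftotext output (assumed file name):
# with open("output.txt", encoding="utf-8") as handle:
#     print(headings_to_list(handle))
```

Like the findstr pattern, this only recognises heading numbers that start with a single digit, so a heading such as 10. would need a pattern like [0-9]+\. instead.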

So all of that could be a 4 line .cmd file onto which you drag and drop any PDF to instantly output a structured list, for downstream editing in Python or Notepad, or for use as bookmarks in the source PDF.

A potential batch file (the locations will need tweaking for each user):

dropMeApdf2LIST.cmd (a batch file needs a slightly different structure from the command lines above, for example %%f instead of %f)

@echo off
"path to poppler\bin\pdftotext.exe" -enc UTF-8 -nopgbrk "%~1" "%~dpn1.txt"
type "%~dpn1.txt" |findstr /R "^[0-9]\." >"%~dpn1-headings.txt"
echo [>"%~dpn1-list.txt"
for /f "usebackq tokens=*" %%f in ("%~dpn1-headings.txt") do @echo/|@set /p=%%f, >>"%~dpn1-list.txt"
echo ]>>"%~dpn1-list.txt"
notepad "%~dpn1-list.txt"

Dropping this page, saved as a PDF, onto that file gives me a list that, humorously, also includes my own list items 1. and 2. above.
