Regex to extract headings and sub-headings from a pdf file using python
Question
I have a PDF file, and using pdfplumber I have extracted all the text from it. I now need to find all headings and sub-headings in this text, so that I can extract the text under each heading and sub-heading.
My headings look like 1. heading one, 2. heading two, 3. heading three, 4. heading head four is, and so on. They can have a maximum of 5 words.
My sub-headings look the same as the headings, for example 1.1 heading one of one, 1.2 heading two of one, 2.1 heading one of two, 3.2 heading two of three, and so on. I have not been able to do this. I tried the following, but it only partially worked: it found some of the headings but no sub-headings.
import re
# Define the pattern
pattern = r'^\s*\d+(\.\d+)?\. ((?:\b\w+\b\s*){1,5})'
# Find all matches in the text
matches = re.findall(pattern, text, re.MULTILINE)
print(matches)
I want all the headings and sub-headings to be returned in a single list.
Here is sample input data:
text= """
lotsf text text text
1. Heading one
lots of text lots of text lots of text lot of text
123 456 text text2
0 10 text
1.1 subheading one of one
lot of text lots of text text is all
lot of text.
text and text.
1.2 subheading two of one
i m a ML enginner
i work in M
i do work in oracle also
2. Heading two
text again again text more text
holding on
backup and recovery
2.1 subheading one of two please
text text text text text
2.2 subheading two of two is
text or numbers
10 text 6345
2.3 subheading there of two
000 text 34
0 devices
so many phone devices
"""
The expected output is:
[1. Heading one, 1.1 subheading one of one, 1.2 subheading two of one, 2. Heading two, 2.1 subheading one of two please, 2.2 subheading two of two is, 2.3 subheading there of two]
Answer 1
Score: 0
The reason it can't find the sub-headings is the (\.\d+)?\. part: the regex always requires a dot after the heading number(s), and your example sub-headings have no dot after the second number (it's 1.1, not 1.1.). To fix this, change the regex to ^(\d+\.\d* (?:\w+ *){1,5})
- First, expand the parentheses () to surround everything you want returned.
- Remove the unnecessary parts of the regex: \s* and \b.
- Change the digit part to \d+\.\d* so that it accepts both major and minor headings.
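As a minimal sketch of how the corrected pattern behaves, the snippet below runs it on a shortened copy of the question's sample text. It also shows one possible way to slice out the text under each heading, which the question asks for but the answer does not cover. The names HEADING and sections are illustrative, not from the original post. Note that re.findall returns only the captured group when a pattern contains one, which is why wrapping the whole heading in a single group matters.

```python
import re

# Shortened copy of the sample text from the question.
text = """
lotsf text text text
1. Heading one
lots of text lots of text lots of text lot of text
123 456 text text2
1.1 subheading one of one
lot of text.
2. Heading two
backup and recovery
2.1 subheading one of two please
10 text 6345
"""

# One capturing group around the whole heading, so findall returns plain strings.
HEADING = re.compile(r"^(\d+\.\d* (?:\w+ *){1,5})", re.MULTILINE)

headings = HEADING.findall(text)
print(headings)

# Slice the text between consecutive heading matches to get each section body.
matches = list(HEADING.finditer(text))
sections = {}
for current, following in zip(matches, matches[1:] + [None]):
    end = following.start() if following else len(text)
    sections[current.group(1)] = text[current.end():end].strip()

for title, body in sections.items():
    print(repr(title), "->", repr(body))
```

Two caveats with this sketch: any body line that itself starts with a number, a dot, a space and a word (for example an item in a numbered list) would also be picked up as a heading, and a heading longer than five words would be cut off after the fifth word.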
Answer 2
Score: 0
It should be possible to do most of this text task in the shell with nothing more than pdftotext.
pdftotext input.pdf output.txt
You give as example text
lotsf text text text
etc...
so if the second command, on Windows (or something similar with sed on *nix), is
type output.txt |findstr /R "^[0-9]\." >headings.txt
we get a file with the headings
1. Heading one
1.1 subheading one of one
1.2 subheading two of one
2. Heading two
2.1 subheading one of two please
2.2 subheading two of two is
2.3 subheading there of two
Now it gets slightly harder, as you want the form [ item1, item2, item3 ],
so the closest to that is
@echo [>list.txt&&@for /f "tokens=*" %f in (headings.txt) do @echo/|@set /p=%f, >>list.txt
followed by echo ]>>list.txt
result
[
1. Heading one, 1.1 subheading one of one, 1.2 subheading two of one, 2. Heading two, 2.1 subheading one of two please, 2.2 subheading two of two is, 2.3 subheading there of two, ]
Windows console commands like these often leave two very minor problems:
- I should have tried to remove the initial line feed, and
- the loop leaves a final , where one would normally not add one, although for CSV that is usually not a problem.
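If the list is built in Python instead of the batch loop, both artifacts can be avoided by joining the headings with a separator. The following is only a sketch under that assumption: format_heading_list is a hypothetical helper name, and the sample lines are taken from the question.

```python
import re


def format_heading_list(lines):
    """Keep lines that start with one digit and a dot, then join them as '[a, b, c]'."""
    # Mirrors the findstr pattern "^[0-9]\." used in the commands above.
    headings = [line.strip() for line in lines if re.match(r"[0-9]\.", line)]
    # str.join places the separator only between items, so no trailing comma.
    return "[" + ", ".join(headings) + "]"


sample = [
    "lotsf text text text",
    "1. Heading one",
    "123 456 text text2",
    "1.1 subheading one of one",
    "2. Heading two",
]
print(format_heading_list(sample))
```

Like the filter it mirrors, this pattern only matches a single leading digit, so a heading numbered 10. would be skipped. In the Python version, [0-9]+\. would cover that case.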
So all of that could be a 4-line .cmd file onto which you drag and drop any PDF to instantly output a structured list for downstream editing in Python or Notepad, or for use as bookmarks in the source PDF.
A potential batch file (it will need tweaking for the user's own set of locations):
dropMeApdf2LIST.cmd (inside a batch file the command lines need a slightly different structure, such as %%f instead of %f)
@echo off
"path to poppler\bin\pdftotext.exe" -enc UTF-8 -nopgbrk "%~1" "%~dpn1.txt"
type "%~dpn1.txt" |findstr /R "^[0-9]\." >"%~dpn1-headings.txt"
echo [>"%~dpn1-list.txt"
for /f "usebackq tokens=*" %%f in ("%~dpn1-headings.txt") do @echo/|@set /p=%%f, >>"%~dpn1-list.txt"
echo ]>>"%~dpn1-list.txt"
notepad "%~dpn1-list.txt"
Dropping this page, saved as a PDF, onto that file works too and, humorously, the result includes my own list of 1. and 2. from above.