Regex to extract headings and sub-headings from a PDF file using Python

Question

I have a PDF file and have extracted all of its text using pdfplumber. I now need to find all headings and sub-headings in this text, so that I can use them to extract the text under each heading and sub-heading.

My headings look like 1. heading one, 2. heading two, 3. heading three, 4. heading head four is, and so on. They can have at most 5 words.

My sub-headings follow the same form, such as 1.1 heading one of one, 1.2 heading two of one, 2.1 heading one of two, 3.2 heading two of three, and so on. I tried the following, but it only partially worked: it found some of the headings but none of the sub-headings.

import re
# Define the pattern
pattern = r'^\s*\d+(\.\d+)?\. ((?:\b\w+\b\s*){1,5})'
# Find all matches in the text
matches = re.findall(pattern, text, re.MULTILINE)
print(matches)

I want all the headings and sub-headings to be returned in a list, as shown in the expected output below.

Here is sample input data:

text= """
lotsf text text text 

1. Heading one
lots of text lots of text lots of text lot of text
123 456 text text2
0 10 text

1.1 subheading one of one

lot of text lots of text text is all
lot of text.
text and text.

1.2 subheading two of one

i m a ML enginner
i work in M
i do work in oracle also

2. Heading two

text again again text more text
holding on
backup and recovery

2.1 subheading one of two please

text text text text text

2.2 subheading two of two is

text or numbers
10  text 6345

2.3 subheading there of two

000 text 34
0 devices 
so many phone devices
"""

The expected output is:

[1. Heading one, 1.1 subheading one of one, 1.2 subheading two of one, 2. Heading two, 2.1 subheading one of two please, 2.2 subheading two of two is, 2.3 subheading there of two]

Answer 1

Score: 0

The reason it can't find the subheadings is that (\.\d+)?\. always requires a dot after the heading number(s), and your example subheadings don't have a dot after the second number (it's 1.1, not 1.1.). To fix this, change the regex to ^(\d+\.\d* (?:\w+ *){1,5})

  • First, expand () to surround everything you want returned
  • Remove the unnecessary parts of the regex: \s* and \b
  • Change the digit part to \d+\.\d* to accept both major and minor headings
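As a minimal sketch of how the suggested pattern behaves, the snippet below runs it over a shortened version of the sample text. The names HEADING_RE, find_headings, split_sections and sample are illustrative and not from the original post. split_sections assumes every heading fits within the five-word limit stated in the question.

```python
import re

# Pattern suggested in the answer: heading number, a space, then 1 to 5 words.
HEADING_RE = re.compile(r'^(\d+\.\d* (?:\w+ *){1,5})', re.MULTILINE)


def find_headings(text):
    """Return every heading and sub-heading line, in document order."""
    return [m.group(1).strip() for m in HEADING_RE.finditer(text)]


def split_sections(text):
    """Map each heading to the text between it and the next heading."""
    matches = list(HEADING_RE.finditer(text))
    sections = {}
    for cur, nxt in zip(matches, matches[1:] + [None]):
        # The body ends where the next heading starts, or at end of text.
        end = nxt.start() if nxt else len(text)
        sections[cur.group(1).strip()] = text[cur.end():end].strip()
    return sections


# Shortened version of the sample input from the question.
sample = """
lotsf text text text

1. Heading one
lots of text lots of text
123 456 text text2
0 10 text

1.1 subheading one of one

lot of text.

2. Heading two

holding on

2.1 subheading one of two please

10  text 6345
"""

if __name__ == "__main__":
    print(find_headings(sample))
    for heading, body in split_sections(sample).items():
        print(repr(heading), "->", repr(body))
```

Because the whole pattern sits inside a single capturing group, each match yields the complete heading string rather than a tuple of sub-groups. Lines such as "123 456 text text2" are skipped because no dot follows the leading digits.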

Answer 2

Score: 0

It should be possible to do most of the text work in the shell with nothing more than pdftotext.

pdftotext input.pdf output.txt

As an example, you give:

lotsf text text text

etc...

So if a second command is run on Windows (or something similar with sed on *nix):

type output.txt |findstr /R "^[0-9]\." >headings.txt

We get a file with the headings:

1. Heading one
1.1 subheading one of one
1.2 subheading two of one
2. Heading two
2.1 subheading one of two please
2.2 subheading two of two is
2.3 subheading there of two

Now it gets slightly harder, as you want [ item1, item2, item3 ].

The closest to that is:

@echo [>list.txt&&@for /f "tokens=*" %f in (headings.txt) do @echo/|@set /p=%f, >>list.txt

followed by echo ]>>list.txt

Result:

[
1. Heading one,  1.1 subheading one of one,  1.2 subheading two of one,  2. Heading two,  2.1 subheading one of two please,  2.2 subheading two of two is,  2.3 subheading there of two, ]

Windows console commands often cause very minor problems, and here there are two:

  1. There is an initial line feed after the opening bracket, which I should have tried to remove, and
  2. the loop leaves a final trailing comma, which one would normally avoid, though for CSV that is usually not a problem.
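For comparison, the same filter-and-join step can be sketched in Python, which avoids both the leading line feed and the trailing comma. The function name headings_as_list_text is hypothetical, and the pattern deliberately mirrors the findstr expression above, so like that command it only matches single-digit heading numbers (using [0-9]+ instead would be needed for headings numbered 10 and up).

```python
import re


def headings_as_list_text(text):
    """Keep lines starting with one digit and a dot, joined as "[a, b, c]"."""
    # re.match anchors at the start of each line, like ^ in findstr /R.
    lines = [
        line.strip()
        for line in text.splitlines()
        if re.match(r'[0-9]\.', line)
    ]
    return '[' + ', '.join(lines) + ']'


if __name__ == "__main__":
    demo = "intro\n1. Heading one\ntext.\n1.1 subheading one of one\n10 text\n"
    print(headings_as_list_text(demo))
```

Joining with ', '.join places separators only between items, which is why no special handling of the last element is needed.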

So all of that could be a four-line .cmd file onto which you "drag and drop" any PDF to instantly output a structured list for downstream editing in Python or Notepad, or for use as bookmarks in the source PDF.

A potential batch file (it will need tweaking for the user's own set of locations):

dropMeApdf2LIST.cmd (inside a batch file the command lines need a slightly different structure, e.g. %%f instead of %f)

@echo off
"path to poppler\bin\pdftotext.exe" -enc UTF-8 -nopgbrk "%~1" "%~dpn1.txt"
type "%~dpn1.txt" |findstr /R "^[0-9]\." >"%~dpn1-headings.txt"
echo [>"%~dpn1-list.txt"
for /f "usebackq tokens=*" %%f in ("%~dpn1-headings.txt") do @echo/|@set /p=%%f, >>"%~dpn1-list.txt"  
echo ]>>"%~dpn1-list.txt"
notepad "%~dpn1-list.txt"

Dropping this page, saved as a PDF, onto that file produces such a list, which humorously includes my numbered items 1. and 2. above.

huangapple
  • Posted on May 17, 2023, 19:13:22
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76271465.html