pdfplumber table-extract inconsistent columns and stripping spaces

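Question

pdfplumber is the most accurate tool I have found so far for extracting text from a PDF, and it can also extract table data as rows and columns. I have run into two problems with the table function.

  1. A wide column of text (e.g. a description) may or may not be split into smaller columns.
  2. When the split strings are joined to re-form the description column, the whitespace originally at the start and end of each fragment has been removed, so the reassembled text is wrong.

Any advice would be appreciated.

This sample extracts a table from each PDF. The two tables are identical except that the second has fewer rows.
Issue 1:
The table from the first file shows the leftmost column split into several columns, while the identical data in the second table is not split. Is it possible to avoid splitting a column?
Issue 2:
When the first column is split into parts, the whitespace between the parts is removed. For example,
'Balance at 31 December 2020' is split into 'Balance at 31 Decem', 'ber 2020', ''. Simply joining the parts restores the text to 'Balance at 31 December 2020', which is correct. However, 'Total comprehensive income for the year' is split into 'Total compre', 'hensive', 'income for', 'the year', and joining the parts gives 'Total comprehensiveincome forthe year', which is wrong.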
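
To make Issue 2 concrete, here is a minimal sketch that joins the leading cells of two rows copied from the sheet3 output. `merge_description` is a hypothetical helper written only for this illustration (it is not part of pdfplumber), and it assumes the last two cells of each row are the "Notes" and "Share capital" columns.

```python
def merge_description(row, trailing_columns=2):
    """Concatenate every cell except the last `trailing_columns` cells."""
    return "".join(row[:-trailing_columns])


# Two rows copied from the sheet3 output
row_ok = ['Balance at 3', '1 Decem', 'ber 2021', '', '', '13,770']
row_bad = ['Total compre', 'hensive', 'income for', 'the year', 'note3', '-']

print(merge_description(row_ok))   # Balance at 31 December 2021
print(merge_description(row_bad))  # Total comprehensiveincome forthe year
```

The first row survives only because the column boundaries happened to fall inside words. In the second row the boundaries fell on spaces, and those spaces are gone by the time the cells are returned.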

Links to the PDF files:
The file in which pdfplumber splits the first column:
https://www.dropbox.com/s/qlqr27s29vk79j4/pdfdoc-sheet3.pdf?dl=0
The file in which pdfplumber keeps the first column intact:
https://www.dropbox.com/s/0cz8szmph847sin/pdfdoc-sheet4.pdf?dl=0

Sample code:

import pdfplumber

filepaths = ('C:/ProgramData/PythonProgs/pdfdoc-sheet3.pdf',
             'C:/ProgramData/PythonProgs/pdfdoc-sheet4.pdf')

# Text-based strategies infer cell boundaries from word positions
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 5,  # snap_tolerance values 4 - 9 work
}

for filepath in filepaths:
    # Print the file name so each table in the output can be identified
    print(f'-------------- {filepath} ----------------')
    with pdfplumber.open(filepath) as pdf:
        for page in pdf.pages:
            tablelines = page.extract_table(table_settings=table_settings)
            for i, row in enumerate(tablelines):
                print(i, row)

Output:
-------------- C:/ProgramData/PythonProgs/pdfdoc-sheet3.pdf ----------------
0 ['', '', '', '', 'Notes', 'Share']
1 ['', '', '', '', '', 'capital']
2 ['', '', '', '', '', '']
3 ['Balance at 1', 'January', '2021', '', '', '12,000']
4 ['Dividends', '', '', '', '', '-']
5 ['Issue of shar', 'e capital', 'on exercis', 'e of', '', '270']
6 ['Employee sh', 'are-base', 'd compens', 'ation', '', '-']
7 ['Issue of shar', 'e capital', 'on private', 'placement', '', '1,500']
8 ['Transactions', 'with own', 'ers', '', 'note1', '1,770']
9 ['Profit for the', 'year', '', '', 'note2', '-']
10 ['Other compre', 'hensive', 'income', '', '', '-']
11 ['Total compre', 'hensive', 'income for', 'the year', 'note3', '-']
12 ['Balance at 3', '1 Decem', 'ber 2021', '', '', '13,770']
13 ['Balance at 1', 'January', '2020', '', '', '12,000']
14 ['Employee sh', 'are-base', 'd compens', 'ation', '', '-']
15 ['Transactions', 'with own', 'ers', '', '', '-']
16 ['Profit for the', 'year', '', '', '', '-']
17 ['Other compre', 'hensive', 'income', '', '', '-']
18 ['Total compre', 'hensive', 'income for', 'the year', '', '-']
19 ['Balance at 3', '1 Decem', 'ber 2020', '', '', '12,000']
-------------- C:/ProgramData/PythonProgs/pdfdoc-sheet4.pdf ----------------
0 ['', 'Notes', 'Share']
1 ['', '', 'capital']
2 ['', '', '']
3 ['Balance at 1 January 2021', '', '12,000']
4 ['Transactions with owners', 'note1', '1,770']
5 ['Profit for the year', 'note2', '-']
6 ['Other comprehensive income', '', '-']
7 ['', '', '']
8 ['Total comprehensive income for the year', 'note3', '-']
9 ['', '', '']
10 ['Balance at 31 December 2020', '', '12,000']




# Answer 1
**Score**: 2

Can you use the headers as vertical line markers?

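```python
import pdfplumber

# Assumption: page1 is the page from pdfdoc-sheet3.pdf (path taken from the question)
pdf = pdfplumber.open('C:/ProgramData/PythonProgs/pdfdoc-sheet3.pdf')
page1 = pdf.pages[0]

# Locate the column headers; "Share" and "capital" sit on separate lines
headers = [
    page1.search('Notes')[0],
    page1.search(r'Share\s+capital')[0],
]

vlines = [
    1,
    headers[0]['x0'],
    headers[1]['x0'],
    headers[1]['x1'],
]

# NOTE: the expression of this list comprehension was lost in the source;
# line['top'] is an assumption
hlines = [line['top'] for line in page1.vertical_edges]

# we need to add top/bottom lines to get first/last rows
hlines.insert(0, headers[-1]['bottom'])
hlines.append(page1.vertical_edges[-1]['bottom'] + 10)

# Optional: draw the lines on an image to check them visually
# im = page1.to_image(300)
# im.draw_vlines(vlines, stroke_width=3)
# im.draw_hlines(hlines, stroke_width=3)
# im.save('lines.png')

page1.extract_table(dict(
    explicit_vertical_lines=vlines,
    explicit_horizontal_lines=hlines,
))
```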

Table data for page 1:

[['Balance at 1 January 2021', '', '12,000'],
 ['Dividends', '', '-'],
 ['Issue of share capital on exercise of', '', '270'],
 ['Employee share-based compensation', '', '-'],
 ['Issue of share capital on private placement', '', '1,500'],
 ['Transactions with owners', 'note1', '1,770'],
 ['Profit for the year', 'note2', '-'],
 ['Other comprehensive income', '', '-'],
 ['Total comprehensive income for the year', 'note3', '-'],
 ['Balance at 31 December 2021', '', '13,770'],
 ['Balance at 1 January 2020', '', '12,000'],
 ['Employee share-based compensation', '', '-'],
 ['Transactions with owners', '', '-'],
 ['Profit for the year', '', '-'],
 ['Other comprehensive income', '', '-'],
 ['Total comprehensive income for the year', '', '-'],
 ['Balance at 31 December 2020', '', '12,000']]

Doing the same for page 2:

[['Balance at 1 January 2021', '', '12,000'],
 ['Transactions with owners', 'note1', '1,770'],
 ['Profit for the year', 'note2', '-'],
 ['Other comprehensive income', '', '-'],
 ['Total comprehensive income for the year', 'note3', '-'],
 ['Balance at 31 December 2020', '', '12,000']]
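
If the same approach has to be repeated for several pages, the step that turns header matches into vertical lines could be factored out. The following is only a sketch under stated assumptions: `vertical_lines_from_headers` is a hypothetical helper (not a pdfplumber function), the two dicts are made-up stand-ins shaped like the results of `page.search(...)`, and their coordinates are invented for illustration.

```python
def vertical_lines_from_headers(headers, left_edge=1):
    """Return x-positions: the left edge, each header's x0, and the last header's x1."""
    if not headers:
        raise ValueError("need at least one header match")
    # Sort left to right so the caller does not have to care about order
    ordered = sorted(headers, key=lambda h: h["x0"])
    return [left_edge] + [h["x0"] for h in ordered] + [ordered[-1]["x1"]]


# Hypothetical match dicts with invented coordinates
notes = {"text": "Notes", "x0": 300.0, "x1": 330.0, "top": 90.0, "bottom": 100.0}
share = {"text": "Share\ncapital", "x0": 400.0, "x1": 440.0, "top": 80.0, "bottom": 100.0}

vlines = vertical_lines_from_headers([share, notes])
print(vlines)  # [1, 300.0, 400.0, 440.0]
```

The returned list can be passed as `explicit_vertical_lines`, so everything to the left of the first header stays in a single description column.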

Posted by huangapple on 2023-07-06 21:57:29
Link to this article: https://go.coder-hub.com/76629627.html