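Question: pdfplumber table extraction gives inconsistent columns and strips whitespace

pdfplumber is the most accurate tool I have found so far for extracting text from a PDF, and it can also extract table data as rows and columns. I have run into two problems with the table function:

1. A wide column of text (e.g. a description) is sometimes split into smaller columns and sometimes not.
2. When the split strings are joined to rebuild the description column, the whitespace originally at the start and end of each fragment has been removed, so the reassembled text is wrong.

All advice would be appreciated.

The sample below extracts a table from each PDF. The two tables are identical except that the second has fewer rows.

Issue 1:
The table from the first file shows the leftmost column split into several columns, while the identical data in the second table is not split. Is it possible to avoid splitting a column?

Issue 2:
When the first column is split into parts, the whitespace between the parts is removed. For example, 'Balance at 31 December 2020' is split into 'Balance at 3', '1 Decem', 'ber 2020', ''. Simply concatenating the parts restores the text to 'Balance at 31 December 2020', which is correct. However, 'Total comprehensive income for the year' is split into 'Total compre', 'hensive', 'income for', 'the year', and concatenating those parts gives 'Total comprehensiveincome forthe year', which is wrong.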

Links to the PDF files:
The file in which pdfplumber splits the first column:
https://www.dropbox.com/s/qlqr27s29vk79j4/pdfdoc-sheet3.pdf?dl=0
The file in which pdfplumber keeps the first column intact:
https://www.dropbox.com/s/0cz8szmph847sin/pdfdoc-sheet4.pdf?dl=0

Sample code:

```python
import pdfplumber

filepaths = ('C:/ProgramData/PythonProgs/pdfdoc-sheet3.pdf',
             'C:/ProgramData/PythonProgs/pdfdoc-sheet4.pdf')

# Text-based strategies infer the cell boundaries from word positions.
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 5,  # snap_tolerance 4 - 9 works
}

for filepath in filepaths:
    print(f'--------------  {filepath}  ----------------')
    with pdfplumber.open(filepath) as pdf:
        for page in pdf.pages:
            tablelines = page.extract_table(table_settings=table_settings)
            for i, row in enumerate(tablelines):
                print(i, row)
```

Output:

```
--------------  C:/ProgramData/PythonProgs/pdfdoc-sheet3.pdf  ----------------
0 ['', '', '', '', 'Notes', 'Share']
1 ['', '', '', '', '', 'capital']
2 ['', '', '', '', '', '']
3 ['Balance at 1', 'January', '2021', '', '', '12,000']
4 ['Dividends', '', '', '', '', '-']
5 ['Issue of shar', 'e capital', 'on exercis', 'e of', '', '270']
6 ['Employee sh', 'are-base', 'd compens', 'ation', '', '-']
7 ['Issue of shar', 'e capital', 'on private', 'placement', '', '1,500']
8 ['Transactions', 'with own', 'ers', '', 'note1', '1,770']
9 ['Profit for the', 'year', '', '', 'note2', '-']
10 ['Other compre', 'hensive', 'income', '', '', '-']
11 ['Total compre', 'hensive', 'income for', 'the year', 'note3', '-']
12 ['Balance at 3', '1 Decem', 'ber 2021', '', '', '13,770']
13 ['Balance at 1', 'January', '2020', '', '', '12,000']
14 ['Employee sh', 'are-base', 'd compens', 'ation', '', '-']
15 ['Transactions', 'with own', 'ers', '', '', '-']
16 ['Profit for the', 'year', '', '', '', '-']
17 ['Other compre', 'hensive', 'income', '', '', '-']
18 ['Total compre', 'hensive', 'income for', 'the year', '', '-']
19 ['Balance at 3', '1 Decem', 'ber 2020', '', '', '12,000']
--------------  C:/ProgramData/PythonProgs/pdfdoc-sheet4.pdf  ----------------
0 ['', 'Notes', 'Share']
1 ['', '', 'capital']
2 ['', '', '']
3 ['Balance at 1 January 2021', '', '12,000']
4 ['Transactions with owners', 'note1', '1,770']
5 ['Profit for the year', 'note2', '-']
6 ['Other comprehensive income', '', '-']
7 ['', '', '']
8 ['Total comprehensive income for the year', 'note3', '-']
9 ['', '', '']
10 ['Balance at 31 December 2020', '', '12,000']
```
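
The reassembly problem in Issue 2 can be reproduced without pdfplumber. The sketch below takes the fragments of rows 11 and 19 from the first table's output and tries two obvious join strategies; the helper names `join_plain` and `join_spaced` are made up for this illustration. Neither strategy is right for both rows, which suggests the fragments alone do not carry enough information to restore the original spacing.

```python
# Fragments copied from rows 11 and 19 of the first table's output.
rows = {
    "Total comprehensive income for the year":
        ["Total compre", "hensive", "income for", "the year"],
    "Balance at 31 December 2020":
        ["Balance at 3", "1 Decem", "ber 2020", ""],
}


def join_plain(parts):
    """Concatenate the fragments with nothing in between."""
    return "".join(parts)


def join_spaced(parts):
    """Join the non-empty fragments with a single space."""
    return " ".join(p for p in parts if p)


for expected, parts in rows.items():
    for join in (join_plain, join_spaced):
        result = join(parts)
        status = "ok" if result == expected else "WRONG"
        print(f"{join.__name__:12} {status:6} {result!r}")
```

Plain concatenation only works when every split falls inside a word, and space-joining only works when every split falls between words, so a fix has to happen at extraction time rather than afterwards.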




# Answer 1
**Score**: 2

Can you use the headers as vertical line markers?

```python
import pdfplumber

# Assumption: page1 is the page holding the first table (pdfdoc-sheet3.pdf).
pdf = pdfplumber.open('C:/ProgramData/PythonProgs/pdfdoc-sheet3.pdf')
page1 = pdf.pages[0]

# Locate the column headers. The second pattern is a regex because
# the 'Share capital' header wraps across two lines.
headers = [
    page1.search('Notes')[0],
    page1.search(r'Share\s+capital')[0],
]

# Column boundaries: the left edge, then the header positions.
vlines = [
    1,
    headers[0]['x0'],
    headers[1]['x0'],
    headers[1]['x1'],
]

# The key used in this comprehension was lost in the source; 'top' is assumed.
hlines = [line['top'] for line in page1.vertical_edges]

# we need to add top/bottom lines to get first/last rows
hlines.insert(0, headers[-1]['bottom'])
hlines.append(page1.vertical_edges[-1]['bottom'] + 10)

# Optional: draw the lines to check them visually.
# im = page1.to_image(300)
# im.draw_vlines(vlines, stroke_width=3)
# im.draw_hlines(hlines, stroke_width=3)
# im.save('lines.png')

page1.extract_table(dict(
    explicit_vertical_lines=vlines,
    explicit_horizontal_lines=hlines,
))
```

Table data for the first page:

```
[['Balance at 1 January 2021', '', '12,000'],
 ['Dividends', '', '-'],
 ['Issue of share capital on exercise of', '', '270'],
 ['Employee share-based compensation', '', '-'],
 ['Issue of share capital on private placement', '', '1,500'],
 ['Transactions with owners', 'note1', '1,770'],
 ['Profit for the year', 'note2', '-'],
 ['Other comprehensive income', '', '-'],
 ['Total comprehensive income for the year', 'note3', '-'],
 ['Balance at 31 December 2021', '', '13,770'],
 ['Balance at 1 January 2020', '', '12,000'],
 ['Employee share-based compensation', '', '-'],
 ['Transactions with owners', '', '-'],
 ['Profit for the year', '', '-'],
 ['Other comprehensive income', '', '-'],
 ['Total comprehensive income for the year', '', '-'],
 ['Balance at 31 December 2020', '', '12,000']]
```

Doing the same for page 2:

```
[['Balance at 1 January 2021', '', '12,000'],
 ['Transactions with owners', 'note1', '1,770'],
 ['Profit for the year', 'note2', '-'],
 ['Other comprehensive income', '', '-'],
 ['Total comprehensive income for the year', 'note3', '-'],
 ['Balance at 31 December 2020', '', '12,000']]
```
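
To illustrate why fixed column boundaries avoid both problems, here is a minimal sketch that buckets the words of one row into columns by their left x-position and joins each bucket with spaces. It is not pdfplumber's internal algorithm: `assign_to_columns` is a hypothetical helper, the word dicts only mimic the shape returned by `page.extract_words()` (`text` and `x0` keys), and every coordinate is invented for the example.

```python
from bisect import bisect_right


def assign_to_columns(words, vlines):
    """Group the words of one table row into columns.

    words  : dicts with 'text' and 'x0' keys (hypothetical sample data).
    vlines : sorted x-positions of the column boundaries.
    Returns one string per column, with words joined by single spaces.
    """
    n_cols = len(vlines) - 1
    cells = [[] for _ in range(n_cols)]
    for word in sorted(words, key=lambda w: w["x0"]):
        # Index of the column whose left boundary is at or before x0.
        col = bisect_right(vlines, word["x0"]) - 1
        if 0 <= col < n_cols:  # skip words outside the table
            cells[col].append(word["text"])
    return [" ".join(cell) for cell in cells]


# Invented coordinates, for illustration only.
row = [
    {"text": "Total", "x0": 40.0},
    {"text": "comprehensive", "x0": 70.0},
    {"text": "income", "x0": 140.0},
    {"text": "for", "x0": 175.0},
    {"text": "the", "x0": 192.0},
    {"text": "year", "x0": 210.0},
    {"text": "note3", "x0": 305.0},
    {"text": "-", "x0": 420.0},
]
# Left edge, 'Notes' x0, 'Share capital' x0 and x1 (all assumed values).
vlines = [1, 300.0, 380.0, 450.0]

print(assign_to_columns(row, vlines))
```

Because the boundaries come from the headers rather than from gaps in the text, a long description cannot be cut in the middle of a word, and whole words are joined with their separating space intact.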
