June 30, 2023 04:42:31

pdfplumber python extract_tables setting for the specific strategy

Question


I have this PDF and I'm trying to extract a table from it. What is the best strategy to get the table? I am not able to get specific values from the table. For example, in the first table I need to get [70,75,80,85,90,100,105,110,115,120] and, for the second line, [0,0,2,6,10,10,10,2,2,0,0].

My final result would be: 411924, KGDHN, MBELT W 40 INT, T.GG SUPREME/SELLERIA, 9643 BEIGE EBONY/COCOA, [70,75,80,85,90,100,105,110,115,120], [0,0,2,6,10,10,10,2,2,0,0], 42, 200.00, 8, 400.00

import pdfplumber

# doc is the path to the PDF file
with pdfplumber.open(doc) as pdf:
    print(pdf.pages)
    page = pdf.pages[0]
    im = page.to_image(resolution=400)
    text = page.extract_words()
    # draw a rectangle around every extracted word for visual debugging
    im = im.draw_rects(text)
    im.show()
    # h = open('empty_test' + '.json', "w")
    # json.dump(text, h, indent=2, sort_keys=False)
    # h.close()

It is a PDF with text. I can extract the text easily and keep the layout almost the same:

with pdfplumber.open(doc) as pdf:
    for page in pdf.pages:
        for line in page.extract_text(keep_blank_chars=False, layout=True).splitlines():
            print(line)

Answer 1

Score: 0


The idea is to isolate the smallest area around the values via cropping.

You can then use the x0 position of each word as a vertical line.

You can pass those lines to the table settings via explicit_vertical_lines, which will give back empty strings for the "blank" cells:

col1.extract_text()='406831 DJ20N\n1000 NERO'
MBELT W.40 GG MAR DOLLAR PIGPRINT
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '0', '0', '2', '6', '10', '10', '10', '2', '2', '', '']
col3.extract_text()='42 218.00 9,15\n9,156.0'
col1.extract_text()='414516 0YA0G\n1000 BLACK'
MBELT W.30 GG MAR. PLUTONE CALF
['65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '135']
['0', '0', '0', '2', '6', '15', '15', '15', '2', '2', '0', '0', '']
col3.extract_text()='57 205.00 11,6\n11,685.0'
col1.extract_text()='406831 0YA0G\n1000 BLACK'
MBELT W.40 GG MAR PLUTONE CALF
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '2', '2', '2', '2', '2', '1', '', '']
col3.extract_text()='11 218.00 2,39\n2,398.0'
col1.extract_text()='627055 92TIN\n9769 B.EBONY/NERO'
MBELT W.37GG M.R T.GG SUPREME/PLUTONE CALF
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '3', '3', '3', '3', '3', '1', '', '']
col3.extract_text()='16 244.00 3,90\n3,904.0'
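
To see why the blank cells come back as empty strings, it can help to model the idea without pdfplumber. The sketch below is not pdfplumber's implementation; it only assumes that each word is a dict with `text` and `x0` keys (the shape `extract_words()` returns) and puts every word into the column whose left edge is the last vertical line at or before the word's x0. The helper name `bucket_by_vlines` and all coordinates are made up for illustration.

```python
from bisect import bisect_right


def bucket_by_vlines(words, vlines):
    """Place each word in the column whose left edge is the last vline <= x0."""
    vlines = sorted(vlines)
    cells = [''] * len(vlines)
    for word in words:
        idx = bisect_right(vlines, word['x0']) - 1
        if idx < 0:
            continue  # word starts left of the first line; ignore it
        cells[idx] = (cells[idx] + ' ' + word['text']).strip()
    return cells


# Hypothetical positions: the header row has a value in every column,
# the quantity row leaves the first and last columns blank.
header = [
    {'text': '70', 'x0': 10.0},
    {'text': '75', 'x0': 30.0},
    {'text': '80', 'x0': 50.0},
    {'text': '85', 'x0': 70.0},
]
quantities = [{'text': '2', 'x0': 31.5}, {'text': '6', 'x0': 51.5}]

# The start of each header word acts as a vertical line.
vlines = [word['x0'] for word in header]
header_cells = bucket_by_vlines(header, vlines)
quantity_cells = bucket_by_vlines(quantities, vlines)
print(header_cells)
print(quantity_cells)
```

Columns that receive no word keep their initial empty string, which mirrors the `''` entries in the output above.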

There are various ways you could approach this, but the steps I've used here are:

use Description header line to identify the "top" of the rows if present
split into rows based on the thick/dark horizontal lines
split each row into columns based on vertical lines within row
use the method above for creating a table from column 2

The thick horizontal lines are what we can use to split the page into rows.

The vertical lines within each row are what we can use to divide it into columns.

import itertools

# `page` is a pdfplumber page, e.g. pdf.pages[0]

# First thick horizontal line that spans more than 80% of the page width
product_line = next(
    line for line in page.horizontal_edges
    if  line['orientation'] == 'h'
    and line['linewidth'] > 1
    and line['width'] > page.width / 1.25
)

# Search for the Description header
description = page.search('Description/Size Quantity Qty Price Value')
has_description = len(description) > 0

# If there is a description we crop below it, else we use the line divider
if has_description:
    description = description[0]
    product_area_top = description['bottom'] + 10
else:
    product_area_top = product_line['top']

product_area = page.crop(
    (product_line['x0'], product_area_top, product_line['x1'], page.height)
)

# Find horizontal lines for rows
hlines = [
    line['top'] for line in product_area.edges
    if  line['orientation'] == 'h'
    and line['stroking_color'] == (0, 0, 0)
    and line['width'] > product_area.width / 1.25
]

# If there is no description on the page we need to add in the top as the
# first line (in order to extract row 1)
if not has_description:
    hlines = [product_area_top] + hlines

# Make sure our lines are sorted from top -> bottom
hlines = sorted(set(hlines))

# itertools.pairwise requires Python 3.10+
for top, bottom in itertools.pairwise(hlines):
    # NOTE: x1 is passed as the crop's width, not its right edge (bbox[2]);
    # if the crop does not start at x = 0 the right side may be clipped
    row = product_area.crop(
        (product_area.bbox[0], top, product_area.width, bottom)
    )

    # Vertical lines to create columns
    vlines = [
        line['x0'] for line in row.vertical_edges
        if line['object_type'] == 'line'
    ]

    # We need to add an end line to extract the last column
    vlines = sorted(vlines + [row.width])

    col1 = row.crop((vlines[0], top, vlines[1], bottom))
    col2 = row.crop((vlines[1], top, vlines[2], bottom))
    col3 = row.crop((vlines[2], top, vlines[3], bottom))

    lines = col2.extract_text_lines()

    # lines 1-2 are the values, use their positions to crop
    bbox = lines[1]['x0'], lines[1]['top'], lines[-1]['x1'], lines[-2]['bottom']
    values = col2.crop(bbox)

    # Use the start of each word as a vertical line edge
    vlines = [word['x0'] for word in values.extract_words()]

    table = values.extract_table(dict(
        explicit_vertical_lines=vlines
    ))

    print(f'{col1.extract_text()=}')
    print(lines[0]['text'], table[0], table[1], sep='\n')
    print(f'{col3.extract_text()=}')
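
The loop above relies on `itertools.pairwise`, which was added in Python 3.10. On older interpreters a small stand-in should do the same job. The helper name `pairwise_compat` and the sample coordinates below are hypothetical; the sketch only shows how a sorted list of line positions turns into (top, bottom) row bands.

```python
from itertools import tee


def pairwise_compat(iterable):
    """Yield consecutive overlapping pairs: (a, b), (b, c), ..."""
    first, second = tee(iterable)
    next(second, None)  # advance the second iterator by one item
    return zip(first, second)


# Hypothetical 'top' positions of the thick horizontal lines on one page,
# including a duplicate that the set removes.
hlines = sorted({412.5, 130.0, 271.25, 130.0})
row_bands = list(pairwise_compat(hlines))
print(row_bands)
```

Each pair can then be used as the `top` and `bottom` of one row crop.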

You may be able to extract the other values simply from the text extraction methods, or you could use a similar cropping technique on col1, col3.
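
As one possible illustration of the plain-text route, the first column's text in the output above has two codes on its first line and a colour code plus colour name on its second, so simple string splitting may be enough. This is a sketch that assumes every row follows that two-line layout; the function name `parse_col1` and the field names are invented here and are only guesses at what the values mean.

```python
def parse_col1(text):
    """Split the two-line text of column 1 into four named fields."""
    first_line, second_line = text.split('\n', 1)
    code, variant = first_line.split(maxsplit=1)
    colour_code, colour_name = second_line.split(maxsplit=1)
    return {
        'code': code,                # hypothetical field name
        'variant': variant,          # hypothetical field name
        'colour_code': colour_code,  # hypothetical field name
        'colour_name': colour_name,  # hypothetical field name
    }


first = parse_col1('406831 DJ20N\n1000 NERO')
last = parse_col1('627055 92TIN\n9769 B.EBONY/NERO')
print(first)
print(last)
```

If a row deviates from the assumed layout, the tuple unpacking raises a ValueError, which makes such rows easy to spot.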

Answer 2

Score: 0


Here's another possible approach which may be simpler.

You can match just those specific numbers and save the x0 position so you have the full "width".

You can then create a full "blank" row which you can merge with each row to ensure each row gets the same number of columns.

import itertools
import pdfplumber
import re
from operator import itemgetter

pdf  = ...
page = ...

# Keep only the non-bold, size-6.75 words that consist of digits
nums = [
    word for word in page.extract_words(extra_attrs=['size', 'fontname'])
    if  re.fullmatch(r'\d+', word['text'])
    and word['size'] == 6.75
    and not word['fontname'].endswith('Bold')
]

# Remove duplicates based on x0
cols = {}
for num in nums:
    cols[num['x0']] = num

# Replace text with blank (dict union with | requires Python 3.9+)
blanks = {
    col['x0']: col | {'text': ''} for col in cols.values()
}

# Use 'top' position to group into rows
for top, row in itertools.groupby(nums, key=itemgetter('top')):
    row = blanks | {col['x0']: col for col in row}
    print([col['text'] for col in row.values()])

['70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '', '']
['', '', '2', '6', '10', '10', '10', '2', '2', '', '', '', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '0', '0', '2', '6', '10', '10', '10', '2', '2', '', '']
['65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '135']
['0', '0', '0', '2', '6', '15', '15', '15', '2', '2', '0', '0', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '2', '2', '2', '2', '2', '1', '', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '3', '3', '3', '3', '3', '1', '', '']
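
The merge step can be tried without a PDF. The sketch below uses made-up word dicts (only `text`, `x0` and `top`) and shows how merging a blank row into each real row pads every row to the full set of columns. Two things differ from the code above and are choices made for this sketch: the blank row is sorted by x0 so columns come out left to right regardless of the order in which positions were first seen, and `{**a, **b}` is used instead of `a | b` so it also runs on Python versions before 3.9.

```python
from itertools import groupby
from operator import itemgetter

# Made-up words: two rows, the second row has no value in the first column.
nums = [
    {'text': '70', 'x0': 10.0, 'top': 100.0},
    {'text': '75', 'x0': 30.0, 'top': 100.0},
    {'text': '80', 'x0': 50.0, 'top': 100.0},
    {'text': '2',  'x0': 30.0, 'top': 112.0},
    {'text': '6',  'x0': 50.0, 'top': 112.0},
]

# One blank placeholder per distinct x0, ordered left to right.
blanks = {x0: '' for x0 in sorted({num['x0'] for num in nums})}

rows = []
# groupby only groups adjacent items, so the words must already be
# ordered by 'top' (they are in this sample).
for top, group in groupby(nums, key=itemgetter('top')):
    # Keys keep the order of `blanks`; values come from the row where present.
    row = {**blanks, **{num['x0']: num['text'] for num in group}}
    rows.append(list(row.values()))

print(rows)
```

The approach depends on values in the same column sharing exactly the same x0; if they differ slightly, rounding x0 before using it as a key may be needed.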
