pdfplumber Python: extract_tables settings for a specific strategy

Question


I have this PDF and I'm trying to extract the tables from it. What is the best strategy to get the tables? I can't get specific values out of a table. For example, in the first table I need to get [70,75,80,85,90,100,105,110,115,120] from the first line and [0,0,2,6,10,10,10,2,2,0,0] from the second line.

My final result would be: 411924,KGDHN,MBELT W 40 INT,T.GG SUPREME/SELLERIA,9643 BEIGE EBONY/COCOA,[70,75,80,85,90,100,105,110,115,120],[0,0,2,6,10,10,10,2,2,0,0],42,200.00,8,400.00


import pdfplumber

# doc is the path to the PDF file
with pdfplumber.open(doc) as pdf:
    print(pdf.pages)
    page = pdf.pages[0]
    im = page.to_image(resolution=400)
    text = page.extract_words()
    # draw a rectangle around every extracted word to inspect the layout
    im = im.draw_rects(text)
    im.show()
    # h = open('empty_test' + '.json', "w")
    # json.dump(text, h, indent=2, sort_keys=False)
    # h.close()


It is a PDF with text. I can extract the text easily and keep the layout almost the same:

import pdfplumber

with pdfplumber.open(doc) as pdf:
    for page in pdf.pages:
        for line in page.extract_text(keep_blank_chars=False, layout=True).splitlines():
            print(line)


Answer 1

Score: 0


The idea is to isolate the smallest area around the values via cropping.

You can then use the x0 position of each word as a vertical line.

You can pass these lines to the table settings via explicit_vertical_lines, which gives back empty strings for the "blank" cells:

col1.extract_text()='406831 DJ20N\n1000 NERO'
MBELT W.40 GG MAR DOLLAR PIGPRINT
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '0', '0', '2', '6', '10', '10', '10', '2', '2', '', '']
col3.extract_text()='42 218.00 9,15\n9,156.0'
col1.extract_text()='414516 0YA0G\n1000 BLACK'
MBELT W.30 GG MAR. PLUTONE CALF
['65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '135']
['0', '0', '0', '2', '6', '15', '15', '15', '2', '2', '0', '0', '']
col3.extract_text()='57 205.00 11,6\n11,685.0'
col1.extract_text()='406831 0YA0G\n1000 BLACK'
MBELT W.40 GG MAR PLUTONE CALF
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '2', '2', '2', '2', '2', '1', '', '']
col3.extract_text()='11 218.00 2,39\n2,398.0'
col1.extract_text()='627055 92TIN\n9769 B.EBONY/NERO'
MBELT W.37GG M.R T.GG SUPREME/PLUTONE CALF
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '3', '3', '3', '3', '3', '1', '', '']
col3.extract_text()='16 244.00 3,90\n3,904.0'
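
To make the explicit_vertical_lines idea concrete, here is a minimal pure-Python sketch of the same column logic, written without pdfplumber so it can be run on its own. It assumes words are dicts with text, x0 and top keys (the shape extract_words() returns). The helper name cells_from_words and the sample coordinates are made up for illustration, and this is not how pdfplumber implements table extraction internally.

```python
from bisect import bisect_right


def cells_from_words(words, vlines):
    """Put each word in the column whose left edge is the closest vline at or before its x0."""
    vlines = sorted(vlines)
    rows = {}
    for word in words:
        col = bisect_right(vlines, word["x0"]) - 1
        if col < 0:
            continue  # word sits left of the first line, so it belongs to no column
        row = rows.setdefault(word["top"], [""] * len(vlines))
        row[col] = word["text"]
    # one list of cells per distinct top position, ordered top to bottom
    return [rows[top] for top in sorted(rows)]


# Hypothetical words: a header row of sizes and a sparser row of quantities
words = [
    {"text": "60", "x0": 10.0, "top": 5.0},
    {"text": "65", "x0": 30.0, "top": 5.0},
    {"text": "70", "x0": 50.0, "top": 5.0},
    {"text": "2", "x0": 31.5, "top": 15.0},
    {"text": "6", "x0": 51.5, "top": 15.0},
]

# Simplification: only the header words define the column edges
header_top = min(word["top"] for word in words)
vlines = [word["x0"] for word in words if word["top"] == header_top]

table = cells_from_words(words, vlines)
print(table)
```

The quantity row has no word in the first column, so that cell stays an empty string, which mirrors the blank cells in the output above.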

There are various ways you could approach this, but the steps I've used here are:

  • use the Description header line, if present, to identify the "top" of the rows
  • split the page into rows based on the thick/dark horizontal lines
  • split each row into columns based on the vertical lines within the row
  • use the method above to create a table from column 2
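
The row-splitting step can be sketched on its own as follows. This is a simplified illustration under assumptions: row_bboxes is a hypothetical helper, the coordinates are invented, and zip over adjacent items is used in place of itertools.pairwise so that the snippet also runs on Python versions before 3.10.

```python
def row_bboxes(x0, x1, hlines):
    """Turn horizontal line positions into (x0, top, x1, bottom) boxes, one per row."""
    tops = sorted(set(hlines))  # drop duplicates and order top -> bottom
    # pair each line with the next one; same pairs as itertools.pairwise(tops)
    return [(x0, top, x1, bottom) for top, bottom in zip(tops, tops[1:])]


# Hypothetical line positions; unsorted and duplicated on purpose,
# to show why the sort and the set are needed
hlines = [310.0, 120.0, 215.5, 120.0]

boxes = row_bboxes(28.0, 567.0, hlines)
for box in boxes:
    print(box)
```

Each tuple has the (x0, top, x1, bottom) shape that page.crop() takes as a bounding box, so in a real script every box would become one cropped row.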

The thick horizontal lines are used to split the product area into rows, and the vertical lines within each row are used to divide it into columns:

import itertools

# First thick horizontal line that spans more than 80% of the page width
product_line = next(
    line for line in page.horizontal_edges
    if  line['orientation'] == 'h'
    and line['linewidth'] > 1
    and line['width'] > page.width / 1.25
)

# Search for the Description header
description = page.search('Description/Size Quantity Qty Price Value')
has_description = len(description) > 0

# If there is a description we crop just below it, else we crop at the line divider
if has_description:
    product_area_top = description[0]['bottom'] + 10
else:
    product_area_top = product_line['top']

product_area = page.crop(
    (product_line['x0'], product_area_top, product_line['x1'], page.height)
)

# Find the horizontal lines that separate the rows
hlines = [
    line['top'] for line in product_area.edges
    if  line['orientation'] == 'h'
    and line['stroking_color'] == (0, 0, 0)
    and line['width'] > product_area.width / 1.25
]

# If there is no description on the page we need to add the top as the first line (in order to extract row 1)
if not has_description:
    hlines = [product_area_top] + hlines

# Make sure our lines are unique and sorted from top -> bottom
hlines = sorted(set(hlines))

# itertools.pairwise requires Python 3.10+
for top, bottom in itertools.pairwise(hlines):
    row = product_area.crop(
        (product_area.bbox[0], top, product_area.width, bottom)
    )

    # Vertical lines to create columns
    vlines = [
        line['x0'] for line in row.vertical_edges
        if line['object_type'] == 'line'
    ]

    # We need to add an end line to extract the last column
    vlines = sorted(vlines + [row.width])

    col1 = row.crop((vlines[0], top, vlines[1], bottom))
    col2 = row.crop((vlines[1], top, vlines[2], bottom))
    col3 = row.crop((vlines[2], top, vlines[3], bottom))

    lines = col2.extract_text_lines()

    # lines[0] is the description; the lines after it hold the values, so use their positions to crop
    bbox = lines[1]['x0'], lines[1]['top'], lines[-1]['x1'], lines[-2]['bottom']

    values = col2.crop(bbox)

    # Use the start of each word as a vertical line edge
    vlines = [word['x0'] for word in values.extract_words()]

    table = values.extract_table(dict(
        explicit_vertical_lines = vlines
    ))

    print(f'{col1.extract_text()=}')
    print(lines[0]['text'], table[0], table[1], sep='\n')
    print(f'{col3.extract_text()=}')

You may be able to extract the other values simply from the text extraction methods, or you could use a similar cropping technique on col1, col3.
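
As one possible way to move toward the record the question asks for, the extracted strings can be post-processed in plain Python. The sketch below uses the values printed for the first product above. The helper to_ints, the field names style and material, and the decision to treat blank cells as 0 are all assumptions made for illustration, not part of the original answer.

```python
def to_ints(cells, blank=0):
    """Convert extracted cell strings to ints, mapping empty cells to `blank`."""
    return [int(cell) if cell.strip() else blank for cell in cells]


# Values copied from the first product in the output above
col1_text = "406831 DJ20N\n1000 NERO"
description = "MBELT W.40 GG MAR DOLLAR PIGPRINT"
sizes = ['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
quantities = ['', '', '0', '0', '2', '6', '10', '10', '10', '2', '2', '', '']

# Assumption: the first line of col1 holds two codes and the second line the colour
first_line, colour = col1_text.split("\n", 1)
style, material = first_line.split()

record = [style, material, description, colour, to_ints(sizes), to_ints(quantities)]
print(record)
print(sum(to_ints(quantities)))
```

The quantities sum to 42, which agrees with the first number in the col3 text for that product, so the sum can serve as a quick sanity check on the extraction.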

Answer 2

Score: 0


Here's another possible approach, which may be simpler.

You can match just those specific numbers and save their x0 positions so that you have the full "width" of the table.

You can then create a full "blank" row and merge it with each row, so that every row gets the same number of columns.

import itertools
import pdfplumber
import re
from operator import itemgetter

pdf  = ...
page = ...

# keep only the purely numeric words in the small, non-bold font
nums = [
    word for word in page.extract_words(extra_attrs=['size', 'fontname'])
    if  re.fullmatch(r'\d+', word['text'])
    and word['size'] == 6.75
    and not word['fontname'].endswith('Bold')
]

# remove duplicates based on x0
cols = {}
for num in nums:
    cols[num['x0']] = num

# replace text with blank (the dict | operator requires Python 3.9+)
blanks = {
    col['x0']: col | {'text': ''} for col in cols.values()
}

# use 'top' position to group into rows
for top, row in itertools.groupby(nums, key=itemgetter('top')):
    row = blanks | { col['x0']: col for col in row }
    print([col['text'] for col in row.values()])

This prints:

['70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '', '']
['', '', '2', '6', '10', '10', '10', '2', '2', '', '', '', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '0', '0', '2', '6', '10', '10', '10', '2', '2', '', '']
['65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '135']
['0', '0', '0', '2', '6', '15', '15', '15', '2', '2', '0', '0', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '2', '2', '2', '2', '2', '1', '', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '3', '3', '3', '3', '3', '1', '', '']
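
The "blank row" merge can be tried without a PDF. The following is a small sketch with invented word dicts. Unlike the code above, it sorts the x0 keys so the columns run left to right, stores plain strings instead of whole word dicts, and uses dict unpacking rather than the | operator so it also runs on Python versions before 3.9.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical words in reading order: a header row and a sparser quantity row
nums = [
    {"text": "60", "x0": 10.0, "top": 5.0},
    {"text": "65", "x0": 30.0, "top": 5.0},
    {"text": "70", "x0": 50.0, "top": 5.0},
    {"text": "2", "x0": 30.0, "top": 15.0},
]

# One blank placeholder per distinct x0, ordered left to right
blanks = {x0: "" for x0 in sorted({num["x0"] for num in nums})}

rows = []
# groupby only groups consecutive items, so nums must already be ordered by top
for top, group in groupby(nums, key=itemgetter("top")):
    # the merge keeps the key order of blanks and overwrites the slots this row fills
    merged = {**blanks, **{num["x0"]: num["text"] for num in group}}
    rows.append(list(merged.values()))

print(rows)
```

This relies on words in the same column sharing exactly the same x0. If the values in a real document are offset slightly from each other, the keys may need rounding or a tolerance before merging.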

huangapple
  • Posted on June 30, 2023 at 04:42:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76584489.html