英文:
pdfplumber python extract_tables setting for the specific strategy
问题
I can help you with the Chinese translation. Here's the translated content:
我有这个PDF文件,我正在尝试从中提取表格。获取表格的更好策略是什么?我无法获取表格中的特定值,例如在第一个表格中,我需要获取[70,75,80,85,90,100,105,110,115,120],而在第二行中需要获取[0,0,2,6,10,10,10,2,2,0,0]。
我的最终结果将是:411924,KGDHN,MBELT W 40 INT,T.GG SUPREME/SELLERIA,9643 BEIGE EBONY/COCOA,[70,75,80,85,90,100,105,110,115,120],[0,0,2,6,10,10,10,2,2,0,0],42,200.00,8,400.00
这是一个包含文本的PDF。我可以轻松提取文本,并保持布局几乎不变。
对于PDF中的每一页,您可以使用以下代码提取文本并保持布局几乎不变:
import pdfplumber
with pdfplumber.open(doc) as pdf:
for page in pdf.pages:
for line in page.extract_text(keep_blank_chars=False, layout=True).splitlines():
print(line)
请注意,这是提取文本的代码示例。如果需要进一步处理表格数据,请提供额外的信息。
英文:
i've this pdf, I'm trying to extract table from pdf. Wwhat is the better strategy to get the table? I can not be able to get the value specific on table , for example in the first table , i 've to get [70,75,80,85,90,100,105,110,115,120] and for the second line [0,0,2,6,10,10,10,2,2,0,0]
My final result would be : 411924,KGDHN,MBELT W 40 INT, T.GG SUPREME/SELLERIA, 9643 BEIGE EBONY/COCOA, [70,75,80,85,90,100,105,110,115,120] ,[0,0,2,6,10,10,10,2,2,0,0],42,200.00,8,400.00
with pdfplumber.open(doc) as pdf:
print(pdf.pages)
page = pdf.pages[0]
im = page.to_image(resolution = 400)
text = page.extract_words()
im = im.draw_rects(page.extract_words())
im.show()
# h = open('empty_test' + '.json', "w")
# json.dump(text, h, indent=2, sort_keys=False)
# h.close()
It is a PDF with text. I can extract the text easily, and keep the layout almost the same
for page in pdf.pages:
for line in page.extract_text(keep_blank_chars=False, layout=True).splitlines():
print(line)
答案1
得分: 0
以下是翻译好的部分:
The idea is to isolate the smallest area around the values via cropping:
想法是通过裁剪来隔离值周围的最小区域:
You can then use the x0
position of each word as your vertical line.
然后,您可以使用每个单词的 x0
位置作为垂直线。
You can pass the lines to table settings via explicit_vertical_lines
which will give back empty strings for the "blank" cells.
您可以通过 explicit_vertical_lines
将这些行传递给 table settings,它将为“空白”单元格返回空字符串。
These are the horizontal lines we can use to split into rows:
这些是我们可以用来分隔成行的水平线:
These are the vertical lines we can use to divide the columns:
这些是我们可以用来分隔列的垂直线:
import itertools...
import itertools
# First thick horizontal line that span > width % of page
product_line = next(
line for line in page.horizontal_edges
if line['orientation'] == 'h'
and line['linewidth'] > 1
and line['width'] > page.width / 1.25
)
# Search for Description
description = page.search('Description/Size Quantity Qty Price Value')
has_description = len(description) > 0
# If there is a description we crop there, else we use the line divider
if has_description:
description = description[0]
product_area_top = description['bottom'] + 10
product_area = page.crop(
(product_line['x0'], product_area_top, product_line['x1'], page.height)
)
else:
product_area_top = product_line['top']
product_area = page.crop(
(product_line['x0'], product_area_top, product_line['x1'], page.height)
)
# find horizontal lines for rows
hlines = [
line['top'] for line in product_area.edges
if line['orientation'] == 'h'
and line['stroking_color'] == (0, 0, 0)
and line['width'] > product_area.width / 1.25
]
# If there is no description on the page we need to add in the top as the first line (in order to extract row 1)
if has_description is False:
hlines = [product_area_top] + hlines
# Make sure our lines are sorted from top -> bottom
hlines = sorted(set(hlines))
for top, bottom in itertools.pairwise(hlines):
row = product_area.crop(
(product_area.bbox[0], top, product_area.width, bottom)
)
# vertical lines to create columns
vlines = [
line['x0'] for line in row.vertical_edges
if line['object_type'] == 'line'
]
# we need to add an end line to extract last column
vlines = sorted(vlines + [row.width])
col1 = row.crop((vlines[0], top, vlines[1], bottom))
col2 = row.crop((vlines[1], top, vlines[2], bottom))
col3 = row.crop((vlines[2], top, vlines[3], bottom))
lines = col2.extract_text_lines()
# lines 1-2 are the values, use their positions to crop
bbox = lines[1]['x0'], lines[1]['top'], lines[-1]['x1'], lines[-2]['bottom']
values = col2.crop(bbox)
# use start of each word as a vertical line edge
vlines = [word['x0'] for word in values.extract_words()]
table = values.extract_table(dict(
explicit_vertical_lines = vlines
))
print(f'{col1.extract_text()=}')
print(lines[0]['text'], table[0], table[1], sep='\n')
print(f'{col3.extract_text()=}')
You may be able to extract the other values simply from the text extraction methods, or you could use a similar cropping technique on col1
, col3
.
您可以尝试从文本提取方法中简单提取其他值,或者您可以在 col1
、col3
上使用类似的裁剪技术。
英文:
The idea is to isolate the smallest area around the values via cropping:
<blockquote>
</blockquote>
You can then use the x0
position of each word as your vertical line.
<blockquote>
</blockquote>
You can pass the lines to table settings via explicit_vertical_lines
which will give back empty strings for the "blank" cells.
col1.extract_text()='406831 DJ20N\n1000 NERO'
MBELT W.40 GG MAR DOLLAR PIGPRINT
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '0', '0', '2', '6', '10', '10', '10', '2', '2', '', '']
col3.extract_text()='42 218.00 9,15\n9,156.0'
col1.extract_text()='414516 0YA0G\n1000 BLACK'
MBELT W.30 GG MAR. PLUTONE CALF
['65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '135']
['0', '0', '0', '2', '6', '15', '15', '15', '2', '2', '0', '0', '']
col3.extract_text()='57 205.00 11,6\n11,685.0'
col1.extract_text()='406831 0YA0G\n1000 BLACK'
MBELT W.40 GG MAR PLUTONE CALF
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '2', '2', '2', '2', '2', '1', '', '']
col3.extract_text()='11 218.00 2,39\n2,398.0'
col1.extract_text()='627055 92TIN\n9769 B.EBONY/NERO'
MBELT W.37GG M.R T.GG SUPREME/PLUTONE CALF
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '3', '3', '3', '3', '3', '1', '', '']
col3.extract_text()='16 244.00 3,90\n3,904.0'
There are various ways you could approach this, but the steps I've used here are:
- use Description header line to identify the "top" of the rows if present
- split into rows based on the thick/dark horizontal lines
- split each row into columns based on vertical lines within row
- use the method above for creating a table from column 2
These are the horizontal lines we can use to split into rows:
<blockquote>
</blockquote>
These are the vertical lines we can use to divide the columns:
<blockquote>
</blockquote>
import itertools
# First thick horizontal line that span > width % of page
product_line = next(
line for line in page.horizontal_edges
if line['orientation'] == 'h'
and line['linewidth'] > 1
and line['width'] > page.width / 1.25
)
# Search for Description
description = page.search('Description/Size Quantity Qty Price Value')
has_description = len(description) > 0
# If there is a description we crop there, else we use the line divider
if has_description:
description = description[0]
product_area_top = description['bottom'] + 10
product_area = page.crop(
(product_line['x0'], product_area_top, product_line['x1'], page.height)
)
else:
product_area_top = product_line['top']
product_area = page.crop(
(product_line['x0'], product_area_top, product_line['x1'], page.height)
)
# find horizontal lines for rows
hlines = [
line['top'] for line in product_area.edges
if line['orientation'] == 'h'
and line['stroking_color'] == (0, 0, 0)
and line['width'] > product_area.width / 1.25
]
# If there is no description on the page we need to add in the top as the first line (in order to extract row 1)
if has_description is False:
hlines = [product_area_top] + hlines
# Make sure our lines are sorted from top -> bottom
hlines = sorted(set(hlines))
for top, bottom in itertools.pairwise(hlines):
row = product_area.crop(
(product_area.bbox[0], top, product_area.width, bottom)
)
# vertical lines to create columns
vlines = [
line['x0'] for line in row.vertical_edges
if line['object_type'] == 'line'
]
# we need to add an end line to extract last column
vlines = sorted(vlines + [row.width])
col1 = row.crop((vlines[0], top, vlines[1], bottom))
col2 = row.crop((vlines[1], top, vlines[2], bottom))
col3 = row.crop((vlines[2], top, vlines[3], bottom))
lines = col2.extract_text_lines()
# lines 1-2 are the values, use their positions to crop
bbox = lines[1]['x0'], lines[1]['top'], lines[-1]['x1'], lines[-2]['bottom']
values = col2.crop(bbox)
# use start of each word as a vertical line edge
vlines = [word['x0'] for word in values.extract_words()]
table = values.extract_table(dict(
explicit_vertical_lines = vlines
))
print(f'{col1.extract_text()=}')
print(lines[0]['text'], table[0], table[1], sep='\n')
print(f'{col3.extract_text()=}')
You may be able to extract the other values simply from the text extraction methods, or you could use a similar cropping technique on col1
, col3
.
答案2
得分: 0
以下是您要翻译的内容:
"Here's another possible approach which may be simpler."
"You can match just those specific numbers and save the x0
position so you have the full 'width'."
"You can then create a full 'blank' row which you can merge with each row to ensure each row gets the same number of columns."
"import itertools
import pdfplumber
import re
from operator import itemgetter
pdf = ...
page = ...
nums = [
word for word in page.extract_words(extra_attrs=['size', 'fontname'])
if re.fullmatch('\d+', word['text'])
and word['size'] == 6.75
and word['fontname'].endswith('Bold') is False
]
remove duplicates based on x0
cols = {}
for num in nums:
cols[num['x0']] = num
replace text with blank
blanks = {
col['x0']: col | {'text': ''} for col in cols.values()
}
use 'top' position to group into rows
for x0, row in itertools.groupby(nums, key=itemgetter('top')):
row = blanks | { col['x0']: col for col in row }
print([col['text'] for col in row.values()])"
"[70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, '', '']"
["", "", "2", "6", "10", "10", "10", "2", "2", "", "", "", ""]
["60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120"]
["", "", "0", "0", "2", "6", "10", "10", "10", "2", "2", "", ""]
["65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 135"]
["0", "0", "0", "2", "6", "15", "15", "15", "2", "2", "0", "0", ""]
["60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120"]
["", "", "", "", "", "2", "2", "2", "2", "2", "1", "", ""]
["60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120"]
["", "", "", "", "", "3", "3", "3", "3", "3", "1", "", ""]"
英文:
Here's another possible approach which may be simpler.
You can match just those specific numbers and save the x0
position so you have the full "width".
<blockquote>
</blockquote>
You can then create a full "blank" row which you can merge with each row to ensure each row gets the same number of columns.
import itertools
import pdfplumber
import re
from operator import itemgetter
pdf = ...
page = ...
nums = [
word for word in page.extract_words(extra_attrs=['size', 'fontname'])
if re.fullmatch('\d+', word['text'])
and word['size'] == 6.75
and word['fontname'].endswith('Bold') is False
]
# remove duplicates based on x0
cols = {}
for num in nums:
cols[num['x0']] = num
# replace text with blank
blanks = {
col['x0']: col | {'text': ''} for col in cols.values()
}
# use 'top' position to group into rows
for x0, row in itertools.groupby(nums, key=itemgetter('top')):
row = blanks | { col['x0']: col for col in row }
print([col['text'] for col in row.values()])
['70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '', '']
['', '', '2', '6', '10', '10', '10', '2', '2', '', '', '', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '0', '0', '2', '6', '10', '10', '10', '2', '2', '', '']
['65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120', '135']
['0', '0', '0', '2', '6', '15', '15', '15', '2', '2', '0', '0', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '2', '2', '2', '2', '2', '1', '', '']
['60', '65', '70', '75', '80', '85', '90', '95', '100', '105', '110', '115', '120']
['', '', '', '', '', '3', '3', '3', '3', '3', '1', '', '']
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论