如何从PDF中提取表格数据?

huangapple go评论98阅读模式
英文:

How to extract table data from PDF?

问题

我的目标是将一个.pdf文件处理到内存中。问题是输出忽略了表格,导致字符串连接的结果。

使用的库:https://github.com/ledongthuc/pdf

代码如下:

package main

import (
	"bytes"
	"fmt"

	"github.com/ledongthuc/pdf"
)

func main() {
	pdf.DebugOn = true
	content, err := readPdf("accountnumberJul2022.pdf") // 读取本地pdf文件
	if err != nil {
		panic(err)
	}
	fmt.Println(content)
	return
}

func readPdf(path string) (string, error) {
	f, r, err := pdf.Open(path)
	// 记得关闭文件
	defer f.Close()
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	b, err := r.GetPlainText()
	if err != nil {
		return "", err
	}
	buf.ReadFrom(b)
	return buf.String(), nil
}

PDF文件:https://drive.google.com/file/d/14RFll7pZ8_J8ua-NDrw31QHe-4N16IJL/view?usp=sharing

输出结果如下:

日期描述借方金额余额01/07期初余额1,000.0002/07TRSFE-BANKINGDB0207/DBXOA/SB24313/Q0321XXXXX56LAWSON1 DB999.0003/07TRSFE-BANKINGDB0307/DBXOA/SB24313/Q0321XXXXX56LAWSON2 DB997.0004/07TRSFE-BANKINGDB0407/DBXOA/SB24313/Q0321XXXXX56LAWSON3 DB994.0005/07TRSFE-BANKINGDB0507/DBXOA/SB24313/Q0321XXXXX56LAWSON4 DB990.0006/07TRSFE-BANKINGDB0607/DBXOA/SB24313/Q0321XXXXX56LAWSON5 DB985.0007/07TRSFE-BANKINGDB0707/DBXOA/SB24313/Q0321XXXXX56LAWSON6 DB979.0008/07TRSFE-BANKINGDB0807/DBXOA/SB24313/Q0321XXXXX56LAWSON7 DB972.0009/07TRSFE-BANKINGDB0907/DBXOA/SB24313/Q0321XXXXX56LAWSON8 DB964.00续下一页日期描述借方金额余额10/07TRSFE-BANKINGDB1007/DBXOA/SB24313/Q0321XXXXX56LAWSON9 DB955.0011/07TRSFE-BANKINGDB1107/DBXOA/SB24313/Q0321XXXXX56LAWSON10 DB945.0012/07TRSFE-BANKINGDB1207/DBXOA/SB24313/Q0321XXXXX56LAWSON11 DB934.0013/07TRSFE-BANKINGDB1307/DBXOA/SB24313/Q0321XXXXX56LAWSON12 DB922.0014/07TRSFE-BANKINGDB1407/DBXOA/SB24313/Q0321XXXXX56LAWSON13 DB909.0015/07TRSFE-BANKINGDB1507/DBXOA/SB24313/Q0321XXXXX56LAWSON14 DB895.0016/07利息1517/07利息税1909.00

我尝试过:

  1. 我也尝试了按行分组读取文本的示例,并将fmt.Println(word.S)更改为fmt.Print(word.S)

然而,输出结果更加难以阅读:

>>> 行:  0
ATEDESCRIPTIONBRENTRYBALANCEBe00.ginning Balance1,000.00469NOSWTRSFE-BANKINGDB0207/DBXOA/SB24313/Q0321XXXXX56LAWSONAL659X99.00XXXXXTRSFE-BANKINGDB0307/DBXOA/SB24313/Q0321XXXXX56LAWSON1230997.00Q/313TRSFE-BANKINGDB0407/DBXOA/SB24313/Q0321XXXXX56LAWSON42BS994.00/AOXBTRSFE-BANKINGDB0507/DBXOA/SB24313/Q0321XXXXD56LAWSOND/70990.0090BDGTRSFE-BANKINGDB0607/DBXOA/SB24313/Q0321XXXXX56LAWSONNIKN985.00AB-EFTRSFE-BANKINGDB0707/DBXOA/SB24313/Q0321XXXXX56LAWSONSRT0979.000.279TRSFE-BANKINGDB0807/DBXOA/SB24313/Q0321XXXXX56LAWSON701/70080009/0770270/70/0770/6003/070/507/4/03 DBBD 2B4 DB5D 1 DB68 DB DBBD 7oontinued Cn next page>>> 行:  0
TATEDESCRIPTIONBRENTRYBALANCE00.90TRSFE-BANKINGDB1007/DBXOA/SB24313/Q0321XXXXX56LAWSON9XAT955.00TSERETRSFE-BANKINGDB1107/DBXOA/SB24313/Q0321XXXXX56LAWSONTNITS945.00ERETNTRSFE-BANKINGDB1207/DBXOA/SB24313/Q0321XXXXX56LAWSONI00.5934.0098NO/DSRSFE-BANKINGDB1307/DBXOA/SB24313/Q0321XXXXX56LAWSONWAL65922.00XXXXXTRSFE-BANKINGDB1407DBXOA/SB24313/Q0321XXXXX56LAWSON1230Q909.00/3134TRSFE-BANKINGDB1507/DBXOA/SB20/5110/0770/410707/60170//7137712/107/1141B3 BDBDDBD 1 0112 BD 11D9B 511
英文:

My goal is to process a .pdf file to the memory. The problem is the output ignore the table, which results of concanated string.

The library used: https://github.com/ledongthuc/pdf

The code

package main

import (
	"bytes"
	"fmt"

	"github.com/ledongthuc/pdf"
)

func main() {
	pdf.DebugOn = true
	content, err := readPdf("accountnumberJul2022.pdf") // Read local pdf file
	if err != nil {
		panic(err)
	}
	fmt.Println(content)
	return
}

func readPdf(path string) (string, error) {
	f, r, err := pdf.Open(path)
	// remember close file
	defer f.Close()
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	b, err := r.GetPlainText()
	if err != nil {
		return "", err
	}
	buf.ReadFrom(b)
	return buf.String(), nil
}

The PDF file: https://drive.google.com/file/d/14RFll7pZ8_J8ua-NDrw31QHe-4N16IJL/view?usp=sharing

The output

DATEDESCRIPTIONBRENTRYBALANCE01/07Beginning Balance1,000.0002/07TRSFE-BANKINGDB0207/DBXOA/SB24313/Q0321XXXXX56LAWSON1 DB999.0003/07TRSFE-BANKINGDB0307/DBXOA/SB24313/Q0321XXXXX56LAWSON2 DB997.0004/07TRSFE-BANKINGDB0407/DBXOA/SB24313/Q0321XXXXX56LAWSON3 DB994.0005/07TRSFE-BANKINGDB0507/DBXOA/SB24313/Q0321XXXXX56LAWSON4 DB990.0006/07TRSFE-BANKINGDB0607/DBXOA/SB24313/Q0321XXXXX56LAWSON5 DB985.0007/07TRSFE-BANKINGDB0707/DBXOA/SB24313/Q0321XXXXX56LAWSON6 DB979.0008/07TRSFE-BANKINGDB0807/DBXOA/SB24313/Q0321XXXXX56LAWSON7 DB972.0009/07TRSFE-BANKINGDB0907/DBXOA/SB24313/Q0321XXXXX56LAWSON8 DB964.00Continued on next pageDATEDESCRIPTIONBRENTRYBALANCE10/07TRSFE-BANKINGDB1007/DBXOA/SB24313/Q0321XXXXX56LAWSON9 DB955.0011/07TRSFE-BANKINGDB1107/DBXOA/SB24313/Q0321XXXXX56LAWSON10 DB945.0012/07TRSFE-BANKINGDB1207/DBXOA/SB24313/Q0321XXXXX56LAWSON11 DB934.0013/07TRSFE-BANKINGDB1307/DBXOA/SB24313/Q0321XXXXX56LAWSON12 DB922.0014/07TRSFE-BANKINGDB1407/DBXOA/SB24313/Q0321XXXXX56LAWSON13 DB909.0015/07TRSFE-BANKINGDB1507/DBXOA/SB24313/Q0321XXXXX56LAWSON14 DB895.0016/07INTEREST1517/07INTERESTTAX1909.00

What I've tried:

  1. I also tried the example for Read text grouped by rows and change the fmt.Println(word.S) to fmt.Print(word.S)

However, the output is more unreadable

>>>> row:  0
ATEDESCRIPTIONBRENTRYBALANCEBe00.ginning Balance1,000.00469NOSWTRSFE-BANKINGDB0207/DBXOA/SB24313/Q0321XXXXX56LAWSONAL659X99.00XXXXXTRSFE-BANKINGDB0307/DBXOA/SB24313/Q0321XXXXX56LAWSON1230997.00Q/313TRSFE-BANKINGDB0407/DBXOA/SB24313/Q0321XXXXX56LAWSON42BS994.00/AOXBTRSFE-BANKINGDB0507/DBXOA/SB24313/Q0321XXXXD56LAWSOND/70990.0090BDGTRSFE-BANKINGDB0607/DBXOA/SB24313/Q0321XXXXX56LAWSONNIKN985.00AB-EFTRSFE-BANKINGDB0707/DBXOA/SB24313/Q0321XXXXX56LAWSONSRT0979.000.279TRSFE-BANKINGDB0807/DBXOA/SB24313/Q0321XXXXX56LAWSON701/70080009/0770270/70/0770/6003/070/507/4/03 DBBD 2B4 DB5D 1 DB68 DB DBBD 7oontinued Cn next page>>>> row:  0
TATEDESCRIPTIONBRENTRYBALANCE00.90TRSFE-BANKINGDB1007/DBXOA/SB24313/Q0321XXXXX56LAWSON9XAT955.00TSERETRSFE-BANKINGDB1107/DBXOA/SB24313/Q0321XXXXX56LAWSONTNITS945.00ERETNTRSFE-BANKINGDB1207/DBXOA/SB24313/Q0321XXXXX56LAWSONI00.5934.0098NO/DSRSFE-BANKINGDB1307/DBXOA/SB24313/Q0321XXXXX56LAWSONWAL65922.00XXXXXTRSFE-BANKINGDB1407DBXOA/SB24313/Q0321XXXXX56LAWSON1230Q909.00/3134TRSFE-BANKINGDB1507/DBXOA/SB20/5110/0770/410707/60170//7137712/107/1141B3 BDBDDBD 1 0112 BD 11D9B 511

答案1

得分: 2

PDF文件的设计初衷并不是为了机器可读。而且PDF文件的结构可能因文件而异。因此,我怀疑是否有任何解析任意PDF文件的“可靠解决方案”。PDF文件并不像原始电子表格文件那样“结构化”,PDF文件更像是矢量图形,它只包含位置和命令,用于在正确的位置绘制字符,而不包含字符本身。

在你的情况下,你的特定PDF文件似乎结构较好。使用qpdf提取内容可以得到以下结果:

# 提取的部分PDF内容,我添加了注释(#)。
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0027> Tj  # D
10.5842743 0 Td <0024> Tj   # A
8.6870575 0 Td <0037> Tj    # T
8.9526215 0 Td <0028> Tj    # E
ET
Q
Q
q
147.75 87.296265 149.25 23.148926 re
W* n
q
.75 0 0 .75 152.25 92.546265 cm
/G3 gs
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0027> Tj    # D
10.5842743 0 Td <0028> Tj     # E
9.7756042 0 Td <0036> Tj      # S
9.7756042 0 Td <0026> Tj      # C
10.5842743 0 Td <0035> Tj     # R
10.5842743 0 Td <002C> Tj     # I
4.0719757 0 Td <0033> Tj      # P
9.7756042 0 Td <0037> Tj      # T
8.9526215 0 Td <002C> Tj      # I
4.0719757 0 Td <0032> Tj      # O
11.4001007 0 Td <0031> Tj     # N
ET

# 省略部分内容......

BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0037> Tj    # T 
8.9526215 0 Td <0035> Tj      # R
10.5842743 0 Td <0036> Tj     # S
9.7756042 0 Td <0029> Tj      # F
ET
Q
q
.75 0 0 .75 152.25 152.993042 cm
/G3 gs
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0028> Tj    # E
9.7756042 0 Td <0010> Tj      # -
4.8806458 0 Td <0025> Tj      # B
9.7756042 0 Td <0024> Tj      # A
9.7756042 0 Td <0031> Tj      # N
10.5842743 0 Td <002E> Tj     # K
9.7756042 0 Td <002C> Tj      # I
4.0719757 0 Td <0031> Tj      # N
10.5842743 0 Td <002A> Tj     # G
ET
Q
q
.75 0 0 .75 152.25 165.641968 cm
/G3 gs
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0027> Tj    # D
10.5842743 0 Td <0025> Tj     # B
ET

BT = 开始文本
ET = 结束文本

编写一个类似于你的库的程序,使用https://github.com/ledongthuc/pdf提取在一对BT和ET之间的所有内容作为单个文本应该很简单。**困难的部分是恢复电子表格的列和行信息(即哪个文本属于哪个字段)。**因为在PDF阅读器的眼中,电子表格行只是一堆任意的线条,有时也可能是任意的矩形。

这是我编写的一个演示程序,首先找到所有的矩形,然后将所有的文本放入相应的矩形中,然后对每个字段中的文本进行排序和连接,形成最终结果。

func readPdf(path string) {
	r, err := pdf.Open(path)
	panic(err)

	// 提取所有矩形
	var fieldRects []FieldRect
	p := r.Page(1)
	c := p.Content()
	// font := p.Font(p.Fonts()[0])
	// fmt.Printf("font.Widths(): %v\n", font.Widths())
	for _, r := range c.Rect {
		fieldRects = append(fieldRects, FieldRect{
			rect:  r,
			texts: nil,
		})
	}

	// 将文本(字形)放入相应的矩形中
	for _, t := range c.Text {
		for i := range fieldRects {
			fr := &fieldRects[i]
			if fr.rect.Min.X < t.X && fr.rect.Min.Y < t.Y &&
				fr.rect.Max.X > t.X && fr.rect.Max.Y > t.Y {
				fr.texts = append(fr.texts, t)
			}
		}
	}

	// 这些值也可以通过字体大小推导出来,以获得更强的鲁棒性
	const NEWLINE_TOLERATION = 2
	// 不幸的是,你发送的PDF文件没有正确的字体宽度信息,所以这是我们在不推断字形本身的宽度信息的情况下能得到的最好结果。
	const SPACE_TOLERATION = 11

	// 按照矩形内的位置对文本(字形)进行排序,然后连接
	for i := range fieldRects {
		fr := &fieldRects[i]
		sort.Slice(fr.texts, func(i, j int) bool {
			deltaY := fr.texts[i].Y - fr.texts[j].Y
			if math.Abs(deltaY) < NEWLINE_TOLERATION { // 容忍一些垂直偏差
				return fr.texts[i].X < fr.texts[j].X // 在同一行上
			}
			return deltaY > 0 // 不在同一行上
		})
		for _, f := range fr.texts {
			if fr.lastPos != nil {
				if fr.lastPos.Y-f.Y > NEWLINE_TOLERATION { // 新行
					fr.resultText += "\n"
				}
				if f.X-fr.lastPos.X > SPACE_TOLERATION { // 空格
					fr.resultText += " "
				}
			}
			fr.resultText += f.S
			fr.lastPos = &pdf.Point{X: f.X, Y: f.Y}
		}
		if fr.resultText == "" {
			continue
		}
		fmt.Printf("====== pos: %v, %v; text: \n%s\n", fr.rect.Min, fr.rect.Max, fr.resultText)
	}
}

由于你发送的PDF文件中缺少字体宽度信息,所以很难实现可靠的空格检测。这个程序产生了一个有些可读但不是很好的结果:

====== pos: {0 0}, {794 1123}; text: 
DATE DESCRIPTION BR ENTRY BALANCE
01/07 Beginning Balance 1,000.00
02/07 TRSF 0207/DBXO 1 DB 999.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
03/07 TRSF 0307/DBXO 2 DB 997.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
04/07 TRSF 0407/DBXO 3 DB 994.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
05/07 TRSF 0507/DBXO 4 DB 990.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
06/07 TRSF 0607/DBXO 5 DB 985.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
07/07 TRSF 0707/DBXO 6 DB 979.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
08/07 TRSF 0807/DBXO 7 DB 972.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
09/07 TRSF 0907/DBXO 8 DB 964.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
Continued on next page
====== pos: {372.75 87.296265}, {447 110.44519100000001}; text: 
Continue
====== pos: {447.75 87.296265}, {522 110.44519100000001}; text: 
d on next page
====== pos: {147.75 111.19519}, {297 134.34411599999999}; text: 
X56
LAWSON
====== pos: {147.75 135.094116}, {222 183.540893}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 135.094116}, {297 208.83874500000002}; text: 
X56
LAWSON
0907/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 209.58875}, {222 258.035527}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 209.58875}, {297 283.33337900000004}; text: 
X56
LAWSON
0807/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 284.08337}, {222 332.530147}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 284.08337}, {297 357.827999}; text: 
X56
LAWSON
0707/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 358.578}, {222 407.024777}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 358.578}, {297 432.322629}; text: 
X56
LAWSON
0607/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 433.07263}, {222 481.519407}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 433.07263}, {297 506.81725900000004}; text: 
X56
LAWSON
0507/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 507.56726}, {222 556.0140369999999}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 507.56726}, {297 581.311889}; text: 
X56
LAWSON
0407/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 582.06189}, {222 630.508667}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 582.06189}, {297 655.806519}; text: 
X56
LAWSON
0307/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 656.55652}, {222 705.003297}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 656.55652}, {297 730.301149}; text: 
nce
0207/DBXO
A/SB24313/
Q0321XXXX

这种方法无法很好地工作的另一个原因是从文档中提取的矩形与PDF阅读器中单元格的外观不匹配。

**实际上,我相信如果你能找到一种方法来找到所有的单元格,并将文档页面划分为正确位置和大小的单元格(无论是手动还是自动),这种方法几乎可以完美地工作。**也许可以使用将一对BT和ET之间的任何内容视为单个文本块的技巧,而不是依赖于字形之间的相对位置来实现。

但请记住,即使你成功做到了这一点,这仍然只适用于特定的格式和PDF创建软件生成的特定“风格”,并且很可能无法很好地适用于其他情况。

还有一些商业解决方案,比如https://www.ilovepdf.com/pdf_to_excel背后的框架,我发现它在你的特定文档中效果很好。这些解决方案往往更加稳定可靠,但需要购买。其中一些可以在线使用,所以也许找到一种使用它们的API的方法可能是一个可行的选择(如果他们的服务条款允许你这样做的话)。

英文:

PDF files isn't meant to be machine-readable by design. And PDFs and how they are structured can vary quite a bit from one to another. So I doubt there will be any "solid solution" for parsing arbitrary PDF files. PDF files aren't necessary "structured" like your original spreadsheet file where the PDF might come from. It's more like vector graphics in the sense that it only contains positions and commands to plot the characters in the right place instead of containing the texts themselves.

In your case, your particular PDF file seems somewhat well-structured. Extracting the content using qpdf reveals that:

# part of the pdf content extracted, comments (#) added by me.
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0027> Tj  # D
10.5842743 0 Td <0024> Tj   # A
8.6870575 0 Td <0037> Tj    # T
8.9526215 0 Td <0028> Tj    # E
ET
Q
Q
q
147.75 87.296265 149.25 23.148926 re
W* n
q
.75 0 0 .75 152.25 92.546265 cm
/G3 gs
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0027> Tj    # D
10.5842743 0 Td <0028> Tj     # E
9.7756042 0 Td <0036> Tj      # S
9.7756042 0 Td <0026> Tj      # C
10.5842743 0 Td <0035> Tj     # R
10.5842743 0 Td <002C> Tj     # I
4.0719757 0 Td <0033> Tj      # P
9.7756042 0 Td <0037> Tj      # T
8.9526215 0 Td <002C> Tj      # I
4.0719757 0 Td <0032> Tj      # O
11.4001007 0 Td <0031> Tj     # N
ET

# some part skipped......

BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0037> Tj    # T 
8.9526215 0 Td <0035> Tj      # R
10.5842743 0 Td <0036> Tj     # S
9.7756042 0 Td <0029> Tj      # F
ET
Q
q
.75 0 0 .75 152.25 152.993042 cm
/G3 gs
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0028> Tj    # E
9.7756042 0 Td <0010> Tj      # -
4.8806458 0 Td <0025> Tj      # B
9.7756042 0 Td <0024> Tj      # A
9.7756042 0 Td <0031> Tj      # N
10.5842743 0 Td <002E> Tj     # K
9.7756042 0 Td <002C> Tj      # I
4.0719757 0 Td <0031> Tj      # N
10.5842743 0 Td <002A> Tj     # G
ET
Q
q
.75 0 0 .75 152.25 165.641968 cm
/G3 gs
BT
/F4 14.666667 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <0027> Tj    # D
10.5842743 0 Td <0025> Tj     # B
ET

BT = Begin Text
ET = End Text

Writing a program similar to your library used https://github.com/ledongthuc/pdf or directly modifying the library to parse anything between one pair of BT and ET as a single text should be trivial. The hard part is recovering the column and row information of the spreadsheet (ie. which text belongs to which field). Since in the eye of the PDF reader, the spreadsheet lines are just a bunch of arbitrary lines and maybe sometimes arbitrary rectangles as well.

Here's a demo program I wrote that firstly finds all rectangles, then put all texts into their corresponding rectangles, then sort & concat the texts within each fields to form the final result.

func readPdf(path string) {
	r, err := pdf.Open(path)
	panic(err)

	// extract all rectangles
	var fieldRects []FieldRect
	p := r.Page(1)
	c := p.Content()
	// font := p.Font(p.Fonts()[0])
	// fmt.Printf("font.Widths(): %v\n", font.Widths())
	for _, r := range c.Rect {
		fieldRects = append(fieldRects, FieldRect{
			rect:  r,
			texts: nil,
		})
	}

	// put text(glyph) into their corresponding rectangles
	for _, t := range c.Text {
		for i := range fieldRects {
			fr := &fieldRects[i]
			if fr.rect.Min.X < t.X && fr.rect.Min.Y < t.Y &&
				fr.rect.Max.X > t.X && fr.rect.Max.Y > t.Y {
				fr.texts = append(fr.texts, t)
			}
		}
	}

	// these values can also be derived from font size to gain
	// even more robustness
	const NEWLINE_TOLERATION = 2
	// unfortunately the pdf you sent does not have proper font
	// width information, so this is the best we can get without
	// inferring width information from the glyph shape itself.
	const SPACE_TOLERATION = 11

	// sort text(glyph) by position within rectangles, then concat
	for i := range fieldRects {
		fr := &fieldRects[i]
		sort.Slice(fr.texts, func(i, j int) bool {
			deltaY := fr.texts[i].Y - fr.texts[j].Y
			if math.Abs(deltaY) < NEWLINE_TOLERATION { // tolerate some vertical deviation
				return fr.texts[i].X < fr.texts[j].X // on the same line
			}
			return deltaY > 0 // not on the same line
		})
		for _, f := range fr.texts {
			if fr.lastPos != nil {
				if fr.lastPos.Y-f.Y > NEWLINE_TOLERATION { // new line
					fr.resultText += "\n"
				}
				if f.X-fr.lastPos.X > SPACE_TOLERATION { // space
					fr.resultText += " "
				}
			}
			fr.resultText += f.S
			fr.lastPos = &pdf.Point{X: f.X, Y: f.Y}
		}
		if fr.resultText == "" {
			continue
		}
		fmt.Printf("====== pos: %v, %v; text: \n%s\n", fr.rect.Min, fr.rect.Max, fr.resultText)
	}
}

Due to the lack of font width information in the PDF file you sent, there's no easy way to implement reliable space detection. This program yielded a somewhat readable but not great result:

====== pos: {0 0}, {794 1123}; text: 
DATE DESCRIPTION BR ENTRY BALANCE
01/07 Beginning Balance 1,000.00
02/07 TRSF 0207/DBXO 1 DB 999.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
03/07 TRSF 0307/DBXO 2 DB 997.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
04/07 TRSF 0407/DBXO 3 DB 994.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
05/07 TRSF 0507/DBXO 4 DB 990.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
06/07 TRSF 0607/DBXO 5 DB 985.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
07/07 TRSF 0707/DBXO 6 DB 979.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
08/07 TRSF 0807/DBXO 7 DB 972.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
09/07 TRSF 0907/DBXO 8 DB 964.00
E-BANKING A/SB24313/
DB Q0321XXXX
X56
LAWSON
Continued on next page
====== pos: {372.75 87.296265}, {447 110.44519100000001}; text: 
Continue
====== pos: {447.75 87.296265}, {522 110.44519100000001}; text: 
d on next page
====== pos: {147.75 111.19519}, {297 134.34411599999999}; text: 
X56
LAWSON
====== pos: {147.75 135.094116}, {222 183.540893}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 135.094116}, {297 208.83874500000002}; text: 
X56
LAWSON
0907/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 209.58875}, {222 258.035527}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 209.58875}, {297 283.33337900000004}; text: 
X56
LAWSON
0807/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 284.08337}, {222 332.530147}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 284.08337}, {297 357.827999}; text: 
X56
LAWSON
0707/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 358.578}, {222 407.024777}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 358.578}, {297 432.322629}; text: 
X56
LAWSON
0607/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 433.07263}, {222 481.519407}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 433.07263}, {297 506.81725900000004}; text: 
X56
LAWSON
0507/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 507.56726}, {222 556.0140369999999}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 507.56726}, {297 581.311889}; text: 
X56
LAWSON
0407/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 582.06189}, {222 630.508667}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 582.06189}, {297 655.806519}; text: 
X56
LAWSON
0307/DBXO
A/SB24313/
Q0321XXXX
====== pos: {147.75 656.55652}, {222 705.003297}; text: 
TRSF
E-BANKING
DB
====== pos: {222.75 656.55652}, {297 730.301149}; text: 
nce
0207/DBXO
A/SB24313/
Q0321XXXX

Another reason for this to not work as well is the fact that the rectangles extracted from the document does not match how the cells looks in a PDF reader.

In fact I believe this could work near perfectly if you could figure out a way to find all the cells and divide the document page into correctly positioned and sized cells. (either manually or automatically). And maybe use the trick of treating anything between a pair of BT and ET as a single piece of text instead of relying on the relative positions between glyphs to do that.

But bear in mind that even if you managed to do that, this will still only work for this particular format and (for lack of a better word) "flavour" of PDF created by this particular software, and will be unlikely to work well on anything else.

There is also commercial solutions available, like the framework behind https://www.ilovepdf.com/pdf_to_excel which I find working quite well for your particular document. These solutions tends to be more robust and reliable but do cost you money to purchase. Some of them can be used online though so maybe figuring a way to use their api might be a viable alternative. (if their terms of service allows you to do so, that is.)

huangapple
  • 本文由 发表于 2022年8月25日 17:36:22
  • 转载请务必保留本文链接:https://go.coder-hub.com/73485235.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定