How to extract plain text from a PDF in Go

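Question

I want to extract text from a PDF file using Go. I tried the ledongthuc/pdf Go package, which implements a GetPlainText() method for getting plain text content without formatting. But I don't get plain text. This is my result: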

 W
 S
 D
 V
 Y R
 O
 R
 Q
 W
 D
 L
 U
 H
 P
 H
 Q
 W
......

Go code

package main

import (
    "bytes"
    "fmt"

    "github.com/ledongthuc/pdf"
)

func main() {
    content, err := readPdf("test.pdf")
    if err != nil {
        panic(err)
    }
    fmt.Println(content)
}

// readPdf returns the plain text of every page in the PDF at path.
func readPdf(path string) (string, error) {
    r, err := pdf.Open(path)
    if err != nil {
        return "", err
    }
    totalPage := r.NumPage()

    var textBuilder bytes.Buffer
    // Pages are numbered starting at 1.
    for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
        p := r.Page(pageIndex)
        if p.V.IsNull() {
            continue // skip pages that are missing from the document
        }
        textBuilder.WriteString(p.GetPlainText("\n"))
    }
    return textBuilder.String(), nil
}

Answer 1

Score: 2

You can get output such as "Example of a pdf document." instead of

Ex
a
m
pl
e

of

a

pd
f

doc
u
m
e
nt
.

What you need to do is change textBuilder.WriteString(p.GetPlainText("\n")) to

textBuilder.WriteString(p.GetPlainText(""))
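To illustrate why the separator matters, here is a minimal sketch that does not use the pdf package at all. It assumes the page text arrives as a sequence of small fragments and that the separator is written between them, which is what the output above suggests. The joinFragments helper and the fragment list are made up for the illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// joinFragments is a hypothetical stand-in for what a text extractor does:
// it concatenates text fragments, placing sep between each pair.
func joinFragments(fragments []string, sep string) string {
	return strings.Join(fragments, sep)
}

func main() {
	// Fragments copied from the broken output above (assumed, not read from a PDF).
	fragments := []string{
		"Ex", "a", "m", "pl", "e", " ", "of", " ", "a", " ",
		"pd", "f", " ", "doc", "u", "m", "e", "nt", ".",
	}

	// A newline separator puts every fragment on its own line.
	fmt.Println(joinFragments(fragments, "\n"))

	// An empty separator joins the fragments back into one sentence.
	fmt.Println(joinFragments(fragments, ""))
}
```

The first call reproduces the broken layout, and the second prints the sentence on a single line.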

I hope this helps.


huangapple
  • Posted on June 15, 2017 at 14:34:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/44560265.html