How to extract text from a PDF using Go?
Question
I am trying to extract text from a PDF file in Go. See the code below. For some reason, it prints complete garbage (what look like random numbers). Here is the PDF. I believe the text can be extracted, since I am able to copy and paste it from this file.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"strings"

	pdf "github.com/unidoc/unipdf/v3/model"
)

func main() {
	fmt.Println("Enter URL of PDF file:")
	reader := bufio.NewReader(os.Stdin)
	url, err := reader.ReadString('\n')
	if err != nil {
		log.Fatal(err)
	}
	url = strings.TrimSpace(url)

	// Fetch PDF from URL.
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	buf, _ := ioutil.ReadAll(resp.Body)
	pdfReader, err := pdf.NewPdfReader(bytes.NewReader(buf))
	if err != nil {
		log.Fatal(err)
	}

	// Parse PDF file.
	isEncrypted, err := pdfReader.IsEncrypted()
	if err != nil {
		log.Fatal(err)
	}

	// If PDF is encrypted, exit with message.
	if isEncrypted {
		fmt.Println("Error: PDF is encrypted.")
		os.Exit(1)
	}

	// Get number of pages.
	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		log.Fatal(err)
	}

	// Iterate through pages and print text.
	for i := 1; i <= numPages; i++ {
		page, err := pdfReader.GetPage(i)
		if err != nil {
			log.Fatal(err)
		}
		text, err := page.GetAllContentStreams()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(text)
	}
}
Answer 1
Score: 3
GetAllContentStreams may return formatting, graphics, images, and other objects on the page in addition to the text, which is probably why it prints complete garbage (some random numbers).
> GetAllContentStreams gets all the content streams for a page as one
> string
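To see why a raw content stream looks like random numbers, it helps to look at what one contains. The sketch below is purely illustrative: sampleStream is a hand-written stream, and literalStrings is a hypothetical helper (not part of unipdf) that handles only the simplest case, a literal string shown with the Tj operator. Real PDFs also use hex strings, TJ arrays, custom font encodings, and compressed streams, so this is not a substitute for a real extractor.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sampleStream is a hand-written example of an uncompressed page content
// stream. Most tokens are numeric operands and drawing operators; the
// visible text is only the two parenthesized strings.
const sampleStream = `q
0.9 0 0 0.9 36 36 cm
BT
/F1 12 Tf
72 720 Td
(ALL INDIA TENNIS ASSOCIATION) Tj
0 -14 Td
(As on 24TH April , 2023) Tj
ET
Q`

// tjLiteral matches a literal string operand followed by the Tj operator.
// Assumption: no nested or escaped parentheses inside the string.
var tjLiteral = regexp.MustCompile(`\(([^()\\]*)\)\s*Tj`)

// literalStrings is a hypothetical helper that returns the text of every
// simple "(...) Tj" operation found in a content stream.
func literalStrings(stream string) []string {
	var out []string
	for _, m := range tjLiteral.FindAllStringSubmatch(stream, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	// Printing sampleStream directly would show the operators and numbers;
	// printing only the Tj strings shows the text a reader would see.
	fmt.Println(strings.Join(literalStrings(sampleStream), "\n"))
}
```

Everything in the stream other than the two parenthesized strings is positioning and drawing state, which is roughly what ends up on screen when the whole stream is printed.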
Instead of GetAllContentStreams, we can use the ExtractText method to extract the text.
> ExtractText processes and extracts all text data in content streams
> and returns as a string.
Note that the package requires a license API key to use.
https://github.com/unidoc/unipdf
> This software package (unipdf) is a commercial product and requires a
> license code to operate.
>
> To Get a Metered License API Key in for free in the Free Tier, sign up
> on https://cloud.unidoc.io
The unipdf example code can be found here.
Here is the updated code:
import (
	// In addition to the imports already in your code:
	"github.com/unidoc/unipdf/v3/common/license"
	"github.com/unidoc/unipdf/v3/extractor"
)

func init() {
	// Make sure to load your metered License API key prior to using the library.
	// If you need a key, you can sign up and create a free one at https://cloud.unidoc.io
	err := license.SetMeteredKey("your-metered-api-key")
	if err != nil {
		panic(err)
	}
}

func main() {
	//
	// The other blocks in your code
	//

	// Iterate through pages and print text.
	// GetPage takes a 1-based page number, so i is already the page number.
	for i := 1; i <= numPages; i++ {
		page, err := pdfReader.GetPage(i)
		if err != nil {
			log.Fatal(err)
		}
		ex, err := extractor.New(page)
		if err != nil {
			log.Fatal(err)
		}
		text, err := ex.ExtractText()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("------------------------------")
		fmt.Printf("Page %d:\n", i)
		// Println rather than Printf, so a '%' in the text is not treated as a format verb.
		fmt.Println(text)
		fmt.Println("------------------------------")
	}
}
Answer 2
Score: 2
I could not find a free, capable Go package for extracting text from PDFs. Luckily, there are some free CLI tools that can do this.
pdftotext from Xpdf is a promising choice. See its output:
$ pdftotext -layout -nopgbrk 2023-04-24_BU-12.pdf - | head
ALL INDIA TENNIS ASSOCIATION
As on 24TH April , 2023
BOY'S UNDER-12 2011 BEST BEST 25% BEST POINTS
24TH April , 2023 Eight Eight Eight CUT FOR TTL.
SING. DBLS. DBLS. NO SHOW PTS.
RANK NAME OF PLAYER REG NO. DOB STATE PTS. PTS. PTS. LATE WL Final
1 VIVAAN MIRDHA 432735 08-Apr-11 (RJ) 485 565 141.25 0 797
2 SMIT SACHIN UNDRE 437763 07-Feb-11 (MH) 435 480 120 0 664.25
3 RISHIKESH MANE 436806 15-Jan-11 (MH) 420 380 95 0 619
4 VIRAJ CHOUDHARY 436648 03-Feb-11 (DL) 415 420 105 0 598.75
On Ubuntu, pdftotext is provided by the poppler-utils package (Poppler's version of the tool, which is derived from Xpdf):
$ sudo apt install poppler-utils
It is easy to run it from a Go application with the os/exec package:
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
)

func main() {
	// See "man pdftotext" for more options.
	args := []string{
		"-layout",              // Maintain (as best as possible) the original physical layout of the text.
		"-nopgbrk",             // Don't insert page breaks (form feed characters) between pages.
		"2023-04-24_BU-12.pdf", // The input file.
		"-",                    // Send the output to stdout.
	}
	cmd := exec.CommandContext(context.Background(), "pdftotext", args...)

	var buf bytes.Buffer
	cmd.Stdout = &buf

	if err := cmd.Run(); err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println(buf.String())
}
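If page boundaries matter, one option is to drop -nopgbrk and split the output on the form feed characters that pdftotext then inserts between pages. The following is a minimal sketch under that assumption: runPdftotext and splitPages are hypothetical helper names, the 30-second timeout is an arbitrary choice, and the input path is taken from the command line.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// runPdftotext runs pdftotext on path and returns its stdout.
// Without -nopgbrk, pages are separated by form feed characters.
func runPdftotext(ctx context.Context, path string) (string, error) {
	cmd := exec.CommandContext(ctx, "pdftotext", "-layout", path, "-")
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		// Include stderr so the caller sees why pdftotext failed.
		return "", fmt.Errorf("pdftotext: %w: %s", err, strings.TrimSpace(stderr.String()))
	}
	return stdout.String(), nil
}

// splitPages splits pdftotext output into pages on form feeds,
// dropping a trailing empty piece if there is one.
func splitPages(out string) []string {
	pages := strings.Split(out, "\f")
	if n := len(pages); n > 0 && strings.TrimSpace(pages[n-1]) == "" {
		pages = pages[:n-1]
	}
	return pages
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: go run . file.pdf")
		return
	}
	// The timeout kills pdftotext if it hangs on a malformed file.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	out, err := runPdftotext(ctx, os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for i, page := range splitPages(out) {
		fmt.Printf("--- Page %d ---\n%s\n", i+1, page)
	}
}
```

Capturing stderr separately keeps the extracted text clean while still surfacing the tool's own error message when the command fails.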